[
  {
    "path": ".gitignore",
    "content": "*.pyc\npaddlepalm.egg-info\ndata\n__pycache__\n*egg-info\npretrain_model\npretrain\noutput*\noutput_model\nbuild\ndist\npaddle_palm.egg-info\nmrqa_output\n*.log\n"
  },
  {
    "path": "README.md",
    "content": "# PaddlePALM\n\nEnglish | [简体中文](./README_zh.md)\n\nPaddlePALM (PArallel Learning from Multi-tasks) is a fast, flexible, extensible and easy-to-use NLP large-scale pretraining and multi-task learning framework. PaddlePALM is a high level framework **aiming at fastly developing high-performance NLP models**. \n\nWith PaddlePALM, it is easy to achieve effecient exploration of robust learning of NLP models with multiple auxilary tasks. For example, based on PaddlePALM, the produced robust MRC model, [D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET), has achieved **the 1st place** in [EMNLP2019 MRQA](https://mrqa.github.io) track.\n\n<p align=\"center\">\n\t<img src=\"https://tva1.sinaimg.cn/large/006tNbRwly1gbjkuuwrmlj30hs0hzdh2.jpg\" alt=\"Sample\"  width=\"300\" height=\"333\">\n\t<p align=\"center\">\n\t\t<em>MRQA2019 Leaderboard</em>\n\t</p>\n</p>\n\nBeyond the research scope, PaddlePALM has been applied on **Baidu Search Engine** to seek for more accurate user query understanding and answer mining, which implies the high reliability and performance of PaddlePALM.\n\n#### Features:\n\n- **Easy-to-use:** with PALM, *8 steps* to achieve a typical NLP task. Moreover, all basic components (e.g., the model backbone, dataset reader, task output head, optimizer...) have been decoupled, which allows the replacement of any component to other candidates with quite minor changes of your code. \n- **Built-in Popular NLP Backbones and Pre-trained models:** multiple state-of-the-art general purpose model architectures and pretrained models (e.g., BERT,ERNIE,RoBERTa,...) are built-in. 
\n- **Easy to play Multi-task Learning:** only one API is needed for jointly training of several tasks with parameters reusement.\n- **Support train/eval with Multi-GPUs:** automatically recognize and adapt to multiple gpus mode to accelerate training and inference.\n- **Pre-training friendly:** self-supervised tasks (e.g., mask language model) are built-in to facilitate pre-training. Easy to train from scratch.\n- **Easy to Customize:** support customized development of any component (e.g, backbone, task head, reader and optimizer) with reusement of pre-defined ones, which gives developers high flexibility and effeciency to adapt for diverse NLP scenes. \n\nYou can easily re-produce following competitive results with minor codes, which covers most of NLP tasks such as classification, matching, sequence labeling, reading comprehension, dialogue understanding and so on. More details can be found in `examples`.\n\n<table>\n  <tbody>\n    <tr>\n      <th><strong>Dataset</strong>\n        <br></th>\n      <th colspan=\"2\"><center><strong>chnsenticorp</strong></center></th>\n      <th colspan=\"2\"><center><strong>Quora Question Pairs matching</strong><center></th>\n      <th colspan=\"1\"><strong>MSRA-NER<br>(SIGHAN2006)</strong></th>\n      <th colspan=\"2\"><strong>CMRC2018</strong></th>\n    </tr>\n    <tr>\n      <td rowspan=\"2\">\n        <p>\n          <strong>Metric</strong>\n          <br></p>\n      </td>\n      <td colspan=\"1\">\n        <center><strong>accuracy</strong></center>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        <center><strong>accuracy</strong></center>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        
<strong>em</strong>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <br></td>\n    </tr>\n    <tr>\n      <td colspan=\"2\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"2\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"1\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"2\" width=\"\">\n        <strong>dev</strong>\n        <br></td>\n    </tr>\n    <tr>\n      <td><strong>ERNIE Base</strong></td>\n      <td>95.8</td>\n      <td>95.8</td>\n      <td>86.2</td>\n      <td>82.2</td>\n      <td>99.2</td>\n      <td>64.3</td>\n      <td>85.2</td>\n    </tr>\n\n  </tbody>\n</table>\n\n\n\n## Overview\n\n<p align=\"center\">\n\t<img src=\"https://tva1.sinaimg.cn/large/0082zybply1gbyo8d4ltoj31ag0n3tby.jpg\" alt=\"Sample\"  width=\"600px\" height=\"auto\">\n\t<p align=\"center\">\n\t\t<em>Architecture Diagram</em>\n\t</p>\n</p>\n\nPaddlePALM is a well-designed high-level NLP framework. You can efficiently achieve **supervised learning, unsupervised/self-supervised learning, multi-task learning and transfer learning** with little code based on PaddlePALM. The PaddlePALM architecture has three layers, i.e., the component layer, the trainer layer and the high-level trainer layer, from bottom to top.\n\nIn the component layer, PaddlePALM supplies 6 **decoupled** components for building an NLP task. Each component contains rich pre-defined classes and a `Base` class. The pre-defined classes target typical NLP tasks, while the base class helps users develop new classes (based on pre-defined ones or on the base itself).\n\nThe trainer layer establishes a computation graph from the selected components and performs training and prediction. The training strategy, model saving and loading, and the evaluation and prediction procedures are defined in this layer. Note that a trainer can only process one task.
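To make the component/trainer layering concrete, here is a toy sketch in plain Python. These classes are simplified stand-ins rather than PaddlePALM's real implementations; only the method names (`register_with`, `fit_reader`, `train`) mirror the framework's documented API, and the 'embedding' and 'loss' computations are placeholders.

```python
# Toy sketch of the component/trainer split (NOT PaddlePALM's real classes).
# A reader yields input features, a backbone maps them to a text representation,
# a head turns that representation into a per-sample loss; one trainer = one task.

class ToyReader:
    def __init__(self, samples):
        self._samples = samples        # e.g. [('some text', 0), ...]
        self._backbone = None

    def register_with(self, backbone):
        # after registration the reader knows which features to produce
        self._backbone = backbone

    def iterator(self):
        for text, label in self._samples:
            yield {'token_ids': [ord(c) for c in text], 'label': label}


class ToyBackbone:
    def build(self, features):
        # stand-in 'sentence embedding': the mean token id
        toks = features['token_ids']
        return sum(toks) / float(len(toks))


class ToyHead:
    def build(self, sent_emb, label):
        # stand-in task head: a fake 0/1 prediction and absolute-error loss
        pred = 1 if sent_emb > 100.0 else 0
        return float(abs(pred - label))


class ToyTrainer:
    # a trainer wires exactly one reader, one backbone and one head together
    def __init__(self, backbone, head):
        self._backbone, self._head = backbone, head
        self._reader = None

    def fit_reader(self, reader):
        self._reader = reader

    def train(self):
        losses = []
        for feats in self._reader.iterator():
            emb = self._backbone.build(feats)                     # representation
            losses.append(self._head.build(emb, feats['label']))  # task loss
        return sum(losses) / len(losses)


reader = ToyReader([('good movie', 1), ('bad film', 0)])
backbone = ToyBackbone()
reader.register_with(backbone)   # couple reader and backbone
trainer = ToyTrainer(backbone, ToyHead())
trainer.fit_reader(reader)
avg_loss = trainer.train()
print(avg_loss)
```

Swapping `ToyHead` for a different head (or `ToyBackbone` for a different encoder) requires no change to the trainer, which is the point of the decoupled component layer.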
\n\nThe high-level trainer layer is for complicated learning and inference strategy, e.g., multi-task learning. You can add auxilary tasks to train robust NLP models (improve test set and out-of-domain performance of a model), or jointly training multiple related tasks to gain more performance for each task.\n\n| module | illustration | \n| - | - |\n| **paddlepalm** | an open source NLP pretraining and multitask learning framework, built on paddlepaddle. |\n| **paddlepalm.reader** | a collection of elastic task-specific dataset readers. |\n| **paddlepalm.backbone** | a collection of classic NLP representation models, e.g., BERT, ERNIE, RoBERTa. |\n| **paddlepalm.head** | a collection of task-specific output layers. |\n| **paddlepalm.lr_sched** | a collection of learning rate schedualers. |\n| **paddlepalm.optimizer** | a collection of optimizers. |\n| **paddlepalm.downloader** | a download module for pretrained models with configure and vocab files. |\n| **paddlepalm.Trainer** | the core unit to start a single task training/predicting session. A trainer is to build computation graph, manage training and evaluation process, achieve model/checkpoint saving and pretrain_model/checkpoint loading.|\n| **paddlepalm.MultiHeadTrainer** | the core unit to start a multi-task training/predicting session. A MultiHeadTrainer is built based on several Trainers. Beyond the inheritance of Trainer, it additionally achieves model backbone reuse across tasks, trainer sampling for multi-task learning, and multi-head inference for effective evaluation and prediction. |\n\n\n## Installation\n\nPaddlePALM support both python2 and python3, linux and windows, CPU and GPU. The preferred way to install PaddlePALM is via `pip`. 
Just run the following command in your shell.\n\n```bash\npip install paddlepalm\n```\n\n### Installing from source\n\n```shell\ngit clone https://github.com/PaddlePaddle/PALM.git\ncd PALM && python setup.py install\n```\n\n### Library Dependencies\n- Python >= 2.7\n- cuda >= 9.0\n- cudnn >= 7.0\n- PaddlePaddle >= 1.7.0 (Please refer to [this](http://www.paddlepaddle.org/#quick-start) to install)\n\n\n### Downloading pretrained models\nWe provide many pretrained models to initialize model backbone parameters. Training a big NLP model (e.g., a 12-layer Transformer) from a pretrained model is in practice much more effective than training from randomly initialized parameters. To see and download all the available pretrained models, run the following code in a Python interpreter (enter `python` in your shell):\n\n```python\n>>> from paddlepalm import downloader\n>>> downloader.ls('pretrain')\nAvailable pretrain items:\n  => RoBERTa-zh-base\n  => RoBERTa-zh-large\n  => ERNIE-v2-en-base\n  => ERNIE-v2-en-large\n  => XLNet-cased-base\n  => XLNet-cased-large\n  => ERNIE-v1-zh-base\n  => ERNIE-v1-zh-base-max-len-512\n  => BERT-en-uncased-large-whole-word-masking\n  => BERT-en-cased-large-whole-word-masking\n  => BERT-en-uncased-base\n  => BERT-en-uncased-large\n  => BERT-en-cased-base\n  => BERT-en-cased-large\n  => BERT-multilingual-uncased-base\n  => BERT-multilingual-cased-base\n  => BERT-zh-base\n\n>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')\n...\n```\n\n\n## Usage\n\n#### Quick Start\n\n8 steps to start a typical NLP training task.\n\n1. use `paddlepalm.reader` to create a *reader* for dataset loading and input feature generation, then call the `reader.load_data` method to load your training data.\n2. use `paddlepalm.backbone` to create a model *backbone* to extract text features (e.g., contextual word embeddings, sentence embeddings).\n3. register your *reader* with your *backbone* through the `reader.register_with` method. 
After this step, your reader can yield the input features used by the backbone.\n4. use `paddlepalm.head` to create a task output *head*. This head provides the task loss for training and prediction results for model inference.\n5. create a task *trainer* with `paddlepalm.Trainer`, then build the forward graph with the backbone and task head (created in steps 2 and 4) through `trainer.build_forward`.\n6. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `trainer.build_backward`.\n7. fit the prepared reader and data (from step 1) to the trainer with the `trainer.fit_reader` method.\n8. load a pretrained model with `trainer.load_pretrain`, or load a checkpoint with `trainer.load_ckpt`, or do nothing to train from scratch, then start training with `trainer.train`.\n\nFor more implementation details, see the following demos: \n\n- [Sentiment Classification](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)\n- [Question Pairs matching](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)\n- [Named Entity Recognition](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)\n- [SQuAD-like Machine Reading Comprehension](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc)\n\n\n#### Multi-task Learning\nTo run in multi-task learning mode:\n\n1. repeatedly create components (i.e., reader, backbone and head) for each task, following steps 1~5 above. \n2. create empty trainers (each trainer corresponds to one task) and pass them to create a `MultiHeadTrainer`. \n3. build the multi-task forward graph with the `multi_head_trainer.build_forward` method.\n4. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `multi_head_trainer.build_backward`.\n5. fit all prepared readers and data to the multi_head_trainer with the `multi_head_trainer.fit_readers` method.\n6. 
load a pretrained model with `multi_head_trainer.load_pretrain`, or load a checkpoint with `multi_head_trainer.load_ckpt`, or do nothing to train from scratch, then start training with `multi_head_trainer.train`.\n\nThe save/load and predict operations of a multi_head_trainer are the same as those of a trainer.\n\nFor more implementation details of `multi_head_trainer`, see\n\n- [ATIS: joint training of dialogue intent recognition and slot filling](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)\n\n#### Save models\n\nTo save models/checkpoints and logs during training, just call the `trainer.set_saver` method. For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples).\n\n#### Evaluation/Inference\nTo predict or evaluate after a training stage, just create another reader, backbone and head instance with `phase='predict'` (repeating steps 1~4 above). Then run prediction with the trainer's `predict` method (there is no need to create another trainer). For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict).\n\nIf you want to evaluate during the training process, use `trainer.train_one_step()` instead of `trainer.train()`. `trainer.train_one_step(batch)` trains only one step, so you can insert evaluation code at any point of the training process. The `batch` argument can be fetched from `trainer.get_one_batch`.\n\nPaddlePALM also supports multi-head inference; see `examples/multi-task/joint_predict.py`.\n\n#### Play with Multiple GPUs\nIf there are multiple GPUs in your environment, you can control the number and indices of the visible GPUs through the environment variable [CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/). 
For example, if there are 4 GPUs in your environment, indexed 0,1,2,3, you can run on GPU 2 only with the following command:\n\n```shell\nCUDA_VISIBLE_DEVICES=2 python run.py\n```\n\nMultiple GPU indices should be separated with `,`. For example, to run with GPU 2 and GPU 3, use the following command:\n\n```shell\nCUDA_VISIBLE_DEVICES=2,3 python run.py\n```\n\nIn multi-GPU mode, PaddlePALM automatically splits each batch onto the available cards. For example, if `batch_size` is set to 64 and there are 4 cards visible to PaddlePALM, then the batch_size on each card is actually 64/4=16. Therefore, when running with multiple cards, **you need to ensure that the configured batch_size is divisible by the number of cards.**\n\n## License\n\nThis tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).\n\n"
  },
  {
    "path": "README_zh.md",
    "content": "# PaddlePALM\n\n[English](./README.md) | 简体中文\n\nPaddlePALM (PArallel Learning from Multi-tasks) 是一个灵活，通用且易于使用的NLP大规模预训练和多任务学习框架。 PALM是一个旨在**快速开发高性能NLP模型**的上层框架。\n\n使用PaddlePALM，可以非常轻松灵活的探索具有多种任务辅助训练的“高鲁棒性”阅读理解模型，基于PALM训练的模型[D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET)在[EMNLP2019国际阅读理解评测](https://mrqa.github.io/)中夺得冠军。\n\n<p align=\"center\">\n\t<img src=\"https://tva1.sinaimg.cn/large/006tNbRwly1gbjkuuwrmlj30hs0hzdh2.jpg\" alt=\"Sample\"  width=\"300\" height=\"333\">\n\t<p align=\"center\">\n\t\t<em>MRQA2019 排行榜</em>\n\t</p>\n</p>\n\n除了降低NLP研究成本以外，PaddlePALM已被应用于“百度搜索引擎”，有效地提高了用户查询的理解准确度和挖掘出的答案质量，具备高可靠性和高训练/推理性能。\n\n#### 特点:\n\n- **易于使用**：使用PALM， *8个步骤*即可实现一个典型的NLP任务。此外，模型主干网络、数据集读取工具和任务输出层已经解耦，只需对代码进行相当小的更改，就可以将任何组件替换为其他候选组件。\n- **支持多任务学习**：*6个步骤*即可实现多任务学习任务。\n- **支持大规模任务和预训练**：可自动利用多gpu加速训练和推理。集群上的分布式训练需要较少代码。\n- **流行的NLP骨架和预训练模型**：内置多种最先进的通用模型架构和预训练模型(如BERT、ERNIE、RoBERTa等)。\n- **易于定制**：支持任何组件的定制开发(例如：主干网络，任务头，读取工具和优化器)与预定义组件的复用，这给了开发人员高度的灵活性和效率，以适应不同的NLP场景。\n\n你可以很容易地用较少的代码复现出很好的性能，涵盖了大多数NLP任务，如分类、匹配、序列标记、阅读理解、对话理解等等。更多细节可以在`examples`中找到。\n\n<table>\n  <tbody>\n    <tr>\n      <th><strong>数据集</strong>\n        <br></th>\n      <th colspan=\"2\"><center><strong>chnsenticorp</strong></center></th>\n      <th colspan=\"2\"><center><strong>Quora Question Pairs matching</strong><center></th>\n      <th colspan=\"1\"><strong>MSRA-NER<br>(SIGHAN2006)</strong></th>\n      <th colspan=\"2\"><strong>CMRC2018</strong></th>\n    </tr>\n    <tr>\n      <td rowspan=\"2\">\n        <p>\n          <strong>评价标准</strong>\n          <br></p>\n      </td>\n      <td colspan=\"1\">\n        <center><strong>accuracy</strong></center>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        <center><strong>accuracy</strong></center>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n     
   <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <strong></strong>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>em</strong>\n        <br></td>\n      <td colspan=\"1\">\n        <strong>f1-score</strong>\n        <br></td>\n    </tr>\n    <tr>\n      <td colspan=\"2\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"2\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"1\" width=\"\">\n        <strong>test</strong>\n        <br></td>\n      <td colspan=\"2\" width=\"\">\n        <strong>dev</strong>\n        <br></td>\n    </tr>\n    <tr>\n      <td><strong>ERNIE Base</strong></td>\n      <td>95.8</td>\n      <td>95.8</td>\n      <td>86.2</td>\n      <td>82.2</td>\n      <td>99.2</td>\n      <td>64.3</td>\n      <td>85.2</td>\n    </tr>\n\n  </tbody>\n</table>\n\n\n\n## Package概览\n\n<p align=\"center\">\n\t<img src=\"https://tva1.sinaimg.cn/large/0082zybply1gbyo8d4ltoj31ag0n3tby.jpg\" alt=\"Sample\"  width=\"600px\" height=\"auto\">\n\t<p align=\"center\">\n\t\t<em>PALM架构图</em>\n\t</p>\n</p>\n\n\nPaddlePALM是一个设计良好的高级NLP框架。基于PaddlePALM的轻量级代码可以高效实现**监督学习、非监督/自监督学习、多任务学习和迁移学习**。在PaddlePALM架构中有三层，从下到上依次是component层、trainer层、high-level trainer层。\n\n在组件层，PaddlePALM提供了6个 **解耦的**组件来实现NLP任务。每个组件包含丰富的预定义类和一个基类。预定义类是针对典型的NLP任务的，而基类是帮助用户开发一个新类（基于预定义类或基类）。\n\n训练器层是用选定的构件建立计算图，进行训练和预测。该层描述了训练策略、模型保存和加载、评估和预测过程。一个训练器只能处理一个任务。\n\n高级训练器层用于复杂的学习和推理策略，如多任务学习。您可以添加辅助任务来训练健壮的NLP模型（提高模型的测试集和领域外的性能），或者联合训练多个相关任务来获得每个任务的更高性能。\n\n\n| 模块 | 描述 | \n| - | - |\n| **paddlepalm** | 基于PaddlePaddle框架的high-level NLP预训练和多任务学习框架。 |\n| **paddlepalm.reader** | 预置的任务数据集读取与预处理工具。|\n| **paddlepalm.backbone** | 预置的主干网络，如BERT, ERNIE, RoBERTa。|\n| **paddlepalm.head** | 预置的任务输出层。|\n| **paddlepalm.lr_sched** | 预置的学习率规划策略。|\n| **paddlepalm.optimizer** | 预置的优化器。|\n| **paddlepalm.downloader** | 预训练模型管理与下载模块。|\n| **paddlepalm.Trainer** | 
任务训练/预测单元。训练器用于建立计算图，管理训练和评估过程，实现模型/检查点保存和pretrain_model/检查点加载等。|\n| **paddlepalm.MultiHeadTrainer** | 完成多任务训练/预测的模块。一个MultiHeadTrainer建立在几个Trainer的基础上。实现了模型主干网络跨任务复用、多任务学习、多任务推理等。|\n\n## 安装\n\nPaddlePALM 支持 python2 和 python3, linux 和 windows, CPU 和 GPU。安装PaddlePALM的首选方法是通过`pip`。只需运行以下命令：\n\n```bash\npip install paddlepalm\n```\n\n### 通过源码安装\n\n```shell\ngit clone https://github.com/PaddlePaddle/PALM.git\ncd PALM && python setup.py install\n```\n\n### 库依赖\n- Python >= 2.7\n- cuda >= 9.0\n- cudnn >= 7.0\n- PaddlePaddle >= 1.7.0 (请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装)\n\n\n### 下载预训练模型\n我们提供了许多预训练的模型来初始化模型主干网络参数。用预先训练好的模型训练大的NLP模型，如12层Transformer，实际上比用随机初始化的参数更有效。要查看所有可用的预训练模型并下载，请在python解释器中运行以下代码(在shell中输入命令`python`):\n\n```python\n>>> from paddlepalm import downloader\n>>> downloader.ls('pretrain')\nAvailable pretrain items:\n  => RoBERTa-zh-base\n  => RoBERTa-zh-large\n  => ERNIE-v2-en-base\n  => ERNIE-v2-en-large\n  => XLNet-cased-base\n  => XLNet-cased-large\n  => ERNIE-v1-zh-base\n  => ERNIE-v1-zh-base-max-len-512\n  => BERT-en-uncased-large-whole-word-masking\n  => BERT-en-cased-large-whole-word-masking\n  => BERT-en-uncased-base\n  => BERT-en-uncased-large\n  => BERT-en-cased-base\n  => BERT-en-cased-large\n  => BERT-multilingual-uncased-base\n  => BERT-multilingual-cased-base\n  => BERT-zh-base\n\n>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')\n...\n```\n\n\n## 使用\n\n#### 快速开始\n\n8个步骤开始一个典型的NLP训练任务。\n\n1. 使用`paddlepalm.reader`为数据集加载和输入特征生成创建一个`reader`，然后调用`reader.load_data`方法加载训练数据。\n2. 使用`paddlepalm.backbone`创建一个模型*主干网络*来提取文本特征(例如，上下文单词嵌入，句子嵌入)。\n3. 通过`reader.register_with`将`reader`注册到主干网络上。在这一步之后，reader能够产生主干网络所需的输入特征。\n4. 使用`paddlepalm.head`创建一个任务*head*，可以为训练提供任务损失，为模型推理提供预测结果。\n5. 使用`paddlepalm.Trainer`创建一个任务`Trainer`，然后通过`trainer.build_forward`构建包含主干网络和任务头的前向图(在步骤2和步骤4中创建)。\n6. 使用`paddlepalm.optimizer`（如果需要，创建`paddlepalm.lr_sched`）来创建一个*优化器*，然后通过`trainer.build_backward`构建反向计算图。\n7. 
使用`trainer.fit_reader`将准备好的reader和数据（在步骤1中实现）给到trainer。\n8. 使用`trainer.load_pretrain`加载预训练模型，或使用`trainer.load_ckpt`加载checkpoint，或不加载任何已训练好的参数，然后使用`trainer.train`进行训练。\n\n更多实现细节请见示例: \n\n- [情感分析](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)\n- [Quora问题相似度匹配](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)\n- [命名实体识别](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)\n- [类SQuAD机器阅读理解](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc)\n\n\n#### 多任务学习\n\n多任务学习模式下运行:\n\n1. 重复创建组件（每个任务按照上述第1~5步执行）。\n2. 创建空的`Trainer`(每个`Trainer`对应一个任务)，并通过它们创建一个`MultiHeadTrainer`。\n3. 使用`multi_head_trainer.build_forward`构建多任务前向图。\n4. 使用`paddlepalm.optimizer`（如果需要，创建`paddlepalm.lr_sched`）来创建一个*optimizer*，然后通过`multi_head_trainer.build_backward`构建反向计算图。\n5. 使用`multi_head_trainer.fit_readers`将所有准备好的读取器和数据放入`multi_head_trainer`中。\n6. 使用`multi_head_trainer.load_pretrain`加载预训练模型，或使用`multi_head_trainer.load_ckpt`加载checkpoint，或不加载任何已经训练好的参数，然后使用`multi_head_trainer.train`进行训练。\n\nmulti_head_trainer的保存/加载和预测操作与trainer相同。\n\n\n更多实现`multi_head_trainer`的细节，请见\n\n- [ATIS: 对话意图识别和插槽填充的联合训练](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)\n\n#### 设置saver\n\n在训练时保存 models/checkpoints 和 logs，调用 `trainer.set_saver` 方法。更多实现细节见[这里](https://github.com/PaddlePaddle/PALM/tree/master/examples)。\n\n#### 评估/预测\n训练结束后进行预测和评价，只需创建额外的reader、backbone和head（重复上面1~4步骤），注意创建时需设`phase='predict'`。然后使用trainer的`predict`方法进行预测（不需创建额外的trainer）。更多实现细节请见[这里](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict)。\n\n#### 使用多GPU\n如果您的环境中存在多个GPU，您可以通过环境变量[CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/)控制这些GPU的数量和索引。例如，如果您的环境中有4个GPU，索引为0、1、2、3，那么您可以运行以下命令来只使用GPU2：\n\n```shell\nCUDA_VISIBLE_DEVICES=2 python run.py\n```\n\n多个GPU索引之间使用`,`分隔。例如，使用GPU2和GPU3，运行以下命令：\n\n```shell\nCUDA_VISIBLE_DEVICES=2,3 python run.py\n```\n\n在多GPU模式下，PaddlePALM会自动将每个batch数据分配到可用的GPU上。例如，如果`batch_size`设置为64，并且有4个GPU可以用于PaddlePALM，那么每个GPU中的batch_size实际上是64/4=16。因此，**当使用多个GPU时，您需要确保batch_size可以被可见GPU卡数整除**。\n\n\n## 许可证书\n\n此向导由[PaddlePaddle](https://github.com/PaddlePaddle/Paddle)贡献，受[Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE)许可认证。\n"
  },
  {
    "path": "customization_cn.md",
    "content": "# PALM组件定制化教程\n\nPALM支持对如下组件自定义：\n\n- **head**\n  定义一个新的任务输出头，接收来自backbone和reader的输入，输出训练阶段的loss和预测阶段的预测结果。例如：分类任务头，序列标注任务头，机器阅读理解任务头等。\n- **backbone**\n  定义一个新的主干网络，接收来自reader的文本相关的序列特征输入（如token ids），输出文本的特征向量表示（如词向量、上下文相关的词向量表示、句子向量等）。例如：BERT encoder，CNN encoder等。\n- **reader**\n  定义一个新的数据集载入与预处理模块，接收来自原始数据集文件的输入（纯文本，原始标签等），输出文本相关的序列特征（如token ids，position ids等）。例如：文本分类数据集处理模块；文本匹配数据集处理模块等。\n- **optimizer**\n  定义一个新的优化器\n- **lr_sched**\n  定义一种新的学习率规划策略\n\nPALM中的每个组件均使用类来描述，因此可以允许存在内部记忆（成员变量）。\n\n新增某种类型的组件时，只需要实现该组件类型所在目录下的接口类中所描述的方法。若希望新增的组件跟框架的某个内置组件功能相似，那么实现新增组件时，可以继承自已有的内置组件，且仅对需要变动的方法进行修改即可。\n\n### head自定义\n\nhead的接口类（Interface）位于`paddlepalm/head/base_head.py`。\n\n该接口类定义如下：\n\n```python\n# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport os\nimport json\nimport copy\n\nclass Head(object):\n\n    def __init__(self, phase='train'):\n        \"\"\"该函数完成一个任务头的构造，至少需要包含一个phase参数。\n        注意：实现该构造函数时，必须保证对基类构造函数的调用，以创建必要的框架内建的成员变量。\n        Args:\n            phase: str类型。用于区分任务头被调用时所处的任务运行阶段，目前支持训练阶段train和预测阶段predict\n            \"\"\"\n        self._stop_gradient = {}\n        self._phase = phase\n        self._prog = None\n        self._results_buffer = []\n\n    @property\n    def inputs_attrs(self):\n        \"\"\"step级别的任务输入对象声明。\n\n        描述该任务头所依赖的reader、backbone和来自其他任务头的输出对象（每个step获取一次）。使用字典进行描述，\n        
字典的key为输出对象所在的组件（如“reader”，“backbone”等），value为该组件下任务头所需要的输出对象集。\n        输出对象集使用字典描述，key为输出对象的名字（该名字需保证在相关组件的输出对象集中），value为该输出对象\n        的shape和dtype。当某个输出对象的某个维度长度可变时，shape中的相应维度设置为-1。\n        Return:\n            dict类型。描述该任务头所依赖的step级输入，即来自各个组件的输出对象。\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"step级别的任务输出对象声明。\n        描述该任务头的输出对象（每个step输出一次），包括每个输出对象的名字，shape和dtype。输出对象会被加入到\n        fetch_list中，从而在每个训练/推理step时得到实时的计算结果，该计算结果可以传入batch_postprocess方\n        法中进行当前step的后处理。当某个对象为标量数据类型（如str, int, float等）时，shape设置为空列表[]，\n        当某个对象的某个维度长度可变时，shape中的相应维度设置为-1。\n\n        Return:\n            dict类型。描述该任务头所产生的输出对象。注意，在训练阶段时必须包含名为loss的输出对象。\n            \"\"\"\n\n        raise NotImplementedError()\n\n    @property\n    def epoch_inputs_attrs(self):\n        \"\"\"epoch级别的任务输入对象声明。\n        描述该任务所依赖的来自reader、backbone和来自其他任务头的输出对象（每个epoch结束后产生一次），如完整的\n        样本集，有效的样本数等。使用字典进行描述，字典的key为输出对象所在的组件（如“reader”，“backbone”等），\n        value为该组件下任务头所需要的输出对象集。输出对象集使用字典描述，key为输出对象的名字（该名字需保证在相关\n        组件的输出对象集中），value为该输出对象的shape和dtype。当某个输出对象的某个维度长度可变时，shape中的相\n        应维度设置为-1。\n        \n        Return:\n            dict类型。描述该任务头所依赖的epoch级输入，即每个epoch结束后来自各个组件的输出对象。\n        \"\"\"\n        return {}\n\n    def build(self, inputs, scope_name=\"\"):\n        \"\"\"建立任务头的计算图。\n\n        将符合inputs_attrs描述的来自各个对象集的静态图Variables映射成符合outputs_attr描述的静态图Variable输出。\n        Args:\n            inputs: dict类型。字典中包含inputs_attrs中的对象名到计算图Variable的映射，inputs中至少会包含inputs_attr中定义的对象\n        Return:\n           需要输出的计算图变量，输出对象会被加入到fetch_list中，从而在每个训练/推理step时得到runtime的计算结果，该计算结果会被传入postprocess方法中供用户处理。\n        \"\"\"\n        raise NotImplementedError()\n\n    def batch_postprocess(self, rt_outputs):\n        \"\"\"batch/step级别的后处理。\n\n        每个训练或推理step后针对当前batch的任务头输出对象的实时计算结果来进行相关后处理。\n        默认将输出结果存储到缓冲区self._results_buffer中。\"\"\"\n        if isinstance(rt_outputs, dict):\n            keys = rt_outputs.keys()\n         
   vals = [rt_outputs[k] for k in keys]\n            lens = [len(v) for v in vals]\n            if len(set(lens)) == 1:\n                results = [dict(zip(keys, i)) for i in zip(*vals)]\n                self._results_buffer.extend(results)\n                return results\n            else:\n                print('WARNING: irregular output results. visualize failed.')\n                self._results_buffer.append(rt_outputs)\n        return None\n\n    def reset(self):\n        \"\"\"清空该任务头的缓冲区（在训练或推理过程中积累的处理结果）\"\"\"\n        self._results_buffer = []\n\n    def get_results(self):\n        \"\"\"返回当前任务头积累的处理结果。\"\"\"\n        return copy.deepcopy(self._results_buffer)\n        \n    def epoch_postprocess(self, post_inputs=None, output_dir=None):\n        \"\"\"epoch级别的后处理。\n\n        每个训练或推理epoch结束后，对积累的各样本的后处理结果results进行后处理。默认情况下，当output_dir为None时，直接将results打印到\n        屏幕上。当指定output_dir时，将results存储在指定的文件夹内，并以任务头所处阶段来作为存储文件的文件名。\n\n        Args:\n            post_inputs: 当声明的epoch_inputs_attr不为空时，该参数会携带对应的输入变量的内容。\n            output_dir: 积累结果的保存路径。\n        \"\"\"\n        if output_dir is None:\n            for i in self._results_buffer:\n                print(i)\n        else:\n            if not os.path.exists(output_dir):\n                os.makedirs(output_dir)\n            with open(os.path.join(output_dir, self._phase), 'w') as writer:\n                for i in self._results_buffer:\n                    writer.write(json.dumps(i)+'\\n')\n```\n\n\n\n在基类的基础上，定义一个全新的Head时需要至少实现的方法有：\n\n- \\_\\_init\\_\\_\n- inputs_attrs\n- outputs_attr\n- build\n\n可以重写的方法有：\n\n- epoch_inputs_attrs\n- batch_postprocess\n- epoch_postprocess\n\n### backbone自定义\n\nbackbone的接口类（Interface）位于`paddlepalm/backbone/base_backbone.py`。\n\n该接口类定义如下：\n\n```python\n# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nclass Backbone(object):\n    \"\"\"interface of backbone model.\"\"\"\n\n    def __init__(self, phase):\n        \"\"\"该函数完成一个主干网络的构造，至少需要包含一个phase参数。\n        注意：实现该构造函数时，必须保证对基类构造函数的调用，以创建必要的框架内建的成员变量。\n        Args:\n            phase: str类型。用于区分主干网络被调用时所处的运行阶段，目前支持训练阶段train和预测阶段predict\n            \"\"\"\n\n        self._phase = phase\n\n    @property\n    def inputs_attr(self):\n        \"\"\"描述backbone从reader处需要得到的输入对象的属性，包含各个对象的名字、shape以及数据类型。当某个对象\n        为标量数据类型（如str, int, float等）时，shape设置为空列表[]，当某个对象的某个维度长度可变时，shape\n        中的相应维度设置为-1。\n\n        Return:\n            dict类型。对各个输入对象的属性描述。例如，\n            对于文本分类和匹配任务，bert backbone依赖的reader对象主要包含如下的对象\n                {\"token_ids\": ([-1, max_len], 'int64'),\n                 \"input_ids\": ([-1, max_len], 'int64'),\n                 \"segment_ids\": ([-1, max_len], 'int64'),\n                 \"input_mask\": ([-1, max_len], 'float32')}\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"描述backbone输出对象的属性，包含各个对象的名字、shape以及数据类型。当某个对象为标量数据类型（如\n        str, int, float等）时，shape设置为空列表[]，当某个对象的某个维度长度可变时，shape中的相应维度设置为-1。\n        \n        Return:\n            dict类型。对各个输出对象的属性描述。例如，\n            对于文本分类和匹配任务，bert backbone的输出内容可能包含如下的对象\n                {\"word_emb\": ([-1, max_seqlen, word_emb_size], 'float32'),\n                 \"sentence_emb\": ([-1, hidden_size], 
'float32'),\n                 \"sim_vec\": ([-1, hidden_size], 'float32')}\"\"\" \n        raise NotImplementedError()\n\n    def build(self, inputs):\n        \"\"\"建立backbone的计算图。将符合inputs_attr描述的静态图Variable输入映射成符合outputs_attr描述的静态图Variable输出。\n        Args:\n            inputs: dict类型。字典中包含inputs_attr中的对象名到计算图Variable的映射，inputs中至少会包含inputs_attr中定义的对象\n        Return:\n           需要输出的计算图变量，输出对象会被加入到fetch_list中，从而在每个训练/推理step时得到runtime的计算结果，该计算结果会被传入postprocess方法中供用户处理。\n            \"\"\"\n        raise NotImplementedError()\n```\n\n\n\n在基类的基础上，定义一个全新的Backbone时需要至少实现的方法有：\n\n- \\_\\_init\\_\\_\n- inputs_attr\n- outputs_attr\n- build\n\n### reader自定义\n\nreader的接口类（Interface）位于`paddlepalm/reader/base_reader.py`。\n\n该接口类定义如下：\n\n```python\n# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom copy import copy\n\nclass Reader(object):\n    \"\"\"interface of data reader.\"\"\"\n\n    def __init__(self, phase='train'):\n        \"\"\"该函数完成一个Reader的构造，至少需要包含一个phase参数。\n        注意：实现该构造函数时，必须保证对基类构造函数的调用，以创建必要的框架内建的成员变量。\n        Args:\n            phase: str类型。用于区分Reader被调用时所处的运行阶段，目前支持训练阶段train和预测阶段predict\n            \"\"\"\n        \n        self._phase = phase\n        self._batch_size = None\n        self._num_epochs = 1\n        self._register = set()\n        self._registered_backbone = None\n\n    @classmethod\n    def create_register(cls):\n        return set()\n   
     \n    def clone(self, phase='train'):\n        \"\"\"拷贝一个新的reader对象。\"\"\"\n        if phase == self._phase:\n            return copy(self)\n        else:\n            ret = copy(self)\n            ret._phase = phase\n            return ret\n\n    def require_attr(self, attr_name):\n        \"\"\"在注册器中新增一个需要产生的对象。\n\n        Args:\n            attr_name: 需要产出的对象的对象名，例如’segment_ids‘。\n            \"\"\"\n        self._register.add(attr_name)\n            \n    def register_with(self, backbone):\n        \"\"\"根据backbone对输入对象的依赖，在注册器中对每个依赖的输入对象进行注册。\n\n        Args:\n            backbone: 需要对接的主干网络。\n        \"\"\"\n        for attr in backbone.inputs_attr:\n            self.require_attr(attr)\n        self._registered_backbone = backbone\n\n    def get_registered_backbone(self):\n        \"\"\"返回该reader所注册的backbone。\"\"\"\n        return self._registered_backbone\n\n    def _get_registed_attrs(self, attrs):\n        ret = {}\n        for i in self._register:\n            if i not in attrs:\n                raise NotImplementedError('output attr {} is not found in this reader.'.format(i))\n            ret[i] = attrs[i]\n        return ret\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='tsv', shuffle_train=True):\n        \"\"\"将磁盘上的数据载入到reader中。\n\n        注意：实现该方法时需要同步创建self._batch_size和self._num_epochs。\n\n        Args:\n            input_file: 数据集文件路径。文件格式需要满足`file_format`参数的要求。\n            batch_size: 迭代器每次yield出的样本数量。注意：当环境中存在多个GPU时，batch_size需要保证被GPU卡数整除。\n            num_epochs: 数据集遍历次数。默认为None, 在单任务模式下代表遍历一次，在多任务模式下该参数会被上层的Trainer进行自动赋值。该参数仅对训练阶段有效。\n            file_format: 输入文件的文件格式。目前支持的格式: tsv. 
默认为tsv.\n            shuffle_train: 是否打乱训练集中的样本。默认为True。该参数仅对训练阶段有效。\n        \"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"描述reader输出对象（被yield出的对象）的属性，包含各个对象的名字、shape以及数据类型。当某个对象为标量数据\n        类型（如str, int, float等）时，shape设置为空列表[]，当某个对象的某个维度长度可变时，shape中的相应维度设置为-1。\n        注意：当使用mini-batch梯度下降学习策略时，，应为常规的输入对象设置batch_size维度（一般为-1）\n        Return:\n            dict类型。对各个输入对象的属性描述。例如，\n            对于文本分类和匹配任务，yield的输出内容可能包含如下的对象（下游backbone和task可按需访问其中的对象）\n                {\"token_ids\": ([-1, max_len], 'int64'),\n                 \"input_ids\": ([-1, max_len], 'int64'),\n                 \"segment_ids\": ([-1, max_len], 'int64'),\n                 \"input_mask\": ([-1, max_len], 'float32'),\n                 \"label\": ([-1], 'int')}\n        \"\"\"\n        raise NotImplementedError()\n    \n    def _iterator(self):\n        \"\"\"数据集遍历接口，注意，当数据集遍历到尾部时该接口应自动完成指针重置，即重新从数据集头部开始新的遍历。\n        Yield:\n            dict类型。符合outputs_attr描述的当前step的输出对象。\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_epoch_outputs(self):\n        \"\"\"返回数据集每个epoch遍历后的输出对象。\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def num_examples(self):\n        \"\"\"数据集中的样本数量，即每个epoch中iterator所生成的样本数。注意，使用滑动窗口等可能导致数据集样本数发生变化的策略时\n        该接口应返回runtime阶段的实际样本数。\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def num_epochs(self):\n        \"\"\"数据集遍历次数\"\"\"\n        return self._num_epochs\n```\n\n\n\n在基类的基础上，定义一个全新的Reader时需要至少实现的方法有：\n\n- \\_\\_init\\_\\_\n- outputs_attr\n- load_data\n- _iterator\n- num_examples\n\n可以重写的方法有：\n\n- get_epoch_outputs\n\n"
  },
  {
    "path": "examples/classification/README.md",
    "content": "## Example 1: Classification\nThis example runs a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task.\n\n### Step 1: Prepare Pre-trained Model & Dataset\n\n#### Pre-trained Model\n\nThe pre-trained model used in this task is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).\n\nMake sure you have downloaded the required pre-trained model into the current folder.\n\n\n#### Dataset\n\nThis example uses [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.\n\nDownload the dataset:\n```shell\npython download.py\n```\n\nIf everything goes well, a folder named `data/` will be created with all the data files in it.\n\nThe dataset file (for training) should have 2 fields, `text_a` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:\n\n```\nlabel  text_a\n0   当当网名不符实，订货多日不见送货，询问客服只会推托，只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。\n0   XP的驱动不好找！我的17号提的货，现在就降价了100元，而且还送杀毒软件！\n1   <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!\n```\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to use a specific GPU, or multiple GPUs, for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: in multi-GPU mode, PaddlePALM automatically splits each batch across the available cards. For example, if `batch_size` is set to 64 and there are 4 cards visible to PaddlePALM, then the actual batch_size on each card is 64/4=16. 
If you want to change the `batch_size` or the number of GPUs used in the example, **make sure that the configured batch_size is divisible by the number of cards.**\n\n\nSome training logs are shown below:\n\n```\nstep 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s\nstep 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s\nstep 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s\n```\n\n\nAfter the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:\n\n\n```\n{\"index\": 0, \"logits\": [-0.2014336884021759, 0.6799028515815735], \"probs\": [0.29290086030960083, 0.7070990800857544], \"label\": 1}\n{\"index\": 1, \"logits\": [0.8593899011611938, -0.29743513464927673], \"probs\": [0.7607553601264954, 0.23924466967582703], \"label\": 0}\n{\"index\": 2, \"logits\": [0.7462944388389587, -0.7083730101585388], \"probs\": [0.8107157349586487, 0.18928426504135132], \"label\": 0}\n```\n\n### Step 3: Evaluate\n\nOnce you have the predictions, you can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata num: 1200\naccuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578\n```\n"
  },
  {
    "path": "examples/classification/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://ernie.bj.bcebos.com/task_data_zh.tgz\"\ndownload_path = os.path.join(os.path.dirname(abs_path), \"task_data_zh.tgz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(download_path, download_url)\n\ntar = tarfile.open(download_path)\ntar.extractall(target_dir)\ntar.close()\nos.remove(download_path)\n\ndst_dir = os.path.join(target_dir, \"data\")\nif not os.path.isdir(dst_dir):\n    os.makedirs(dst_dir)\n\nfor fname in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):\n    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', fname), dst_dir)\n\nshutil.rmtree(os.path.join(target_dir, 'task_data'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/classification/evaluate.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    return (preds == labels).mean()\n\ndef pre_recall_f1(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    tp = np.sum((labels == '1') & (preds == '1'))\n    fp = np.sum((labels == '0') & (preds == '1'))\n    fn = np.sum((labels == '1') & (preds == '0'))\n    epsilon = 1e-31\n    # recall = TP / (TP + FN)\n    r = tp * 1.0 / (tp + fn + epsilon)\n    # precision = TP / (TP + FP)\n    p = tp * 1.0 / (tp + fp + epsilon)\n    f1 = 2 * p * r / (p + r + epsilon)\n    return p, r, f1\n\n\ndef res_evaluate(res_dir=\"./outputs/predict/predictions.json\", eval_phase='test'):\n    if eval_phase == 'test':\n        data_dir = \"./data/test.tsv\"\n    elif eval_phase == 'dev':\n        data_dir = \"./data/dev.tsv\"\n    else:\n        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'\n\n    labels = []\n    with open(data_dir, \"r\") as file:\n        for line in file:\n            line = line.split(\"\\t\")\n            label = line[0]\n            if label == 'label':  # skip the tsv header\n                continue\n            labels.append(str(label))\n\n    preds = []\n    with open(res_dir, \"r\") as file:\n        for line in file:\n            line = json.loads(line)\n            preds.append(str(line['label']))\n\n    assert len(labels) == len(preds), \"prediction result doesn't match to labels\"\n    print('data num: {}'.format(len(labels)))\n    p, r, f1 = pre_recall_f1(preds, labels)\n    print(\"accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}\".format(accuracy(preds, labels), p, r, f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/classification/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 256\n    batch_size = 8\n    num_epochs = 10\n    lr = 5e-5\n    weight_decay = 0.01\n    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'\n\n    train_file = './data/train.tsv'\n    predict_file = './data/test.tsv'\n    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n    num_classes = 2\n    dropout_prob = 0.1\n    random_seed = 1\n    task_name = 'chnsenticorp'\n    save_path = './outputs/'\n    pred_output = './outputs/predict/'\n    save_type = 'ckpt'\n    print_steps = 20\n    pre_params = './pretrain/ERNIE-v1-zh-base/params'\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers for training\n    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)\n    # step 1-2: load the training data\n    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)\n\n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register the backbone in reader\n    cls_reader.register_with(ernie)\n\n    # step 4: create the task output head\n    cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob)\n\n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    loss_var = trainer.build_forward(ernie, cls_head)\n\n    # step 6-1*: use warmup\n    n_steps = cls_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    # step 6-2: create an optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n    # step 6-3: build backward\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n  \n  
  # step 7: fit prepared reader and data\n    trainer.fit_reader(cls_reader)\n    \n    # step 8-1*: load pretrained parameters\n    trainer.load_pretrain(pre_params)\n    # step 8-2*: set saver to save model\n    # save_steps = n_steps \n    save_steps = 2396\n    trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type)\n    # step 8-3: start training\n    trainer.train(print_steps=print_steps)\n   \n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')\n    # step 1-2: load the prediction data\n    predict_cls_reader.load_data(predict_file, batch_size)\n    \n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_cls_reader.register_with(pred_ernie)\n    \n    # step 4: create the task output head\n    cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')\n    \n    # step 5: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, cls_pred_head)\n \n    # step 6: load checkpoint\n    # model_path = './outputs/ckpt.step'+str(save_steps)\n    model_path = './outputs/ckpt.step'+str(11980)\n    trainer.load_ckpt(model_path)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_cls_reader, phase='predict')\n\n    # step 8: predict\n    print('predicting...')\n    trainer.predict(print_steps=print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/matching/README.md",
    "content": "## Example 2: Matching\nThis example runs a sentence pair matching task. The following sections detail model preparation, dataset preparation, and how to run the task with PaddlePALM.\n\n### Step 1: Prepare Pre-trained Models & Datasets\n\n#### Download Pre-trained Model\n\nThe pre-trained model used in this task is: [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).\n\nMake sure you have downloaded the required pre-trained model into the current folder.\n\n\n#### Dataset\n\nThis example takes the [Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset as the testbed for matching.\n\nDownload the dataset:\n```shell\npython download.py\n```\n\nAfter the dataset is downloaded, you should convert the data format for training:\n```shell\npython process.py data/quora_duplicate_questions.tsv data/train.tsv data/test.tsv\n```\n\nIf everything goes well, a folder named `data/` will be created with all the converted data in it.\n\nThe dataset file (for training) should have 3 fields, `text_a`, `text_b` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:\n\n```\ntext_a  text_b  label\nHow can the arrangement of corynebacterium xerosis be described?  How would you describe waves? 0\nHow do you fix a Google Play Store account that isn't working?  What can cause the Google Play store to not open? How are such probelms fixed?  1\nWhich is the best earphone under 1000?  What are the best earphones under 1k? 1\nWhat are the differences between the Dell Inspiron 3000, 5000, and 7000 series laptops? \"Should I buy an Apple MacBook Pro 15\"\" or a Dell Inspiron 17 5000 series?\" 0\n```\n\n\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. 
If you have prepared the pre-trained model and the dataset required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to use a specific GPU, or multiple GPUs, for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: in multi-GPU mode, PaddlePALM automatically splits each batch across the available cards. For example, if `batch_size` is set to 64 and there are 4 cards visible to PaddlePALM, then the actual batch_size on each card is 64/4=16. If you want to change the `batch_size` or the number of GPUs used in the example, **make sure that the configured batch_size is divisible by the number of cards.**\n\nSome training logs are shown below:\n\n```\nstep 20/49087 (epoch 0), loss: 1.079, speed: 3.48 steps/s\nstep 40/49087 (epoch 0), loss: 1.251, speed: 5.18 steps/s\nstep 60/49087 (epoch 0), loss: 1.193, speed: 5.04 steps/s\n```\n\n\nAfter the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:\n\n\n```\n{\"index\": 0, \"logits\": [-0.32688724994659424, -0.8568955063819885], \"probs\": [0.629485011100769, 0.3705149292945862], \"label\": 0}\n{\"index\": 1, \"logits\": [-0.2735646963119507, -0.7983021140098572], \"probs\": [0.6282548904418945, 0.37174513936042786], \"label\": 0}\n{\"index\": 2, \"logits\": [-0.3381381630897522, -0.8614270091056824], \"probs\": [0.6279165148735046, 0.37208351492881775], \"label\": 0}\n```\n\n### Step 3: Evaluate\n\nOnce you have the predictions, you can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata num: 4300\naccuracy: 0.8619, precision: 0.8061, recall: 0.8377, f1: 0.8216\n```\n"
  },
  {
    "path": "examples/matching/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\n\nabs_path = os.path.abspath(__file__)\ndata_dir = os.path.join(os.path.dirname(abs_path), \"data\")\nif not os.path.isdir(data_dir):\n    os.makedirs(data_dir)\n\ndownload_url = \"http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv\"\ndownload_path = os.path.join(data_dir, \"quora_duplicate_questions.tsv\")\ndownload(download_path, download_url)\nprint(\" done!\")\n"
  },
  {
    "path": "examples/matching/evaluate.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    return (preds == labels).mean()\n\ndef pre_recall_f1(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    tp = np.sum((labels == '1') & (preds == '1'))\n    fp = np.sum((labels == '0') & (preds == '1'))\n    fn = np.sum((labels == '1') & (preds == '0'))\n    epsilon = 1e-31\n    # recall = TP / (TP + FN)\n    r = tp * 1.0 / (tp + fn + epsilon)\n    # precision = TP / (TP + FP)\n    p = tp * 1.0 / (tp + fp + epsilon)\n    f1 = 2 * p * r / (p + r + epsilon)\n    return p, r, f1\n\n\ndef res_evaluate(res_dir=\"./outputs/predict/predictions.json\", eval_phase='test'):\n    if eval_phase == 'test':\n        data_dir = \"./data/test.tsv\"\n    elif eval_phase == 'dev':\n        data_dir = \"./data/dev.tsv\"\n    else:\n        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'\n\n    labels = []\n    with open(data_dir, \"r\") as file:\n        for line in file:\n            line = line.strip(\"\\n\").split(\"\\t\")\n            label = line[2]\n            if label == 'label':  # skip the tsv header\n                continue\n            labels.append(str(label))\n\n    preds = []\n    with open(res_dir, \"r\") as file:\n        for line in file:\n            line = json.loads(line)\n            preds.append(str(line['label']))\n\n    assert len(labels) == len(preds), \"prediction result({}) doesn't match to labels({})\".format(len(preds), len(labels))\n    print('data num: {}'.format(len(labels)))\n    p, r, f1 = pre_recall_f1(preds, labels)\n    print(\"accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}\".format(accuracy(preds, labels), p, r, f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/matching/process.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport sys\nimport os\n\nif len(sys.argv) != 4:\n    print(\"usage: python process.py <input_tsv> <train_tsv> <test_tsv>\")\n    exit(0)\n\ndata_dir = sys.argv[1]\nif not os.path.exists(data_dir):\n    print(\"%s not exists\" % data_dir)\n    exit(0)\n\ntrain_dir = sys.argv[2]\ntrain_file = open(train_dir, \"w\")\ntrain_file.write(\"text_a\\ttext_b\\tlabel\\n\")\n\ntest_dir = sys.argv[3]\ntest_file = open(test_dir, \"w\")\ntest_file.write(\"text_a\\ttext_b\\tlabel\\n\")\nwith open(data_dir, \"r\") as file:\n    part = \"\"\n    flag = 0\n    cnt = 0\n    for line in file:\n        line = line.strip(\"\\n\")\n        # a record whose text spans several physical lines is merged\n        # with the buffered first part before being parsed\n        if flag:\n            flag = 0\n            line = \"{}{}\".format(part, line)\n        line_t = line.split(\"\\t\")\n        if len(line_t) < 6:\n            # incomplete record: buffer it and merge with the next line\n            flag = 1\n            part = line\n            continue\n        out_line = \"{}\\t{}\\t{}\\n\".format(line_t[3], line_t[4], line_t[5])\n        cnt += 1\n\n        if 2 <= cnt <= 4301:\n            test_file.write(out_line)\n        elif 4301 < cnt <= 104301:\n            train_file.write(out_line)\n\ntrain_file.close()\ntest_file.close()\n"
  },
  {
    "path": "examples/matching/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n    # configs \n    max_seqlen = 128\n    batch_size = 16 \n    num_epochs = 3\n    lr = 3e-5\n    weight_decay = 0.0\n    num_classes = 2\n    random_seed = 1\n    dropout_prob = 0.1\n    save_path = './outputs/'\n    save_type = 'ckpt'\n    pred_model_path = './outputs/ckpt.step'+str(18732)\n    print_steps = 50\n    pred_output = './outputs/predict/'\n    pre_params = './pretrain/ERNIE-v2-en-base/params'\n    task_name = 'Quora Question Pairs matching'\n\n    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'\n    train_file = './data/train.tsv'\n    predict_file = './data/test.tsv'\n    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers for training\n    match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed)\n    # step 1-2: load the training data\n    match_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)\n    \n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register the backbone in reader\n    match_reader.register_with(ernie)\n    \n    # step 4: create the task output head\n    match_head = palm.head.Match(num_classes, input_dim, dropout_prob)\n \n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    loss_var = trainer.build_forward(ernie, match_head)\n    \n    # step 6-1*: use warmup\n    n_steps = match_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    print('total_steps: {}'.format(n_steps))\n    print('warmup_steps: {}'.format(warmup_steps))\n    sched = 
palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n\n    # step 6-2: create an optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n    # step 6-3: build backward\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n    \n    # step 7: fit prepared reader and data\n    trainer.fit_reader(match_reader)\n\n    # step 8-1*: load pretrained parameters\n    trainer.load_pretrain(pre_params, False)\n    # step 8-2*: set saver to save model\n    # save_steps = n_steps-16\n    save_steps = 6244\n    trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)\n    # step 8-3: start training\n    trainer.train(print_steps=print_steps)\n     \n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')\n    # step 1-2: load the prediction data\n    predict_match_reader.load_data(predict_file, batch_size)\n\n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_match_reader.register_with(pred_ernie)\n    \n    # step 4: create the task output head\n    match_pred_head = palm.head.Match(num_classes, input_dim, phase='predict')\n\n    # step 5: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, match_pred_head)\n\n    # step 6: load checkpoint\n    trainer.load_ckpt(pred_model_path)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_match_reader, phase='predict')\n    \n    # step 8: predict\n    print('predicting...')\n    trainer.predict(print_steps=print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/mrc/README.md",
    "content": "## Example 4: Machine Reading Comprehension\nThis example runs a machine reading comprehension task. The following sections detail model preparation, dataset preparation, and how to run the task.\n\n### Step 1: Prepare Pre-trained Models & Datasets\n\n#### Pre-trained Model\n\nThe pre-trained model used in this task is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).\n\nMake sure you have downloaded the required pre-trained model into the current folder.\n\n\n#### Dataset\n\nThis task uses the `CMRC2018` dataset. `CMRC2018` is an evaluation held by the Chinese Information Processing Society of China; its task is span-extraction machine reading comprehension.\n\nDownload the dataset:\n```shell\npython download.py\n```\n\nIf everything goes well, a folder named `data/` will be created with all the data files in it.\n\nHere is an example record:\n\n ```json\n\"paragraphs\": [\n         {\n           \"id\": \"TRAIN_36\",\n           \"context\": \"NGC 6231是一个位于天蝎座的疏散星团，天球座标为赤经16时54分，赤纬-41度48分，视觉观测大小约45角分，亮度约2.6视星等，距地球5900光年。NGC 6231年龄约为三百二十万年，是一个非常年轻的星团，星团内的最亮星是5等的天蝎座 ζ1星。用双筒望远镜或小型望远镜就能看到个别的行星。NGC 6231在1654年被意大利天文学家乔瓦尼·巴蒂斯特·霍迪尔纳（Giovanni Battista Hodierna）以Luminosae的名字首次纪录在星表中，但是未见记载于夏尔·梅西耶的天体列表和威廉·赫歇尔的深空天体目录。这个天体在1678年被爱德蒙·哈雷（I.7）、1745年被夏西亚科斯（Jean-Phillippe Loys de Cheseaux）（9）、1751年被尼可拉·路易·拉卡伊（II.13）分别再次独立发现。\",\n           \"qas\": [\n             {\n               \"question\": \"NGC 6231的经纬度是多少？\",\n               \"id\": \"TRAIN_36_QUERY_0\",\n               \"answers\": [\n                 {\n                   \"text\": \"赤经16时54分，赤纬-41度48分\",\n                   \"answer_start\": 27\n                 }\n               ]\n             }\n           ]\n         }\n       ]\n ```\n\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. 
If you have prepared the pre-trained model and the dataset required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to use a specific GPU, or multiple GPUs, for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: in multi-GPU mode, PaddlePALM automatically splits each batch across the available cards. For example, if `batch_size` is set to 64 and there are 4 cards visible to PaddlePALM, then the actual batch_size on each card is 64/4=16. If you want to change the `batch_size` or the number of GPUs used in the example, **make sure that the configured batch_size is divisible by the number of cards.**\n\nSome training logs are shown below:\n\n```\nstep 1/1515 (epoch 0), loss: 6.251, speed: 0.31 steps/s\nstep 2/1515 (epoch 0), loss: 6.206, speed: 0.80 steps/s\nstep 3/1515 (epoch 0), loss: 6.172, speed: 0.86 steps/s\n```\n\n\nAfter the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. 
Here are some examples of predictions:\n\n\n```json\n{\n    \"DEV_0_QUERY_0\": \"光 荣 和 ω-force 开 发\", \n    \"DEV_0_QUERY_1\": \"任 天 堂 游 戏 谜 之 村 雨 城\", \n    \"DEV_0_QUERY_2\": \"战 史 演 武 」&「 争 霸 演 武 」。\", \n    \"DEV_1_QUERY_0\": \"大 陆 传 统 器 乐 及 戏 曲 里 面 常 用 的 打 击 乐 记 谱 方 法 ， 以 中 文 字 的 声 音 模 拟 敲 击 乐 的 声 音 ， 纪 录 打 击 乐 的 各 种 不 同 的 演 奏 方 法 。\", \n    \"DEV_1_QUERY_1\": \"「 锣 鼓 点\", \n    \"DEV_1_QUERY_2\": \"锣 鼓 的 运 用 有 约 定 俗 成 的 程 式 ， 依 照 角 色 行 当 的 身 份 、 性 格 、 情 绪 以 及 环 境 ， 配 合 相 应 的 锣 鼓 点\", \n    \"DEV_1_QUERY_3\": \"鼓 、 锣 、 钹 和 板 四 类 型\", \n    \"DEV_2_QUERY_0\": \"364.6 公 里\", \n}\n```\n\n### Step 3: Evaluate\n\n#### Library Dependencies\nBefore the evaluation, you need to install `nltk` and download the `punkt` tokenizer for nltk:\n\n```shell\npip install nltk\npython -m nltk.downloader punkt\n```\n\n#### Evaluate\nYou can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata_num: 3219\nem_sroce: 0.6434, f1: 0.8518\n```\n"
  },
  {
    "path": "examples/mrc/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://ernie.bj.bcebos.com/task_data_zh.tgz\"\ndownload_path = os.path.join(os.path.dirname(abs_path), \"task_data_zh.tgz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(download_path, download_url)\n\ntar = tarfile.open(download_path)\ntar.extractall(target_dir)\ntar.close()\nos.remove(download_path)\n\ndst_dir = os.path.join(target_dir, \"data\")\nif not os.path.isdir(dst_dir):\n    os.makedirs(dst_dir)\n\nfor fname in os.listdir(os.path.join(target_dir, 'task_data', 'cmrc2018')):\n    shutil.move(os.path.join(target_dir, 'task_data', 'cmrc2018', fname), dst_dir)\n\nshutil.rmtree(os.path.join(target_dir, 'task_data'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/mrc/evaluate.py",
    "content": "# -*- coding: utf-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n'''\nEvaluation script for CMRC 2018\nversion: v5\nNote:\nv5 formatted output, add usage description\nv4 fixed segmentation issues\n'''\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\nfrom __future__ import absolute_import\n\nfrom collections import Counter, OrderedDict\nimport string\nimport re\nimport argparse\nimport json\nimport sys\nimport nltk\nimport pdb\n\n\n# split Chinese with English\ndef mixed_segmentation(in_str, rm_punc=False):\n    in_str = in_str.lower().strip()\n    segs_out = []\n    temp_str = \"\"\n    sp_char = [\n        '-', ':', '_', '*', '^', '/', '\\\\', '~', '`', '+', '=', '，', '。', '：',\n        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',\n        '）', '－', '～', '『', '』',' '\n    ]\n    for char in in_str:\n        if rm_punc and char in sp_char:\n            continue\n        if re.search(r'[\\u4e00-\\u9fa5]', char) or char in sp_char:\n            if temp_str != \"\":\n                ss = nltk.word_tokenize(temp_str)\n                segs_out.extend(ss)\n                temp_str = \"\"\n            segs_out.append(char)\n        else:\n            temp_str += char\n\n    #handling last part\n    if temp_str != \"\":\n        ss = 
nltk.word_tokenize(temp_str)\n        segs_out.extend(ss)\n\n    return segs_out\n\n\n# remove punctuation\ndef remove_punctuation(in_str):\n    in_str = in_str.lower().strip()\n    sp_char = [\n        '-', ':', '_', '*', '^', '/', '\\\\', '~', '`', '+', '=', '，', '。', '：',\n        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',\n        '）', '－', '～', '『', '』', ' '\n    ]\n    out_segs = []\n    for char in in_str:\n        if char in sp_char:\n            continue\n        else:\n            out_segs.append(char)\n    return ''.join(out_segs)\n\n\n# find longest common string\ndef find_lcs(s1, s2):\n    m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]\n    mmax = 0\n    p = 0\n    for i in range(len(s1)):\n        for j in range(len(s2)):\n            if s1[i] == s2[j]:\n                m[i + 1][j + 1] = m[i][j] + 1\n                if m[i + 1][j + 1] > mmax:\n                    mmax = m[i + 1][j + 1]\n                    p = i + 1\n    return s1[p - mmax:p], mmax\n\n\ndef evaluate(ground_truth_file, prediction_file):\n    f1 = 0\n    em = 0\n    total_count = 0\n    skip_count = 0\n    for instances in ground_truth_file[\"data\"]:\n        for instance in instances[\"paragraphs\"]:\n            context_text = instance['context'].strip()\n            for qas in instance['qas']:\n                total_count += 1\n                query_id = qas['id'].strip()\n                query_text = qas['question'].strip()\n                answers = [ans[\"text\"] for ans in qas[\"answers\"]]\n\n                if query_id not in prediction_file:\n                    print('Unanswered question: {}\\n'.format(\n                        query_id))\n                    skip_count += 1\n                    continue\n\n                prediction = prediction_file[query_id]\n                f1 += calc_f1_score(answers, prediction)\n                em += calc_em_score(answers, prediction)\n\n    f1_score = f1 / total_count\n    em_score = em 
/ total_count\n    return f1_score, em_score, total_count, skip_count\n\n\ndef calc_f1_score(answers, prediction):\n    f1_scores = []\n    for ans in answers:\n        ans_segs = mixed_segmentation(ans, rm_punc=True)\n        prediction_segs = mixed_segmentation(prediction, rm_punc=True)\n        lcs, lcs_len = find_lcs(ans_segs, prediction_segs)\n        if lcs_len == 0:\n            f1_scores.append(0)\n            continue\n        precision = 1.0 * lcs_len / len(prediction_segs)\n        recall = 1.0 * lcs_len / len(ans_segs)\n        f1 = (2 * precision * recall) / (precision + recall)\n        f1_scores.append(f1)\n    return max(f1_scores)\n\n\ndef calc_em_score(answers, prediction):\n    em = 0\n    for ans in answers:\n        ans_ = remove_punctuation(ans)\n        prediction_ = remove_punctuation(prediction)\n        if ans_ == prediction_:\n            em = 1\n            break\n    return em\n\n\ndef eval_file(dataset_file, prediction_file):\n    with open(dataset_file, 'r') as f:\n        ground_truth = json.load(f)\n    with open(prediction_file, 'r') as f:\n        predictions = json.load(f)\n    F1, EM, TOTAL, SKIP = evaluate(ground_truth, predictions)\n    AVG = (EM + F1) * 0.5\n    return EM, F1, AVG, TOTAL\n\n\nif __name__ == '__main__':\n    EM, F1, AVG, TOTAL = eval_file(\"data/dev.json\", \"outputs/predict/predictions.json\")\n    print('data_num: {}'.format(TOTAL))\n    print('em_score: {:.4f}, f1: {:.4f}'.format(EM, F1))\n"
  },
  {
    "path": "examples/mrc/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 512\n    batch_size = 8   \n    num_epochs = 2\n    lr = 3e-5\n    doc_stride = 128\n    max_query_len = 64\n    max_ans_len = 128\n    weight_decay = 0.01\n    print_steps = 20\n    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'\n    do_lower_case = True\n\n    train_file = './data/train.json'\n    predict_file = './data/dev.json'\n    save_path = './outputs/'\n    pred_output = './outputs/predict/'\n    save_type = 'ckpt'\n    task_name = 'cmrc2018'\n    pre_params = './pretrain/ERNIE-v1-zh-base/params'\n    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers for training\n    mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case)\n    # step 1-2: load the training data\n    mrc_reader.load_data(train_file, file_format='json', num_epochs=num_epochs, batch_size=batch_size)\n\n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register the backbone in reader\n    mrc_reader.register_with(ernie)\n\n    # step 4: create the task output head\n    mrc_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len)\n \n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    loss_var = trainer.build_forward(ernie, mrc_head)\n    \n    # step 6-1*: use warmup\n    n_steps = mrc_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    # step 6-2: create a optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n  
  # step 6-3: build backward\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(mrc_reader)\n\n    # step 8-1*: load pretrained parameters\n    trainer.load_pretrain(pre_params)\n    # step 8-2*: set saver to save model\n    save_steps = 3040\n    trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)\n    # step 8-3: start training\n    trainer.train(print_steps=print_steps)\n\n    # -----------------------  for prediction -----------------------\n\n    # step 1-1: create readers for prediction\n    predict_mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case, phase='predict')\n    # step 1-2: load the prediction data\n    predict_mrc_reader.load_data(predict_file, batch_size)\n\n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_mrc_reader.register_with(pred_ernie)\n\n    # step 4: create the task output head\n    mrc_pred_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len, phase='predict')\n\n    # step 5: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, mrc_pred_head)\n\n    # step 6: load checkpoint\n    pred_model_path = './outputs/ckpt.step' + str(save_steps)\n    trainer.load_ckpt(pred_model_path)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_mrc_reader, phase='predict')\n\n    # step 8: predict\n    print('predicting..')\n    trainer.predict(print_steps=print_steps, output_dir=\"outputs/predict\")\n"
  },
  {
    "path": "examples/multi-task/README.md",
    "content": "## Example 6: Joint Training of Dialogue Intent Recognition and Slot Filling\nThis example achieves the joint training ofg Dialogue Intent Recognition and Slot Filling. The intent recognition can be regared as a text classification task, and slot filling as sequence labeling task. Both classification and sequence labeling have been built-in in PaddlePALM.\n\n### Step 1: Prepare Pre-trained Models & Datasets\n\n#### Pre-trained Model\n\nWe prepare [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as our pre-trained model for this example.\n\nMake sure you have downloaded `ERNIE` to current folder.\n\n#### Dataset\n\nHere we use `Airline Travel Information System` dataset as our testbed. \n\nDownload dataset:\n```shell\npython download.py\n```\n\nAfter the dataset is downloaded, you should convert the data format for training:\n```shell\npython process.py\n```\n\nIf everything goes well, there will be a folder named `data/atis/`  created with all the datas in it.\n\nHere is some example datas:\n\n`data/atis/atis_slot/train.tsv` :\n```\ntext_a\tlabel\ni want to fly from boston at 838 am and arrive in denver at 1110 in the morning \tO O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day \nwhat flights are available from pittsburgh to baltimore on thursday morning \tO O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day \nwhat is the arrival time in san francisco for the 755 am flight leaving washington \tO O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name \ncheapest airfare from tacoma to orlando \tB-cost_relative O O B-fromloc.city_name O B-toloc.city_name \n```\n\n`data/atis/atis_intent/train.tsv` :\n```\nlabel\ttext_a\n0\ti want to fly from boston at 838 am and arrive in denver at 1110 in the 
morning\n0\twhat flights are available from pittsburgh to baltimore on thursday morning\n1\twhat is the arrival time in san francisco for the 755 am flight leaving washington\n2\tcheapest airfare from tacoma to orlando\n```\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**\n\nSome logs will be shown below:\n\n```\nglobal step: 5,   slot: step 3/309 (epoch 0), loss: 68.965, speed: 0.58 steps/s\nglobal step: 10, intent: step 3/311 (epoch 0), loss: 3.407, speed: 8.76 steps/s\nglobal step: 15,   slot: step 12/309 (epoch 0), loss: 54.611, speed: 1.21 steps/s\nglobal step: 20, intent: step 7/311 (epoch 0), loss: 3.487, speed: 10.28 steps/s\n```\n\n\nAfter the run, you can view the saved models in the `outputs/` folder.\n\n\nIf you want to use the trained model to predict the `atis_slot & atis_intent` data, run:\n\n```shell\npython predict-slot.py\npython predict-intent.py\n```\n\nIf you want to specify a specific gpu or use multiple gpus for predict, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python predict-slot.py\nCUDA_VISIBLE_DEVICES=0,1 python predict-intent.py\n```\n\nNote: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. 
For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**\n\nAfter the run, you can view the predictions in the `outputs/predict-slot` folder and `outputs/predict-intent` folder. Here are some examples of predictions:\n\n`atis_slot`:\n```\n[129, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 1, 1, 1, 1, 21, 21, 68, 129]\n[129, 1, 39, 37, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 23, 3, 4, 129, 129, 129, 129, 129]\n[129, 1, 39, 37, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 129, 129, 129, 129, 129, 129, 129, 129]\n[129, 1, 1, 1, 1, 1, 1, 14, 15, 1, 2, 1, 5, 19, 1, 39, 37, 129, 24, 129, 129, 129]\n```\n\n`atis_intent`:\n```\n{\"index\": 0, \"logits\": [9.938603401184082, -0.3914794623851776, -0.050973162055015564, -1.0229418277740479, 0.04799401015043259, -0.9632213115692139, -0.6427211761474609, -1.337939739227295, -0.7969412803649902, -1.4441455602645874, -0.6339573264122009, -1.0393054485321045, -0.9242327213287354, -1.9637483358383179, 0.16733427345752716, -0.5280354619026184, -1.7195699214935303, -2.199411630630493, -1.2833174467086792, -1.3081035614013672, -1.6036226749420166, -1.8527079820632935, -2.289180040359497, -2.267214775085449, -2.2578916549682617, -2.2010505199432373], \"probs\": [0.999531626701355, 3.26210938510485e-05, 4.585415081237443e-05, 1.7348344044876285e-05, 5.06243304698728e-05, 1.8415948943584226e-05, 2.5373808966833167e-05, 1.266065828531282e-05, 2.174747896788176e-05, 1.1384962817828637e-05, 2.5597169951652177e-05, 1.7066764485207386e-05, 1.914815220516175e-05, 6.771284006390488e-06, 5.70411684748251e-05, 2.8457265216275118e-05, 8.644025911053177e-06, 5.349628736439627e-06, 1.3371440218179487e-05, 1.3044088518654462e-05, 9.706698619993404e-06, 7.5665011536329985e-06, 4.890325726591982e-06, 4.99892985317274e-06, 
5.045753368904116e-06, 5.340866664482746e-06], \"label\": 0}\n{\"index\": 1, \"logits\": [0.8863624930381775, -2.232290506362915, 8.191509246826172, -0.03161466494202614, -0.9149583578109741, -2.172696352005005, -0.3937145471572876, -0.3954394459724426, 1.5333592891693115, 0.8630291223526001, -0.9684226512908936, -2.722721815109253, -0.0060247331857681274, -0.9865402579307556, 1.6328885555267334, 0.3972966969013214, 0.27919167280197144, -1.4911551475524902, -0.9552251696586609, -0.9169244170188904, -0.810670793056488, -1.5118697881698608, -2.0140435695648193, -1.6299077272415161, -1.8589974641799927, -2.07601261138916], \"probs\": [0.0006675600307062268, 2.9517297662096098e-05, 0.9932880997657776, 0.0002665741485543549, 0.0001102013120544143, 3.132982965325937e-05, 0.00018559220188762993, 0.00018527248175814748, 0.0012749042361974716, 0.0006521637551486492, 0.00010446414671605453, 1.8075270418194123e-05, 0.0002734838053584099, 0.00010258861584588885, 0.0014083238784223795, 0.00040934717981144786, 0.00036374686169438064, 6.193659646669403e-05, 0.00010585198469925672, 0.00010998480865964666, 0.0001223145518451929, 6.0666847275570035e-05, 3.671637750812806e-05, 5.391232480178587e-05, 4.287416595616378e-05, 3.4510172554291785e-05], \"label\": 0}\n{\"index\": 2, \"logits\": [9.789957046508789, -0.1730862706899643, -0.7198237776756287, -1.0460278987884521, 0.23521068692207336, -0.5075851678848267, -0.44724929332733154, -1.2945927381515503, -0.6984466314315796, -1.8749892711639404, -0.4631594121456146, -0.6256799697875977, -1.0252169370651245, -1.951456069946289, -0.17572557926177979, -0.6771697402000427, -1.7992591857910156, -2.1457295417785645, -1.4203097820281982, -1.4963451623916626, -1.692310094833374, -1.9219486713409424, -2.2533645629882812, -2.430952310562134, -2.3094685077667236, -2.2399914264678955], \"probs\": [0.9994625449180603, 4.708383130491711e-05, 2.725377635215409e-05, 1.9667899323394522e-05, 7.082601223373786e-05, 3.3697724575176835e-05, 
3.579350595828146e-05, 1.5339375750045292e-05, 2.784266871458385e-05, 8.58508519741008e-06, 3.522853512549773e-05, 2.9944207199150696e-05, 2.0081495677004568e-05, 7.953084605105687e-06, 4.695970710599795e-05, 2.8441407266655006e-05, 9.26048778637778e-06, 6.548832516273251e-06, 1.3527245755540207e-05, 1.2536826943687629e-05, 1.030578732752474e-05, 8.19125762063777e-06, 5.880556273041293e-06, 4.923717369820224e-06, 5.559719284065068e-06, 5.9597273320832755e-06], \"label\": 0}\n{\"index\": 3, \"logits\": [9.787659645080566, -0.6223222017288208, -0.03971472755074501, -1.038114070892334, 0.24018540978431702, -0.8904737830162048, -0.7114139795303345, -1.2315020561218262, -0.5120854377746582, -1.4273980855941772, -0.44618460536003113, -1.0241562128067017, -0.9727545380592346, -1.8587366342544556, 0.020689941942691803, -0.6228570342063904, -1.6020199060440063, -2.130260467529297, -1.370570421218872, -1.40530526638031, -1.6782578229904175, -1.94076669216156, -2.2038567066192627, -2.336832284927368, -2.268157720565796, -2.140028953552246], \"probs\": [0.9994485974311829, 3.0113611501292326e-05, 5.392447565100156e-05, 1.986949791898951e-05, 7.134198676794767e-05, 2.303065048181452e-05, 2.7546762794372626e-05, 1.6375688574044034e-05, 3.362310235388577e-05, 1.3462414244713727e-05, 3.591357381083071e-05, 2.0148761905147694e-05, 2.12115264730528e-05, 8.74570196174318e-06, 5.728216274292208e-05, 3.0097504350123927e-05, 1.1305383850412909e-05, 6.666126409982098e-06, 1.4249604646465741e-05, 1.3763145034317859e-05, 1.0475521776243113e-05, 8.056933438638225e-06, 6.193143690325087e-06, 5.422014055511681e-06, 5.807448815176031e-06, 6.601325367228128e-06], \"label\": 0}\n```\n\n### Step 3: Evaluate\n\nOnce you have the prediction, you can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate-slot.py\npython evaluate-intent.py\n```\n\nThe evaluation results are as follows:\n\n`atis_slot`:\n```\ndata num: 891\nf1: 0.8934\n```\n\n`atis_intent`:\n```\ndata num: 
893\naccuracy: 0.7088, precision: 1.0000, recall: 1.0000, f1: 1.0000\n```\n"
  },
  {
    "path": "examples/multi-task/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz\"\ndownlaod_path = os.path.join(os.path.dirname(abs_path), \"dmtk_data_1.0.0.tar.gz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(downlaod_path, download_url)\n\ntar = tarfile.open(downlaod_path)\ntar.extractall(target_dir)\nos.remove(downlaod_path)\n\nshutil.rmtree(os.path.join(target_dir, 'data/dstc2/'))\nshutil.rmtree(os.path.join(target_dir, 'data/mrda/'))\nshutil.rmtree(os.path.join(target_dir, 'data/multi-woz/'))\nshutil.rmtree(os.path.join(target_dir, 'data/swda/'))\nshutil.rmtree(os.path.join(target_dir, 'data/udc/'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/multi-task/evaluate_intent.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels) \n    return (preds == labels).mean()\n  \ndef pre_recall_f1(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    # recall=TP/(TP+FN)\n    tp = np.sum((labels == '1') & (preds == '1'))\n    fp = np.sum((labels == '0') & (preds == '1'))\n    fn = np.sum((labels == '1') & (preds == '0'))\n    r = tp * 1.0 / (tp + fn)\n    # Precision=TP/(TP+FP)\n    p = tp * 1.0 / (tp + fp)\n    epsilon = 1e-31\n    f1 = 2 * p * r / (p+r+epsilon)\n    return p, r, f1\n\n\ndef res_evaluate(res_dir=\"./outputs/predict-intent/predictions.json\", eval_phase='test'):\n    if eval_phase == 'test':\n        data_dir=\"./data/atis/atis_intent/test.tsv\"\n    elif eval_phase == 'dev':\n        data_dir=\"./data/dev.tsv\"\n\n    else:\n        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'\n    \n    labels = []\n    with open(data_dir, \"r\") as file:\n        first_flag = True\n        for line in file:\n            line = line.split(\"\\t\")\n            label = line[0]\n            if label=='label':\n                continue\n            labels.append(str(label))\n    file.close()\n\n    preds = []\n    with open(res_dir, \"r\") as file:\n        for line in file.readlines():\n            line = json.loads(line)\n            pred = line['label']\n            preds.append(str(pred))\n    file.close()\n    assert len(labels) == len(preds), \"prediction result doesn't match to labels\"\n    print('data num: {}'.format(len(labels)))\n    p, r, f1 = pre_recall_f1(preds, labels)\n    print(\"accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}\".format(accuracy(preds, labels), p, r, f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/multi-task/evaluate_slot.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\n\n\ndef load_label_map(map_dir=\"./data/atis/atis_slot/label_map.json\"):\n    \"\"\"\n    :param map_dir: dict indictuing chunk type\n    :return:\n    \"\"\"\n    return json.load(open(map_dir, \"r\"))\n\n\ndef cal_chunk(pred_label, refer_label):\n    tp = dict()\n    fn = dict()\n    fp = dict()\n    for i in range(len(refer_label)):\n        if refer_label[i] == pred_label[i]:\n            if refer_label[i] not in tp:\n                tp[refer_label[i]] = 0\n            tp[refer_label[i]] += 1\n        else:\n            if pred_label[i] not in fp:\n                fp[pred_label[i]] = 0\n            fp[pred_label[i]] += 1\n            if refer_label[i] not in fn:\n                fn[refer_label[i]] = 0\n            fn[refer_label[i]] += 1\n\n    tp_total = sum(tp.values())\n    fn_total = sum(fn.values())\n    fp_total = sum(fp.values())\n    p_total = float(tp_total) / (tp_total + fp_total)\n    r_total = float(tp_total) / (tp_total + fn_total)\n    f_micro = 2 * p_total * r_total / (p_total + r_total)\n\n    return f_micro\n\n\ndef res_evaluate(res_dir=\"./outputs/predict-slot/predictions.json\", data_dir=\"./data/atis/atis_slot/test.tsv\"):\n    label_map = load_label_map()\n\n    total_label = []\n    with open(data_dir, \"r\") as file:\n        first_flag = True\n        for line in file:\n            if first_flag:\n                first_flag = False\n                continue\n            line = line.strip(\"\\n\")\n            if len(line) == 0:\n                continue\n            line = line.split(\"\\t\")\n            if len(line) < 2:\n                continue\n            labels = line[1][:-1].split(\"\\x02\")\n            total_label.append(labels)\n    total_label = [[label_map[j] for j in i] for i in total_label]\n\n    total_res = []\n    with open(res_dir, \"r\") as file:\n        cnt = 0\n        for line in file:\n            line = line.strip(\"\\n\")\n            if len(line) == 
0:\n                continue\n            try:\n                res_arr = json.loads(line)\n\n                if len(total_label[cnt]) < len(res_arr):\n                    total_res.append(res_arr[1: 1 + len(total_label[cnt])])\n                elif len(total_label[cnt]) == len(res_arr):\n                    total_res.append(res_arr)\n                else:\n                    total_res.append(res_arr)\n                    total_label[cnt] = total_label[cnt][: len(res_arr)]\n            except:\n                print(\"json format error: {}\".format(cnt))\n                print(line)\n\n            cnt += 1\n\n    total_res_equal = []\n    total_label_equal = []\n    assert len(total_label) == len(total_res), \"prediction result doesn't match to labels\"\n    for i in range(len(total_label)):\n        num = len(total_label[i])\n        total_label_equal.extend(total_label[i])\n        total_res[i] = total_res[i][:num]\n        total_res_equal.extend(total_res[i])\n\n    f1 = cal_chunk(total_res_equal, total_label_equal)\n    print('data num: {}'.format(len(total_label)))\n    print(\"f1: {:.4f}\".format(f1))\n\n\nres_evaluate()\n"
  },
  {
    "path": "examples/multi-task/joint_predict.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nimport numpy as np\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 128\n    batch_size = 128\n    num_epochs = 20\n    print_steps = 5\n    lr = 2e-5\n    num_classes = 130\n    weight_decay = 0.01\n    num_classes_intent = 26\n    dropout_prob = 0.1\n    random_seed = 0\n    label_map = './data/atis/atis_slot/label_map.json'\n    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'\n\n    train_slot = './data/atis/atis_slot/train.tsv'\n    train_intent = './data/atis/atis_intent/train.tsv'\n\n    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers \n    slot_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')\n    intent_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')\n\n    # step 1-2: load train data\n    slot_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size)\n    intent_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None)\n\n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register readers with ernie backbone\n    slot_reader.register_with(ernie)\n    intent_reader.register_with(ernie)\n\n    # step 4: create task output heads\n    slot_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob, phase='predict')\n    intent_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob, phase='predict')\n   \n    # step 5-1: create task trainers and multiHeadTrainer\n    trainer_slot = palm.Trainer(\"slot\", mix_ratio=1.0)\n    trainer_intent = palm.Trainer(\"intent\", mix_ratio=1.0)\n    trainer = 
palm.MultiHeadTrainer([trainer_slot, trainer_intent])\n    # step 5-2: build forward graph with backbone and task head\n    trainer_intent.build_predict_forward(ernie, intent_head)\n    trainer_slot.build_predict_forward(ernie, slot_head)\n    trainer.build_predict_forward()\n\n    # load checkpoint\n    trainer.load_ckpt('outputs/ckpt.step300')\n\n    # merge inference readers\n    joint_iterator = trainer.merge_inference_readers([slot_reader, intent_reader])\n\n    # for test\n    # batch = next(joint_iterator('slot'))\n    # results = trainer.predict_one_batch('slot', batch)\n    # batch = next(joint_iterator('intent'))\n    # results = trainer.predict_one_batch('intent', batch)\n\n    # predict slot filling\n    print('processing slot filling examples...')\n    print('num examples: ' + str(slot_reader.num_examples))\n    cnt = 0\n    for batch in joint_iterator('slot'):\n        cnt += len(trainer.predict_one_batch('slot', batch)['logits'])\n        if cnt % 1000 <= 128:\n            print(str(cnt) + 'th example processed.')\n    print(str(cnt) + 'th example processed.')\n\n    # predict intent recognition\n    print('processing intent recognition examples...')\n    print('num examples: ' + str(intent_reader.num_examples))\n    cnt = 0\n    for batch in joint_iterator('intent'):\n        cnt += len(trainer.predict_one_batch('intent', batch)['logits'])\n        if cnt % 1000 <= 128:\n            print(str(cnt) + 'th example processed.')\n    print(str(cnt) + 'th example processed.')\n\n"
  },
  {
    "path": "examples/multi-task/predict_intent.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nfrom paddlepalm.distribute import gpu_dev_count\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 256\n    batch_size = 16\n    num_epochs = 6 \n    print_steps = 5\n    num_classes = 26\n    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'\n    predict_file = './data/atis/atis_intent/test.tsv'\n    save_path = './outputs/'\n    pred_output = './outputs/predict-intent/'\n    save_type = 'ckpt'\n    random_seed = 0\n    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n\n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')\n    # step 1-2: load the training data\n    predict_cls_reader.load_data(predict_file, batch_size)\n    \n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_cls_reader.register_with(pred_ernie)\n    \n    # step 4: create the task output head\n    cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')\n    \n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(\"intent\")\n    # step 5-2: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, cls_pred_head)\n \n    # step 6: load checkpoint\n    pred_model_path = './outputs/ckpt.step4641'\n    trainer.load_ckpt(pred_model_path)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_cls_reader, phase='predict')\n\n    # step 8: predict\n    print('predicting..')\n    trainer.predict(print_steps=print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/multi-task/predict_slot.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nfrom paddlepalm.distribute import gpu_dev_count\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 256\n    batch_size = 16\n    num_epochs = 6 \n    print_steps = 5\n    num_classes = 130\n    label_map = './data/atis/atis_slot/label_map.json'\n    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'\n    predict_file = './data/atis/atis_slot/test.tsv'\n    save_path = './outputs/'\n    pred_output = './outputs/predict-slot/'\n    save_type = 'ckpt'\n    random_seed = 0\n    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n\n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')\n    # step 1-2: load the training data\n    predict_seq_label_reader.load_data(predict_file, batch_size)\n   \n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n    \n    # step 3: register the backbone in reader\n    predict_seq_label_reader.register_with(pred_ernie)\n\n    # step 4: create the task output head\n    seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict')\n    \n    # step 5-1: create a task trainer\n    trainer_seq_label = palm.Trainer(\"slot\")\n    # step 5-2: build forward graph with backbone and task head\n    trainer_seq_label.build_predict_forward(pred_ernie, seq_label_pred_head)\n    \n    # step 6: load checkpoint\n    pred_model_path = './outputs/ckpt.step4641'\n    trainer_seq_label.load_ckpt(pred_model_path)\n    \n    # step 7: fit prepared reader and data\n    trainer_seq_label.fit_reader(predict_seq_label_reader, phase='predict')\n   \n    # step 8: 
predict\n    print('predicting..')\n    trainer_seq_label.predict(print_steps=print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/multi-task/process.py",
    "content": "import os\nimport json\n\nlabel_new = \"data/atis/atis_slot/label_map.json\"\nlabel_old = \"data/atis/atis_slot/map_tag_slot_id.txt\"\ntrain_old = \"data/atis/atis_slot/train.txt\"\ntrain_new = \"data/atis/atis_slot/train.tsv\"\ndev_old = \"data/atis/atis_slot/dev.txt\"\ndev_new = \"data/atis/atis_slot/dev.tsv\"\ntest_old = \"data/atis/atis_slot/test.txt\"\ntest_new = \"data/atis/atis_slot/test.tsv\"\n\n\nintent_test =  \"data/atis/atis_intent/test.tsv\"\nos.rename(\"data/atis/atis_intent/test.txt\", intent_test)\nintent_train =  \"data/atis/atis_intent/train.tsv\"\nos.rename(\"data/atis/atis_intent/train.txt\", intent_train)\nintent_dev = \"data/atis/atis_intent/dev.tsv\"\nos.rename(\"data/atis/atis_intent/dev.txt\", intent_dev)\n\nwith open(intent_dev, 'r+') as f: \n    content = f.read()  \n    f.seek(0, 0)\n    f.write(\"label\\ttext_a\\n\"+content)\nf.close()\n\nwith open(intent_test, 'r+') as f: \n    content = f.read()  \n    f.seek(0, 0)\n    f.write(\"label\\ttext_a\\n\"+content)\nf.close()\n\nwith open(intent_train, 'r+') as f: \n    content = f.read()  \n    f.seek(0, 0)\n    f.write(\"label\\ttext_a\\n\"+content)\nf.close()\n\nos.mknod(label_new)\nos.mknod(train_new)\nos.mknod(dev_new)\nos.mknod(test_new)\n\n\ntag = []\nid = []\nmap = {}\nwith open(label_old, \"r\") as f:\n    with open(label_new, \"w\") as f2:\n        for line in f.readlines():\n            line = line.split('\\t')\n            tag.append(line[0])\n            id.append(int(line[1][:-1]))\n            map[line[1][:-1]] = line[0]\n\n        re = {tag[i]:id[i] for i in range(len(tag))}\n        re = json.dumps(re)\n        f2.write(re)\n    f2.close()\nf.close()\n\n\nwith open(train_old, \"r\") as f:\n    with open(train_new, \"w\") as f2:\n        f2.write(\"text_a\\tlabel\\n\")\n        for line in f.readlines():\n            line = line.split('\\t')\n            text = line[0].split(' ')\n            label = line[1].split(' ')\n            for t in text:\n             
   f2.write(t)\n                f2.write('\\2')\n            f2.write('\\t')\n            for t in label:\n                if t.endswith('\\n'):\n                    t = t[:-1] \n                f2.write(map[t])\n                f2.write('\\2')\n            f2.write('\\n')\n    f2.close()\nf.close()\n\nwith open(test_old, \"r\") as f:\n    with open(test_new, \"w\") as f2:\n        f2.write(\"text_a\\tlabel\\n\")\n        for line in f.readlines():\n            line = line.split('\\t')\n            text = line[0].split(' ')\n            label = line[1].split(' ')\n            for t in text:\n                f2.write(t)\n                f2.write('\\2')\n            f2.write('\\t')\n            for t in label:\n                if t.endswith('\\n'):\n                    t = t[:-1] \n                f2.write(map[t])\n                f2.write('\\2')\n            f2.write('\\n')\n    f2.close()\nf.close()\n\nwith open(dev_old, \"r\") as f:\n    with open(dev_new, \"w\") as f2:\n        f2.write(\"text_a\\tlabel\\n\")\n        for line in f.readlines():\n            line = line.split('\\t')\n            text = line[0].split(' ')\n            label = line[1].split(' ')\n            for t in text:\n                f2.write(t)\n                f2.write('\\2')\n            f2.write('\\t')\n            for t in label:\n                if t.endswith('\\n'):\n                    t = t[:-1] \n                f2.write(map[t])\n                f2.write('\\2')\n            f2.write('\\n')\n    f2.close()\nf.close()\n\nos.remove(label_old)\nos.remove(train_old)\nos.remove(test_old)\nos.remove(dev_old)"
  },
  {
    "path": "examples/multi-task/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 128\n    batch_size = 16\n    num_epochs = 20\n    print_steps = 5\n    lr = 2e-5\n    num_classes = 130\n    weight_decay = 0.01\n    num_classes_intent = 26\n    dropout_prob = 0.1\n    random_seed = 0\n    label_map = './data/atis/atis_slot/label_map.json'\n    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'\n\n    train_slot = './data/atis/atis_slot/train.tsv'\n    train_intent = './data/atis/atis_intent/train.tsv'\n\n    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers \n    seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed)\n    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)\n\n    # step 1-2: load train data\n    seq_label_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size)\n    cls_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None)\n\n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register readers with ernie backbone\n    seq_label_reader.register_with(ernie)\n    cls_reader.register_with(ernie)\n\n    # step 4: create task output heads\n    seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob)\n    cls_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob)\n   \n    # step 5-1: create task trainers and multiHeadTrainer\n    trainer_seq_label = palm.Trainer(\"slot\", mix_ratio=1.0)\n    trainer_cls = palm.Trainer(\"intent\", mix_ratio=1.0)\n    trainer = palm.MultiHeadTrainer([trainer_seq_label, trainer_cls])\n    # # step 5-2: build forward graph with backbone and task head\n  
  loss1 = trainer_cls.build_forward(ernie, cls_head)\n    loss2 = trainer_seq_label.build_forward(ernie, seq_label_head)\n    loss_var = trainer.build_forward()\n\n    # step 6-1*: enable warmup for better fine-tuning\n    n_steps = seq_label_reader.num_examples * 1.5 * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    # step 6-2: build a optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n    # step 6-3: build backward graph\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n\n    # step 7: fit readers to trainer\n    trainer.fit_readers_with_mixratio([seq_label_reader, cls_reader], \"slot\", num_epochs)\n\n    # step 8-1*: load pretrained model\n    trainer.load_pretrain('./pretrain/ERNIE-v2-en-base')\n    # step 8-2*: set saver to save models during training\n    trainer.set_saver(save_path='./outputs/', save_steps=300)\n    # step 8-3: start training\n    trainer.train(print_steps=10)\n"
  },
  {
    "path": "examples/predict/README.md",
    "content": "## Example 5: Prediction\nThis example demonstrates how to directly do prediction with PaddlePALM. You can either initialize the model from a checkpoint, a pretrained model or just randomly initialization. Here we reuse the task and data in example 1. Hence repeat the step 1 in example 1 to pretrain data. \n\nAfter you have prepared the pre-training model and the data set required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to specify a specific gpu or use multiple gpus for predict, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**\n\n\nSome logs will be shown below:\n\n```\nstep 1/154, speed: 0.51 steps/s\nstep 2/154, speed: 3.36 steps/s\nstep 3/154, speed: 3.48 steps/s\n```\n\n\nAfter the run, you can view the predictions in the `outputs/predict` folder. 
Here are some examples of predictions:\n\n\n```\n{\"index\": 0, \"logits\": [-0.2014336884021759, 0.6799028515815735], \"probs\": [0.29290086030960083, 0.7070990800857544], \"label\": 1}\n{\"index\": 1, \"logits\": [0.8593899011611938, -0.29743513464927673], \"probs\": [0.7607553601264954, 0.23924466967582703], \"label\": 0}\n{\"index\": 2, \"logits\": [0.7462944388389587, -0.7083730101585388], \"probs\": [0.8107157349586487, 0.18928426504135132], \"label\": 0}\n```\n\n### Step 3: Evaluate\n\nOnce you have the prediction, you can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata num: 1200\naccuracy: 0.4758, precision: 0.4730, recall: 0.3026, f1: 0.3691\n```\n"
  },
  {
    "path": "examples/predict/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://ernie.bj.bcebos.com/task_data_zh.tgz\"\ndownlaod_path = os.path.join(os.path.dirname(abs_path), \"task_data_zh.tgz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(downlaod_path, download_url)\n\ntar = tarfile.open(downlaod_path)\ntar.extractall(target_dir)\nos.remove(downlaod_path)\n\nabs_path = os.path.abspath(__file__)\ndst_dir = os.path.join(os.path.dirname(abs_path), \"data\")\nif not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):\n    os.makedirs(dst_dir)\n\nfor file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):\n    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)\n\nshutil.rmtree(os.path.join(target_dir, 'task_data'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/predict/evaluate.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels) \n    return (preds == labels).mean()\n\ndef pre_recall_f1(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    # recall=TP/(TP+FN)\n    tp = np.sum((labels == '1') & (preds == '1'))\n    fp = np.sum((labels == '0') & (preds == '1'))\n    fn = np.sum((labels == '1') & (preds == '0'))\n    r = tp * 1.0 / (tp + fn)\n    # Precision=TP/(TP+FP)\n    p = tp * 1.0 / (tp + fp)\n    epsilon = 1e-31\n    f1 = 2 * p * r / (p+r+epsilon)\n    return p, r, f1\n\n\ndef res_evaluate(res_dir=\"./outputs/predict/predictions.json\", eval_phase='test'):\n    if eval_phase == 'test':\n        data_dir=\"./data/test.tsv\"\n    elif eval_phase == 'dev':\n        data_dir=\"./data/dev.tsv\"\n    else:\n        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'\n    \n    labels = []\n    with open(data_dir, \"r\") as file:\n        first_flag = True\n        for line in file:\n            line = line.split(\"\\t\")\n            label = line[0]\n            if label=='label':\n                continue\n            labels.append(str(label))\n    file.close()\n\n    preds = []\n    with open(res_dir, \"r\") as file:\n        for line in file.readlines():\n            line = json.loads(line)\n            pred = line['label']\n            preds.append(str(pred))\n    file.close()\n    assert len(labels) == len(preds), \"prediction result doesn't match to labels\"\n    print('data num: {}'.format(len(labels)))\n    p, r, f1 = pre_recall_f1(preds, labels)\n    print(\"accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}\".format(accuracy(preds, labels), p, r, f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/predict/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 256\n    batch_size = 8\n    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'\n    predict_file = './data/test.tsv'\n    random_seed = 1\n    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n    num_classes = 2\n    task_name = 'chnsenticorp'\n    pred_output = './outputs/predict/'\n    print_steps = 20\n    pre_params = './pretrain/ERNIE-v1-zh-base/params'\n\n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')\n    # step 1-2: load the training data\n    predict_cls_reader.load_data(predict_file, batch_size)\n    \n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_cls_reader.register_with(pred_ernie)\n    \n    # step 4: create the task output head\n    cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')\n    \n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, cls_pred_head)\n \n    # step 6: load checkpoint\n    trainer.load_predict_model(pre_params)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_cls_reader, phase='predict')\n\n    # step 8: predict\n    print('predicting..')\n    trainer.predict(print_steps=print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/tagging/README.md",
    "content": "## Example 3: Tagging\nThis task is a named entity recognition task. The following sections detail model preparation, dataset preparation, and how to run the task.\n\n### Step 1: Prepare Pre-trained Models & Datasets\n\n#### Pre-trianed Model\n\nThe pre-training model of this mission is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).\n\nMake sure you have downloaded the required pre-training model in the current folder.\n\n\n#### Dataset\n\nThis task uses the `MSRA-NER(SIGHAN2006)` dataset. \n\nDownload dataset:\n```shell\npython download.py\n```\n\nIf everything goes well, there will be a folder named `data/`  created with all the datas in it.\n\nThe data should have 2 fields,  `text_a  label`, with tsv format. Here is some example datas:\n\n ```\ntext_a  label\n在 这 里 恕 弟 不 恭 之 罪 ， 敢 在 尊 前 一 诤 ： 前 人 论 书 ， 每 曰 “ 字 字 有 来 历 ， 笔 笔 有 出 处 ” ， 细 读 公 字 ， 何 尝 跳 出 前 人 藩 篱 ， 自 隶 变 而 后 ， 直 至 明 季 ， 兄 有 何 新 出 ？    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O\n相 比 之 下 ， 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 ， 但 乏 善 可 陈 。   O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O\n理 由 多 多 ， 最 无 奈 的 却 是 ： 5 月 恰 逢 双 重 考 试 ， 她 攻 读 的 博 士 学 位 论 文 要 通 考 ； 她 任 教 的 两 所 学 校 ， 也 要 在 这 段 时 日 大 考 。    O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O\n ```\n\n\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. 
For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**\n\nSome logs will be shown below:\n\n```\nstep 1/652 (epoch 0), loss: 216.002, speed: 0.32 steps/s\nstep 2/652 (epoch 0), loss: 202.567, speed: 1.28 steps/s\nstep 3/652 (epoch 0), loss: 170.677, speed: 1.05 steps/s\n```\n\nAfter the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:\n\n\n```\n[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]\n[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]\n[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]\n```\n\n### Step 3: Evaluate\n\nOnce you have the prediction, you can run the evaluation script to evaluate the model:\n\n```python\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata num: 4636\nf1: 0.9918\n```\n"
  },
  {
    "path": "examples/tagging/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://ernie.bj.bcebos.com/task_data_zh.tgz\"\ndownlaod_path = os.path.join(os.path.dirname(abs_path), \"task_data_zh.tgz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(downlaod_path, download_url)\n\ntar = tarfile.open(downlaod_path)\ntar.extractall(target_dir)\nos.remove(downlaod_path)\n\nabs_path = os.path.abspath(__file__)\ndst_dir = os.path.join(os.path.dirname(abs_path), \"data\")\nif not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):\n    os.makedirs(dst_dir)\n\nfor file in os.listdir(os.path.join(target_dir, 'task_data', 'msra_ner')):\n    shutil.move(os.path.join(target_dir, 'task_data', 'msra_ner', file), dst_dir)\n\nshutil.rmtree(os.path.join(target_dir, 'task_data'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/tagging/evaluate.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\n\n\ndef load_label_map(map_dir=\"./data/label_map.json\"):\n    \"\"\"\n    :param map_dir: dict indictuing chunk type\n    :return:\n    \"\"\"\n    return json.load(open(map_dir, \"r\"))\n\n\ndef cal_chunk(pred_label, refer_label):\n    tp = dict()\n    fn = dict()\n    fp = dict()\n    for i in range(len(refer_label)):\n        if refer_label[i] == pred_label[i]:\n            if refer_label[i] not in tp:\n                tp[refer_label[i]] = 0\n            tp[refer_label[i]] += 1\n        else:\n            if pred_label[i] not in fp:\n                fp[pred_label[i]] = 0\n            fp[pred_label[i]] += 1\n            if refer_label[i] not in fn:\n                fn[refer_label[i]] = 0\n            fn[refer_label[i]] += 1\n\n    tp_total = sum(tp.values())\n    fn_total = sum(fn.values())\n    fp_total = sum(fp.values())\n    p_total = float(tp_total) / (tp_total + fp_total)\n    r_total = float(tp_total) / (tp_total + fn_total)\n    f_micro = 2 * p_total * r_total / (p_total + r_total)\n\n    return f_micro\n\n\ndef res_evaluate(res_dir=\"./outputs/predict/predictions.json\", data_dir=\"./data/test.tsv\"):\n    label_map = load_label_map()\n\n    total_label = []\n    with open(data_dir, \"r\") as file:\n        first_flag = True\n        for line in file:\n            if first_flag:\n                first_flag = False\n                continue\n            line = line.strip(\"\\n\")\n            if len(line) == 0:\n                continue\n            line = line.split(\"\\t\")\n            if len(line) < 2:\n                continue\n            labels = line[1].split(\"\\x02\")\n            total_label.append(labels)\n    total_label = [[label_map[j] for j in i] for i in total_label]\n  \n    total_res = []\n    with open(res_dir, \"r\") as file:\n        cnt = 0\n        for line in file:\n            line = line.strip(\"\\n\")\n            if len(line) == 0:\n                continue\n            
try:\n                res_arr = json.loads(line)\n\n                if len(total_label[cnt]) < len(res_arr):\n                    total_res.append(res_arr[1: 1 + len(total_label[cnt])])\n                elif len(total_label[cnt]) == len(res_arr):\n                    total_res.append(res_arr)\n                else:\n                    total_res.append(res_arr)\n                    total_label[cnt] = total_label[cnt][: len(res_arr)]\n            except:\n                print(\"json format error: {}\".format(cnt))\n                print(line)\n\n            cnt += 1\n\n    total_res_equal = []\n    total_label_equal = []\n    assert len(total_label) == len(total_res), \"prediction result doesn't match to labels\"\n    for i in range(len(total_label)):\n        num = len(total_label[i])\n        total_label_equal.extend(total_label[i])\n        total_res[i] = total_res[i][:num]\n        total_res_equal.extend(total_res[i])\n\n    f1 = cal_chunk(total_res_equal, total_label_equal)\n    print('data num: {}'.format(len(total_label)))\n    print(\"f1: {:.4f}\".format(f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/tagging/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n \n    # configs\n    max_seqlen = 256\n    batch_size = 16\n    num_epochs = 6\n    lr = 5e-5\n    num_classes = 7\n    weight_decay = 0.01\n    dropout_prob = 0.1\n    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'\n    label_map = './data/label_map.json'\n    random_seed = 1\n    train_file = './data/train.tsv'\n    predict_file = './data/test.tsv'\n    \n    save_path='./outputs/'\n    save_type='ckpt' \n    pre_params = './pretrain/ERNIE-v1-zh-base/params'\n    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))\n    input_dim = config['hidden_size']  \n    task_name = 'msra_ner'\n    pred_output = './outputs/predict/'\n    train_print_steps = 10\n    pred_print_steps = 20\n    \n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers for training\n    seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed)\n    # step 1-2: load the training data\n    seq_label_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)\n    \n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register the backbone in reader\n    seq_label_reader.register_with(ernie)\n\n    # step 4: create the task output head\n    seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob)\n\n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    loss_var = trainer.build_forward(ernie, seq_label_head)\n\n    # step 6-1*: use warmup\n    n_steps = seq_label_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    print('total_steps: {}'.format(n_steps))\n    print('warmup_steps: {}'.format(warmup_steps))\n    
sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    # step 6-2: create an optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n    # step 6-3: build backward\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(seq_label_reader)\n\n    # step 8-1*: load pretrained parameters\n    trainer.load_pretrain(pre_params)\n    # step 8-2*: set saver to save model\n    save_steps = 1951\n    trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)\n    # step 8-3: start training\n    trainer.train(print_steps=train_print_steps)\n\n    # -----------------------  for prediction ----------------------- \n\n    # step 1-1: create readers for prediction\n    print('prepare to predict...')\n    predict_seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')\n    # step 1-2: load the prediction data\n    predict_seq_label_reader.load_data(predict_file, batch_size)\n\n    # step 2: create a backbone of the model to extract text features\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')\n\n    # step 3: register the backbone in reader\n    predict_seq_label_reader.register_with(pred_ernie)\n\n    # step 4: create the task output head\n    seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict')\n\n    # step 5: build forward graph with backbone and task head\n    trainer.build_predict_forward(pred_ernie, seq_label_pred_head)\n\n    # step 6: load checkpoint\n    pred_model_path = './outputs/ckpt.step' + str(save_steps)\n    trainer.load_ckpt(pred_model_path)\n\n    # step 7: fit prepared reader and data\n    trainer.fit_reader(predict_seq_label_reader, phase='predict')\n\n    # step 8: predict\n    print('predicting..')\n    trainer.predict(print_steps=pred_print_steps, output_dir=pred_output)\n"
  },
  {
    "path": "examples/train_with_eval/README.md",
    "content": "## Train with Evaluation version of Example 1: Classification\nThis task is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task. Here to demonstrate how to do evaluation during training in PaddlePALM. \n\n### Step 1: Prepare Pre-trained Model & Dataset\n\n#### Pre-trained Model\n\nThe pre-training model of this mission is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).\n\nMake sure you have downloaded the required pre-training model in the current folder.\n\n\n#### Dataset\n\nThis example demonstrates with [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.\n\nDownload dataset:\n```shell\npython download.py\n```\n\nIf everything goes well, there will be a folder named `data/`  created with all the data files in it.\n\nThe dataset file (for training) should have 2 fields,  `text_a` and `label`, stored with [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here shows an example:\n\n```\nlabel  text_a\n0   当当网名不符实，订货多日不见送货，询问客服只会推托，只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。\n0   XP的驱动不好找！我的17号提的货，现在就降价了100元，而且还送杀毒软件！\n1   <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!\n```\n\n### Step 2: Train & Predict\n\nThe code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:\n\n```shell\npython run.py\n```\n\nIf you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:\n\n```shell\nCUDA_VISIBLE_DEVICES=0,1 python run.py\n```\n\nNote: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. 
If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**\n\n\nSome logs will be shown below:\n\n```\nstep 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s\nstep 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s\nstep 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s\n```\n\n\nAfter the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:\n\n\n```\n{\"index\": 0, \"logits\": [-0.2014336884021759, 0.6799028515815735], \"probs\": [0.29290086030960083, 0.7070990800857544], \"label\": 1}\n{\"index\": 1, \"logits\": [0.8593899011611938, -0.29743513464927673], \"probs\": [0.7607553601264954, 0.23924466967582703], \"label\": 0}\n{\"index\": 2, \"logits\": [0.7462944388389587, -0.7083730101585388], \"probs\": [0.8107157349586487, 0.18928426504135132], \"label\": 0}\n```\n\n### Step 3: Evaluate\n\nOnce you have the prediction, you can run the evaluation script to evaluate the model:\n\n```shell\npython evaluate.py\n```\n\nThe evaluation results are as follows:\n\n```\ndata num: 1200\naccuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578\n```\n"
  },
  {
    "path": "examples/train_with_eval/download.py",
    "content": "#  -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\ndef download(src, url):\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n            percent = 1\n        print('\\r>> Downloading... {:.1%}'.format(percent), end=\"\")\n\n    URLLIB.urlretrieve(url, src, reporthook=_reporthook)\n\nabs_path = os.path.abspath(__file__)\ndownload_url = \"https://ernie.bj.bcebos.com/task_data_zh.tgz\"\ndownlaod_path = os.path.join(os.path.dirname(abs_path), \"task_data_zh.tgz\")\ntarget_dir = os.path.dirname(abs_path)\ndownload(downlaod_path, download_url)\n\ntar = tarfile.open(downlaod_path)\ntar.extractall(target_dir)\nos.remove(downlaod_path)\n\nabs_path = os.path.abspath(__file__)\ndst_dir = os.path.join(os.path.dirname(abs_path), \"data\")\nif not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):\n    os.makedirs(dst_dir)\n\nfor file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):\n    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)\n\nshutil.rmtree(os.path.join(target_dir, 'task_data'))\nprint(\" done!\")\n"
  },
  {
    "path": "examples/train_with_eval/evaluate.py",
    "content": "#  -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels) \n    return (preds == labels).mean()\n\ndef pre_recall_f1(preds, labels):\n    preds = np.array(preds)\n    labels = np.array(labels)\n    # recall=TP/(TP+FN)\n    tp = np.sum((labels == '1') & (preds == '1'))\n    fp = np.sum((labels == '0') & (preds == '1'))\n    fn = np.sum((labels == '1') & (preds == '0'))\n    r = tp * 1.0 / (tp + fn)\n    # Precision=TP/(TP+FP)\n    p = tp * 1.0 / (tp + fp)\n    epsilon = 1e-31\n    f1 = 2 * p * r / (p+r+epsilon)\n    return p, r, f1\n\n\ndef res_evaluate(res_dir=\"./outputs/predict/predictions.json\", eval_phase='test'):\n    if eval_phase == 'test':\n        data_dir=\"./data/test.tsv\"\n    elif eval_phase == 'dev':\n        data_dir=\"./data/dev.tsv\"\n    else:\n        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'\n    \n    labels = []\n    with open(data_dir, \"r\") as file:\n        first_flag = True\n        for line in file:\n            line = line.split(\"\\t\")\n            label = line[0]\n            if label=='label':\n                continue\n            labels.append(str(label))\n    file.close()\n\n    preds = []\n    with open(res_dir, \"r\") as file:\n        for line in file.readlines():\n            line = json.loads(line)\n            pred = line['label']\n            preds.append(str(pred))\n    file.close()\n    assert len(labels) == len(preds), \"prediction result doesn't match to labels\"\n    print('data num: {}'.format(len(labels)))\n    p, r, f1 = pre_recall_f1(preds, labels)\n    print(\"accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}\".format(accuracy(preds, labels), p, r, f1))\n\nres_evaluate()\n"
  },
  {
    "path": "examples/train_with_eval/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n    # configs\n    max_seqlen = 256\n    batch_size = 8\n    num_epochs = 10\n    lr = 5e-5\n    weight_decay = 0.01\n    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'\n\n    train_file = './data/train.tsv'\n    predict_file = './data/test.tsv'\n    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))\n    input_dim = config['hidden_size']\n    num_classes = 2\n    dropout_prob = 0.1\n    random_seed = 1\n    task_name = 'chnsenticorp'\n    save_path = './outputs/'\n    pred_output = './outputs/predict/'\n    save_type = 'ckpt'\n    print_steps = 20\n    pre_params = './pretrain/ERNIE-v1-zh-base/params'\n\n    # -----------------------  for training ----------------------- \n\n    # step 1-1: create readers for training\n    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)\n    # step 1-2: load the training data\n    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)\n\n    # step 2: create a backbone of the model to extract text features\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # step 3: register the backbone in reader\n    cls_reader.register_with(ernie)\n\n    # step 4: create the task output head\n    cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob)\n\n    # step 5-1: create a task trainer\n    trainer = palm.Trainer(task_name)\n    # step 5-2: build forward graph with backbone and task head\n    loss_var = trainer.build_forward(ernie, cls_head)\n\n    # step 6-1*: use warmup\n    n_steps = cls_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    # step 6-2: create a optimizer\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n    # step 6-3: build backward\n    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)\n  \n  
  # step 7: fit the prepared reader and data\n    iterator = trainer.fit_reader(cls_reader)\n    \n    # step 8-1*: load pretrained parameters\n    trainer.load_pretrain(pre_params)\n    # step 8-2*: set saver to save model\n    # save_steps = n_steps \n    save_steps = 2396\n    trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type)\n\n    # step 8-3: start training\n    # you can repeatedly get one training batch with trainer.get_one_batch()\n    # batch = trainer.get_one_batch()\n    for step, batch in enumerate(iterator, start=1):\n        trainer.train_one_step(batch)\n        if step % 100 == 0:\n            print('do evaluation.')\n            # insert evaluation code here\n   \n"
  },
  {
    "path": "paddlepalm/__init__.py",
    "content": "from . import downloader\nfrom . import optimizer\nfrom . import lr_sched\nfrom . import backbone\nfrom . import reader\nfrom . import head\n\nfrom .trainer import Trainer\nfrom .multihead_trainer import MultiHeadTrainer\n"
  },
  {
    "path": "paddlepalm/_downloader.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nfrom collections import OrderedDict\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.version_info >= (3, 0):\n    import urllib.request\n    URLLIB=urllib.request\n\n__all__ = [\"download\", \"ls\"]\n\n_pretrain = (('RoBERTa-zh-base', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_ext_L-12_H-768_A-12.tar.gz'),\n            ('RoBERTa-zh-large', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.tar.gz'),\n            ('ERNIE-v2-en-base', 'https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz'),\n            ('ERNIE-v2-en-large', 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz'),\n            ('XLNet-cased-base','https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz'),\n            ('XLNet-cased-large','https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz'),\n            ('ERNIE-v1-zh-base','https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz'),\n            ('ERNIE-v1-zh-base-max-len-512','https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz'),\n            ('BERT-en-uncased-large-whole-word-masking','https://bert-models.bj.bcebos.com/wwm_uncased_L-24_H-1024_A-16.tar.gz'),\n            
('BERT-en-cased-large-whole-word-masking','https://bert-models.bj.bcebos.com/wwm_cased_L-24_H-1024_A-16.tar.gz'),\n            ('BERT-en-uncased-base', 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz'),\n            ('BERT-en-uncased-large', 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz'),\n            ('BERT-en-cased-base','https://bert-models.bj.bcebos.com/cased_L-12_H-768_A-12.tar.gz'),\n            ('BERT-en-cased-large','https://bert-models.bj.bcebos.com/cased_L-24_H-1024_A-16.tar.gz'),\n            ('BERT-multilingual-uncased-base','https://bert-models.bj.bcebos.com/multilingual_L-12_H-768_A-12.tar.gz'),\n            ('BERT-multilingual-cased-base','https://bert-models.bj.bcebos.com/multi_cased_L-12_H-768_A-12.tar.gz'),\n            ('BERT-zh-base','https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz'),\n            ('utils', None))\n_vocab = (('utils', None),('utils', None))\n_backbone = (('utils', None),('utils', None))\n_head = (('utils', None),('utils', None))\n_reader = (('utils', None),('utils', None))\n\n_items = (('pretrain', OrderedDict(_pretrain)),\n        ('vocab', OrderedDict(_vocab)), \n        ('backbone', OrderedDict(_backbone)),\n        ('head', OrderedDict(_head)),\n        ('reader', OrderedDict(_reader))\n)\n_items = OrderedDict(_items)\n\ndef _download(item, scope, path, silent=False, convert=False):\n    data_url = _items[item][scope]\n    if data_url is None:\n        return\n    if not silent:\n        print('Downloading {}: {} from {}...'.format(item, scope, data_url))\n    data_dir = path + '/' + item + '/' + scope\n    if not os.path.exists(data_dir):\n        os.makedirs(data_dir)\n    data_name = data_url.split('/')[-1]\n    filename = data_dir + '/' + data_name\n\n    # print download progress\n    def _reporthook(count, chunk_size, total_size):\n        bytes_so_far = count * chunk_size\n        percent = float(bytes_so_far) / float(total_size)\n        if percent > 1:\n    
        percent = 1\n        if not silent:\n            print('\\r>> Downloading... {:.1%}'.format(percent), end = \"\")\n    \n    URLLIB.urlretrieve(data_url, filename, reporthook=_reporthook)\n    if not silent:\n        print(' done!')\n    \n    if item == 'pretrain':\n        if not silent:\n            print ('Extracting {}...'.format(data_name), end=\" \")\n        if os.path.exists(filename):\n            tar = tarfile.open(filename, 'r')\n            tar.extractall(path = data_dir)\n            tar.close()\n            os.remove(filename)\n        if len(os.listdir(data_dir))==1:\n            source_path = data_dir + '/' + data_name.split('.')[0]\n            fileList = os.listdir(source_path)\n            for file in fileList:\n                filePath = os.path.join(source_path, file)\n                shutil.move(filePath, data_dir)\n            os.removedirs(source_path)\n        if not silent:\n            print ('done!')\n        if convert:\n            if not silent:\n                print ('Converting params...', end=\" \")\n            _convert(data_dir, silent)\n        if not silent:\n            print ('done!')\n\n\ndef _convert(path, silent=False):\n    if os.path.isfile(path + '/params/__palminfo__'):\n        if not silent:\n            print ('already converted.')\n    else:\n        if os.path.exists(path + '/params/'):\n            os.rename(path + '/params/', path + '/params1/')\n            os.mkdir(path + '/params/')\n            tar_model = tarfile.open(path + '/params/' + '__palmmodel__', 'w')\n            tar_info = open(path + '/params/'+ '__palminfo__', 'w')\n            for root, dirs, files in os.walk(path + '/params1/'):\n                for file in files:\n                    src_file = os.path.join(root, file)\n                    tar_model.add(src_file, '__paddlepalm_' + file)\n                    tar_info.write('__paddlepalm_' + file)\n                    os.remove(src_file)\n            tar_model.close()\n            
tar_info.close()\n            os.removedirs(path + '/params1/') \n\ndef download(item, scope='all', path='.'):\n    \"\"\"Download an item. The available scopes and contained items can be shown with `paddlepalm.downloader.ls`.\n\n    Args:\n        item: the item to download.\n        scope: the scope of the item to download.\n        path: the target dir to download to. Default is `.`, i.e. the current dir.\n    \"\"\"\n    # item = item.lower()\n    # scope = scope.lower()\n    assert item in _items, '{} is not found. Support list: {}'.format(item, list(_items.keys()))\n   \n    if _items[item]['utils'] is not None:\n        _download(item, 'utils', path, silent=True)\n\n    if scope != 'all':\n        assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys()))\n        _download(item, scope, path)\n    else:\n        for s in _items[item].keys():\n            _download(item, s, path)\n\n\ndef _ls(item, scope, l=10):\n    if scope != 'all':\n        assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys()))\n        print('{}'.format(scope))\n    else:\n        for s in _items[item].keys():\n            if s == 'utils':\n                continue\n            print('  => ' + s)\n\ndef ls(item='all', scope='all'):\n    \"\"\"List the downloadable items. With the default arguments, all items of all categories are listed.\"\"\"\n    if scope == 'utils':\n        return\n    if item != 'all':\n        assert item in _items, '{} is not found. Support items: {}'.format(item, list(_items.keys()))\n        print('Available {} items:'.format(item))\n        _ls(item, scope)\n    else:\n        l = max(map(len, _items.keys()))\n        for i in _items.keys():\n            print('Available {} items: '.format(i))\n            _ls(i, scope, l)\n"
  },
  {
    "path": "paddlepalm/backbone/README.md",
    "content": ""
  },
  {
    "path": "paddlepalm/backbone/__init__.py",
    "content": "\nfrom .ernie import ERNIE\nfrom .bert import BERT\n\n"
  },
  {
    "path": "paddlepalm/backbone/base_backbone.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nclass Backbone(object):\n    \"\"\"Interface of a backbone model.\"\"\"\n\n    def __init__(self, phase):\n        \"\"\"Construct the backbone network. An implementation takes at least a `phase` argument.\n        Note: when implementing this constructor, the base-class constructor must be called\n        to create the framework's built-in member variables.\n        Args:\n            phase: str. The running stage the backbone is invoked in; the training stage\n                'train' and the prediction stage 'predict' are currently supported.\n            \"\"\"\n\n        assert isinstance(phase, str)\n\n    @property\n    def inputs_attr(self):\n        \"\"\"Describe the attributes of the input objects the backbone requires from the reader,\n        i.e. each object's name, shape and data type. For an object of scalar type (e.g. str,\n        int, float), set shape to the empty list []; for a dimension of variable length, set\n        the corresponding entry of shape to -1.\n\n        Return:\n            dict. Attribute descriptions of the input objects. For example, for text\n            classification and matching tasks, the reader a BERT backbone depends on\n            mainly provides the following objects:\n                {\"token_ids\": ([-1, max_len], 'int64'),\n                 \"input_ids\": ([-1, max_len], 'int64'),\n                 \"segment_ids\": ([-1, max_len], 'int64'),\n                 \"input_mask\": ([-1, max_len], 'float32')}\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"Describe the attributes of the backbone's output objects, i.e. each object's name,\n        shape and data type. For an object of scalar type (e.g. str, int, float), set shape to\n        the empty list []; for a dimension of variable length, set the corresponding entry of\n        shape to -1.\n        \n        Return:\n            dict. Attribute descriptions of the output objects. For example, for text\n            classification and matching tasks, the outputs of a BERT backbone may contain\n            the following objects:\n                {\"word_emb\": ([-1, max_seqlen, word_emb_size], 'float32'),\n                 \"sentence_emb\": ([-1, hidden_size], 'float32'),\n                 \"sim_vec\": ([-1, hidden_size], 'float32')}\"\"\" \n        raise NotImplementedError()\n\n    def build(self, inputs):\n        \"\"\"Build the backbone's computation graph, mapping static-graph Variable inputs that\n        match inputs_attr to static-graph Variable outputs that match outputs_attr.\n        Args:\n            inputs: dict. Maps the object names in inputs_attr to computation-graph Variables;\n                inputs contains at least the objects defined in inputs_attr.\n        Return:\n            The graph variables to output. They are added to the fetch_list, so their runtime\n            values are available at every training/inference step and are passed to the\n            postprocess method for user handling.\n            \"\"\"\n        raise NotImplementedError()\n"
  },
  {
    "path": "paddlepalm/backbone/bert.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"v1.1 \nBERT model.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nfrom paddle import fluid\nfrom paddle.fluid import layers\n\nfrom paddlepalm.backbone.utils.transformer import pre_process_layer, encoder\nfrom paddlepalm.backbone.base_backbone import Backbone\n\n\nclass BERT(Backbone):\n\n\n    def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \\\n          max_position_embeddings, type_vocab_size, hidden_act, hidden_dropout_prob, \\\n          attention_probs_dropout_prob, initializer_range, is_pairwise=False, phase='train'):\n     \n        self._emb_size = hidden_size\n        self._n_layer = num_hidden_layers\n        self._n_head = num_attention_heads\n        self._voc_size = vocab_size\n        self._max_position_seq_len = max_position_embeddings\n        self._sent_types = type_vocab_size\n\n       \n        self._hidden_act = hidden_act\n        self._prepostprocess_dropout = 0. if phase == 'predict' else hidden_dropout_prob\n        self._attention_dropout = 0. 
if phase == 'predict' else attention_probs_dropout_prob\n\n        self._word_emb_name = \"word_embedding\"\n        self._pos_emb_name = \"pos_embedding\"\n        self._sent_emb_name = \"sent_embedding\"\n        self._task_emb_name = \"task_embedding\"\n        self._emb_dtype = \"float32\"\n        self._phase = phase\n        self._is_pairwise = is_pairwise\n        self._param_initializer = fluid.initializer.TruncatedNormal(\n            scale=initializer_range)\n\n    @classmethod\n    def from_config(cls, config, phase='train'):\n        \n        assert 'hidden_size' in config, \"{} is required to initialize BERT\".format('hidden_size')\n        assert 'num_hidden_layers' in config, \"{} is required to initialize BERT\".format('num_hidden_layers')\n        assert 'num_attention_heads' in config, \"{} is required to initialize BERT\".format('num_attention_heads')\n        assert 'vocab_size' in config, \"{} is required to initialize BERT\".format('vocab_size')\n        assert 'max_position_embeddings' in config, \"{} is required to initialize BERT\".format('max_position_embeddings')\n        assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, \\\n            \"{} is required to initialize BERT\".format('type_vocab_size')\n        assert 'hidden_act' in config, \"{} is required to initialize BERT\".format('hidden_act')\n        assert 'hidden_dropout_prob' in config, \"{} is required to initialize BERT\".format('hidden_dropout_prob')\n        assert 'attention_probs_dropout_prob' in config, \\\n            \"{} is required to initialize BERT\".format('attention_probs_dropout_prob')\n        assert 'initializer_range' in config, \"{} is required to initialize BERT\".format('initializer_range')\n\n        hidden_size = config['hidden_size']\n        num_hidden_layers = config['num_hidden_layers']\n        num_attention_heads = config['num_attention_heads']\n        vocab_size = config['vocab_size']\n        max_position_embeddings = config['max_position_embeddings']\n        if 'sent_type_vocab_size' in config:\n            sent_type_vocab_size = config['sent_type_vocab_size']\n        else:\n            sent_type_vocab_size = config['type_vocab_size']\n\n        hidden_act = config['hidden_act']\n        hidden_dropout_prob = config['hidden_dropout_prob']\n        attention_probs_dropout_prob = config['attention_probs_dropout_prob']\n        initializer_range = config['initializer_range']\n        if 'is_pairwise' in config:\n            is_pairwise = config['is_pairwise']\n        else:\n            is_pairwise = False\n\n        return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \\\n          max_position_embeddings, sent_type_vocab_size, \\\n          hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, phase)\n\n    @property\n    def inputs_attr(self):\n        ret = {\"token_ids\": [[-1, -1], 'int64'],\n               \"position_ids\": [[-1, -1], 'int64'],\n               \"segment_ids\": [[-1, -1], 'int64'],\n               \"input_mask\": [[-1, -1, 1], 'float32'],\n               }\n        if self._is_pairwise and self._phase=='train':\n            ret.update({\"token_ids_neg\": [[-1, -1], 'int64'],\n                        \"position_ids_neg\": [[-1, -1], 'int64'],\n                        \"segment_ids_neg\": [[-1, -1], 'int64'],\n                        \"input_mask_neg\": [[-1, -1, 1], 'float32'],\n                        })\n        return ret\n\n    @property\n    def outputs_attr(self):\n        ret = {\"word_embedding\": [[-1, -1, self._emb_size], 'float32'],\n               \"embedding_table\": [[-1, self._voc_size, self._emb_size], 'float32'],\n               \"encoder_outputs\": [[-1, -1, self._emb_size], 'float32'],\n               \"sentence_embedding\": [[-1, self._emb_size], 'float32'],\n               \"sentence_pair_embedding\": [[-1, self._emb_size], 'float32']}\n        if self._is_pairwise and 
self._phase == 'train':\n            ret.update({\"word_embedding_neg\": [[-1, -1, self._emb_size], 'float32'],\n                        \"encoder_outputs_neg\": [[-1, -1, self._emb_size], 'float32'],\n                        \"sentence_embedding_neg\": [[-1, self._emb_size], 'float32'],\n                        \"sentence_pair_embedding_neg\": [[-1, self._emb_size], 'float32']})\n        return ret \n\n    def build(self, inputs, scope_name=\"\"):\n        src_ids = inputs['token_ids']\n        pos_ids = inputs['position_ids']\n        sent_ids = inputs['segment_ids']\n        input_mask = inputs['input_mask']\n\n        self._emb_dtype = 'float32'\n\n        input_buffer = {}\n        output_buffer = {}\n        input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask]\n        output_buffer['base'] = {}\n\n        if self._is_pairwise and self._phase =='train':\n            src_ids = inputs['token_ids_neg']\n            pos_ids = inputs['position_ids_neg']\n            sent_ids = inputs['segment_ids_neg']\n            input_mask = inputs['input_mask_neg']\n            input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask]\n            output_buffer['neg'] = {}\n        \n        for key, (src_ids, pos_ids, sent_ids, input_mask) in input_buffer.items():\n            # padding id in vocabulary must be set to 0\n            emb_out = fluid.embedding(\n                input=src_ids,\n                size=[self._voc_size, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._word_emb_name, initializer=self._param_initializer),\n                is_sparse=False)\n\n            # fluid.global_scope().find_var('backbone-word_embedding').get_tensor()\n            embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name)\n            \n            position_emb_out = fluid.embedding(\n                input=pos_ids,\n                
size=[self._max_position_seq_len, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._pos_emb_name, initializer=self._param_initializer))\n\n            sent_emb_out = fluid.embedding(\n                sent_ids,\n                size=[self._sent_types, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._sent_emb_name, initializer=self._param_initializer))\n\n            emb_out = emb_out + position_emb_out\n            emb_out = emb_out + sent_emb_out\n\n            emb_out = pre_process_layer(\n                emb_out, 'nd', self._prepostprocess_dropout, name=scope_name+'pre_encoder')\n\n            self_attn_mask = fluid.layers.matmul(\n                x=input_mask, y=input_mask, transpose_y=True)\n\n            self_attn_mask = fluid.layers.scale(\n                x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)\n            n_head_self_attn_mask = fluid.layers.stack(\n                x=[self_attn_mask] * self._n_head, axis=1)\n            n_head_self_attn_mask.stop_gradient = True\n\n            enc_out = encoder(\n                enc_input=emb_out,\n                attn_bias=n_head_self_attn_mask,\n                n_layer=self._n_layer,\n                n_head=self._n_head,\n                d_key=self._emb_size // self._n_head,\n                d_value=self._emb_size // self._n_head,\n                d_model=self._emb_size,\n                d_inner_hid=self._emb_size * 4,\n                prepostprocess_dropout=self._prepostprocess_dropout,\n                attention_dropout=self._attention_dropout,\n                relu_dropout=0,\n                hidden_act=self._hidden_act,\n                preprocess_cmd=\"\",\n                postprocess_cmd=\"dan\",\n                param_initializer=self._param_initializer,\n                
name=scope_name+'encoder')\n\n            \n            next_sent_feat = fluid.layers.slice(\n                input=enc_out, axes=[1], starts=[0], ends=[1])\n            next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]])\n            next_sent_feat = fluid.layers.fc(\n                input=next_sent_feat,\n                size=self._emb_size,\n                act=\"tanh\",\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+\"pooled_fc.w_0\", initializer=self._param_initializer),\n                bias_attr=scope_name+\"pooled_fc.b_0\")\n            output_buffer[key]['word_embedding'] = emb_out\n            output_buffer[key]['encoder_outputs'] = enc_out\n            output_buffer[key]['sentence_embedding'] = next_sent_feat\n            output_buffer[key]['sentence_pair_embedding'] = next_sent_feat\n        \n        ret = {}\n        ret['embedding_table'] = embedding_table\n        ret['word_embedding'] = output_buffer['base']['word_embedding']\n        ret['encoder_outputs'] = output_buffer['base']['encoder_outputs']\n        ret['sentence_embedding'] = output_buffer['base']['sentence_embedding']\n        ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding']\n\n        if self._is_pairwise and self._phase == 'train':\n            ret['word_embedding_neg'] = output_buffer['neg']['word_embedding']\n            ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs']\n            ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding']\n            ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding']\n        \n        return ret\n                    \n    def postprocess(self, rt_outputs):\n        pass\n\n\nclass Model(BERT):\n    \"\"\"BERT wrapper for ConfigController\"\"\"\n    def __init__(self, config, phase):\n        BERT.from_config(config, phase=phase)\n\n\n"
  },
  {
    "path": "paddlepalm/backbone/ernie.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"ERNIE model.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nfrom paddle import fluid\nfrom paddle.fluid import layers\n\nfrom paddlepalm.backbone.utils.transformer import pre_process_layer, encoder\nfrom paddlepalm.backbone.base_backbone import Backbone\n\n\nclass ERNIE(Backbone):\n    \n    def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \\\n          max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \\\n          hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise=False, use_task_emb=True, phase='train'):\n\n        # self._is_training = phase == 'train'  # a backbone generally need not track the running phase, since its outputs barely change across phases\n \n        self._emb_size = hidden_size\n        self._n_layer = num_hidden_layers\n        self._n_head = num_attention_heads\n        self._voc_size = vocab_size\n        self._max_position_seq_len = max_position_embeddings\n        self._sent_types = sent_type_vocab_size\n\n        self._task_types = task_type_vocab_size\n\n        self._hidden_act = hidden_act\n        self._prepostprocess_dropout = 0. 
if phase == 'predict' else hidden_dropout_prob\n        self._attention_dropout = 0. if phase == 'predict' else attention_probs_dropout_prob\n\n        self._word_emb_name = \"word_embedding\"\n        self._pos_emb_name = \"pos_embedding\"\n        self._sent_emb_name = \"sent_embedding\"\n        self._task_emb_name = \"task_embedding\"\n        self._emb_dtype = \"float32\"\n        self._is_pairwise = is_pairwise\n        self._use_task_emb = use_task_emb\n        self._phase=phase\n        self._param_initializer = fluid.initializer.TruncatedNormal(\n            scale=initializer_range)\n\n    @classmethod\n    def from_config(cls, config, phase='train'):\n        assert 'hidden_size' in config, \"{} is required to initialize ERNIE\".format('hidden_size')\n        assert 'num_hidden_layers' in config, \"{} is required to initialize ERNIE\".format('num_hidden_layers')\n        assert 'num_attention_heads' in config, \"{} is required to initialize ERNIE\".format('num_attention_heads')\n        assert 'vocab_size' in config, \"{} is required to initialize ERNIE\".format('vocab_size')\n        assert 'max_position_embeddings' in config, \"{} is required to initialize ERNIE\".format('max_position_embeddings')\n        assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, \"{} is required to initialize ERNIE\".format('sent_type_vocab_size')\n        # assert 'task_type_vocab_size' in config, \"{} is required to initialize ERNIE\".format('task_type_vocab_size')\n        assert 'hidden_act' in config, \"{} is required to initialize ERNIE\".format('hidden_act')\n        assert 'hidden_dropout_prob' in config, \"{} is required to initialize ERNIE\".format('hidden_dropout_prob')\n        assert 'attention_probs_dropout_prob' in config, \"{} is required to initialize ERNIE\".format('attention_probs_dropout_prob')\n        assert 'initializer_range' in config, \"{} is required to initialize ERNIE\".format('initializer_range')\n\n        hidden_size = 
config['hidden_size']\n        num_hidden_layers = config['num_hidden_layers']\n        num_attention_heads = config['num_attention_heads']\n        vocab_size = config['vocab_size']\n        max_position_embeddings = config['max_position_embeddings']\n        if 'sent_type_vocab_size' in config:\n            sent_type_vocab_size = config['sent_type_vocab_size']\n        else:\n            sent_type_vocab_size = config['type_vocab_size']\n        if 'task_type_vocab_size' in config:\n            task_type_vocab_size = config['task_type_vocab_size']\n        else:\n            task_type_vocab_size = config['type_vocab_size']\n        if 'use_task_emb' in config:\n            use_task_emb = config['use_task_emb']\n        else:\n            use_task_emb = True\n        hidden_act = config['hidden_act']\n        hidden_dropout_prob = config['hidden_dropout_prob']\n        attention_probs_dropout_prob = config['attention_probs_dropout_prob']\n        initializer_range = config['initializer_range']\n        if 'is_pairwise' in config:\n            is_pairwise = config['is_pairwise']\n        else:\n            is_pairwise = False\n        \n        return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \\\n          max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \\\n          hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, use_task_emb=use_task_emb, phase=phase)\n\n    @property\n    def inputs_attr(self):\n        ret = {\"token_ids\": [[-1, -1], 'int64'],\n               \"position_ids\": [[-1, -1], 'int64'],\n               \"segment_ids\": [[-1, -1], 'int64'],\n               \"input_mask\": [[-1, -1, 1], 'float32'],\n               \"task_ids\": [[-1,-1], 'int64']}\n        if self._is_pairwise and self._phase=='train':\n            ret.update({\"token_ids_neg\": [[-1, -1], 'int64'],\n                        \"position_ids_neg\": [[-1, -1], 'int64'],\n                        
\"segment_ids_neg\": [[-1, -1], 'int64'],\n                        \"input_mask_neg\": [[-1, -1, 1], 'float32'],\n                        \"task_ids_neg\": [[-1,-1], 'int64']\n                        })\n        return ret\n                \n\n    @property\n    def outputs_attr(self):\n        ret = {\"word_embedding\": [[-1, -1, self._emb_size], 'float32'],\n               \"embedding_table\": [[-1, self._voc_size, self._emb_size], 'float32'],\n               \"encoder_outputs\": [[-1, -1, self._emb_size], 'float32'],\n               \"sentence_embedding\": [[-1, self._emb_size], 'float32'],\n               \"sentence_pair_embedding\": [[-1, self._emb_size], 'float32']}\n        if self._is_pairwise and self._phase == 'train':\n            ret.update({\"word_embedding_neg\": [[-1, -1, self._emb_size], 'float32'],\n                        \"encoder_outputs_neg\": [[-1, -1, self._emb_size], 'float32'],\n                        \"sentence_embedding_neg\": [[-1, self._emb_size], 'float32'],\n                        \"sentence_pair_embedding_neg\": [[-1, self._emb_size], 'float32']})\n        return ret \n\n    def build(self, inputs, scope_name=\"\"):\n        src_ids = inputs['token_ids']\n        pos_ids = inputs['position_ids']\n        sent_ids = inputs['segment_ids']\n        input_mask = inputs['input_mask']\n        task_ids = inputs['task_ids']\n\n        input_buffer = {}\n        output_buffer = {}\n        input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids]\n        output_buffer['base'] = {}\n\n        if self._is_pairwise and self._phase =='train':\n            src_ids = inputs['token_ids_neg']\n            pos_ids = inputs['position_ids_neg']\n            sent_ids = inputs['segment_ids_neg']\n            input_mask = inputs['input_mask_neg']\n            task_ids = inputs['task_ids_neg']\n            input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids]\n            output_buffer['neg'] = {}\n\n        for key, 
(src_ids, pos_ids, sent_ids, input_mask, task_ids) in input_buffer.items():\n            # padding id in vocabulary must be set to 0\n            emb_out = fluid.embedding(\n                input=src_ids,\n                size=[self._voc_size, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._word_emb_name, initializer=self._param_initializer),\n                is_sparse=False)\n        \n            # fluid.global_scope().find_var('backbone-word_embedding').get_tensor()\n            embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name)\n            \n            position_emb_out = fluid.embedding(\n                input=pos_ids,\n                size=[self._max_position_seq_len, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._pos_emb_name, initializer=self._param_initializer))\n\n            sent_emb_out = fluid.embedding(\n                sent_ids,\n                size=[self._sent_types, self._emb_size],\n                dtype=self._emb_dtype,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+self._sent_emb_name, initializer=self._param_initializer))\n\n            emb_out = emb_out + position_emb_out\n            emb_out = emb_out + sent_emb_out\n\n            if self._use_task_emb:\n                task_emb_out = fluid.embedding(\n                    task_ids,\n                    size=[self._task_types, self._emb_size],\n                    dtype=self._emb_dtype,\n                    param_attr=fluid.ParamAttr(\n                        name=scope_name+self._task_emb_name,\n                        initializer=self._param_initializer))\n\n                emb_out = emb_out + task_emb_out\n\n            emb_out = pre_process_layer(\n                emb_out, 'nd', self._prepostprocess_dropout, 
name=scope_name+'pre_encoder')\n\n            self_attn_mask = fluid.layers.matmul(\n                x=input_mask, y=input_mask, transpose_y=True)\n\n            self_attn_mask = fluid.layers.scale(\n                x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)\n            n_head_self_attn_mask = fluid.layers.stack(\n                x=[self_attn_mask] * self._n_head, axis=1)\n            n_head_self_attn_mask.stop_gradient = True\n\n            enc_out = encoder(\n                enc_input=emb_out,\n                attn_bias=n_head_self_attn_mask,\n                n_layer=self._n_layer,\n                n_head=self._n_head,\n                d_key=self._emb_size // self._n_head,\n                d_value=self._emb_size // self._n_head,\n                d_model=self._emb_size,\n                d_inner_hid=self._emb_size * 4,\n                prepostprocess_dropout=self._prepostprocess_dropout,\n                attention_dropout=self._attention_dropout,\n                relu_dropout=0,\n                hidden_act=self._hidden_act,\n                preprocess_cmd=\"\",\n                postprocess_cmd=\"dan\",\n                param_initializer=self._param_initializer,\n                name=scope_name+'encoder')\n\n            next_sent_feat = fluid.layers.slice(\n                input=enc_out, axes=[1], starts=[0], ends=[1])\n            next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]])\n            next_sent_feat = fluid.layers.fc(\n                input=next_sent_feat,\n                size=self._emb_size,\n                act=\"tanh\",\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+\"pooled_fc.w_0\", initializer=self._param_initializer),\n                bias_attr=scope_name+\"pooled_fc.b_0\")\n            \n            output_buffer[key]['word_embedding'] = emb_out\n            output_buffer[key]['encoder_outputs'] = enc_out\n            
output_buffer[key]['sentence_embedding'] = next_sent_feat\n            output_buffer[key]['sentence_pair_embedding'] = next_sent_feat\n        \n        ret = {}\n        ret['embedding_table'] = embedding_table\n        ret['word_embedding'] = output_buffer['base']['word_embedding']\n        ret['encoder_outputs'] = output_buffer['base']['encoder_outputs']\n        ret['sentence_embedding'] = output_buffer['base']['sentence_embedding']\n        ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding']\n\n        if self._is_pairwise and self._phase == 'train':\n            ret['word_embedding_neg'] = output_buffer['neg']['word_embedding']\n            ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs']\n            ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding']\n            ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding']\n        \n        return ret\n\n    def postprocess(self, rt_outputs):\n        pass\n\n\n\nclass Model(ERNIE):\n\n    def __init__(self, config, phase):\n        ERNIE.from_config(config, phase=phase)\n\n\n"
  },
  {
    "path": "paddlepalm/backbone/utils/__init__.py",
    "content": ""
  },
  {
    "path": "paddlepalm/backbone/utils/transformer.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Transformer encoder.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nfrom functools import partial\n\nimport paddle.fluid as fluid\nimport paddle.fluid.layers as layers\n\nfrom paddle.fluid.layer_helper import LayerHelper as LayerHelper\nfrom functools import reduce # py3\ndef layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):\n    helper = LayerHelper('layer_norm', **locals())\n    mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)\n    shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)\n    variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)\n    r_stdev = layers.rsqrt(variance + epsilon)\n    norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)\n\n    param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]\n    param_dtype = norm_x.dtype\n    scale = helper.create_parameter(\n        attr=param_attr,\n        shape=param_shape,\n        dtype=param_dtype,\n        default_initializer=fluid.initializer.Constant(1.))\n    bias = helper.create_parameter(\n        attr=bias_attr,\n        shape=param_shape,\n        dtype=param_dtype,\n        is_bias=True,\n        
default_initializer=fluid.initializer.Constant(0.))\n\n    out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)\n    out = layers.elementwise_add(x=out, y=bias, axis=-1)\n\n    return out\n\n\ndef multi_head_attention(queries,\n                         keys,\n                         values,\n                         attn_bias,\n                         d_key,\n                         d_value,\n                         d_model,\n                         n_head=1,\n                         dropout_rate=0.,\n                         cache=None,\n                         param_initializer=None,\n                         name='multi_head_att'):\n    \"\"\"\n    Multi-Head Attention. Note that attn_bias is added to the logit before\n    computing softmax activation to mask certain selected positions so that\n    they will not be considered in attention weights.\n    \"\"\"\n    keys = queries if keys is None else keys\n    values = keys if values is None else values\n\n    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):\n        raise ValueError(\n            \"Inputs: queries, keys and values should all be 3-D tensors.\")\n\n    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):\n        \"\"\"\n        Add linear projection to queries, keys, and values.\n        \"\"\"\n        q = layers.fc(input=queries,\n                      size=d_key * n_head,\n                      num_flatten_dims=2,\n                      param_attr=fluid.ParamAttr(\n                          name=name + '_query_fc.w_0',\n                          initializer=param_initializer),\n                      bias_attr=name + '_query_fc.b_0')\n        k = layers.fc(input=keys,\n                      size=d_key * n_head,\n                      num_flatten_dims=2,\n                      param_attr=fluid.ParamAttr(\n                          name=name + '_key_fc.w_0',\n                          initializer=param_initializer),\n                      
bias_attr=name + '_key_fc.b_0')\n        v = layers.fc(input=values,\n                      size=d_value * n_head,\n                      num_flatten_dims=2,\n                      param_attr=fluid.ParamAttr(\n                          name=name + '_value_fc.w_0',\n                          initializer=param_initializer),\n                      bias_attr=name + '_value_fc.b_0')\n        return q, k, v\n\n    def __split_heads(x, n_head):\n        \"\"\"\n        Reshape the last dimension of input tensor x so that it becomes two\n        dimensions and then transpose. Specifically, input a tensor with shape\n        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor\n        with shape [bs, n_head, max_sequence_length, hidden_dim].\n        \"\"\"\n        hidden_size = x.shape[-1]\n        # The value 0 in shape attr means copying the corresponding dimension\n        # size of the input as the output dimension size.\n        reshaped = layers.reshape(\n            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)\n\n        # permute the dimensions into:\n        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]\n        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])\n\n    def __combine_heads(x):\n        \"\"\"\n        Transpose and then reshape the last two dimensions of input tensor x\n        so that it becomes one dimension, which is reverse to __split_heads.\n        \"\"\"\n        if len(x.shape) == 3: return x\n        if len(x.shape) != 4:\n            raise ValueError(\"Input(x) should be a 4-D Tensor.\")\n\n        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])\n        # The value 0 in shape attr means copying the corresponding dimension\n        # size of the input as the output dimension size.\n        return layers.reshape(\n            x=trans_x,\n            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],\n            inplace=True)\n\n    def scaled_dot_product_attention(q, k, v, 
attn_bias, d_key, dropout_rate):\n        \"\"\"\n        Scaled Dot-Product Attention\n        \"\"\"\n        scaled_q = layers.scale(x=q, scale=d_key**-0.5)\n        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)\n        if attn_bias:\n            product += attn_bias\n        weights = layers.softmax(product)\n        if dropout_rate:\n            weights = layers.dropout(\n                weights,\n                dropout_prob=dropout_rate,\n                dropout_implementation=\"upscale_in_train\",\n                is_test=False)\n        out = layers.matmul(weights, v)\n        return out\n\n    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)\n\n    if cache is not None:  # use cache and concat time steps\n        # Since the inplace reshape in __split_heads changes the shape of k and\n        # v, which is the cache input for next time step, reshape the cache\n        # input from the previous time step first.\n        k = cache[\"k\"] = layers.concat(\n            [layers.reshape(\n                cache[\"k\"], shape=[0, 0, d_model]), k], axis=1)\n        v = cache[\"v\"] = layers.concat(\n            [layers.reshape(\n                cache[\"v\"], shape=[0, 0, d_model]), v], axis=1)\n\n    q = __split_heads(q, n_head)\n    k = __split_heads(k, n_head)\n    v = __split_heads(v, n_head)\n\n    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,\n                                                  dropout_rate)\n\n    out = __combine_heads(ctx_multiheads)\n\n    # Project back to the model size.\n    proj_out = layers.fc(input=out,\n                         size=d_model,\n                         num_flatten_dims=2,\n                         param_attr=fluid.ParamAttr(\n                             name=name + '_output_fc.w_0',\n                             initializer=param_initializer),\n                         bias_attr=name + '_output_fc.b_0')\n    return proj_out\n\n\ndef 
positionwise_feed_forward(x,\n                              d_inner_hid,\n                              d_hid,\n                              dropout_rate,\n                              hidden_act,\n                              param_initializer=None,\n                              name='ffn'):\n    \"\"\"\n    Position-wise Feed-Forward Networks.\n    This module consists of two linear transformations with a ReLU activation\n    in between, which is applied to each position separately and identically.\n    \"\"\"\n    hidden = layers.fc(input=x,\n                       size=d_inner_hid,\n                       num_flatten_dims=2,\n                       act=hidden_act,\n                       param_attr=fluid.ParamAttr(\n                           name=name + '_fc_0.w_0',\n                           initializer=param_initializer),\n                       bias_attr=name + '_fc_0.b_0')\n    if dropout_rate:\n        hidden = layers.dropout(\n            hidden,\n            dropout_prob=dropout_rate,\n            dropout_implementation=\"upscale_in_train\",\n            is_test=False)\n    out = layers.fc(input=hidden,\n                    size=d_hid,\n                    num_flatten_dims=2,\n                    param_attr=fluid.ParamAttr(\n                        name=name + '_fc_1.w_0', initializer=param_initializer),\n                    bias_attr=name + '_fc_1.b_0')\n    return out\n\n\ndef pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,\n                           name=''):\n    \"\"\"\n    Add residual connection, layer normalization and dropout to the out tensor\n    optionally according to the value of process_cmd.\n    This will be used before or after multi-head attention and position-wise\n    feed-forward networks.\n    \"\"\"\n    for cmd in process_cmd:\n        if cmd == \"a\":  # add residual connection\n            out = out + prev_out if prev_out else out\n        elif cmd == \"n\":  # add layer normalization
\n            out_dtype = out.dtype\n            if out_dtype == fluid.core.VarDesc.VarType.FP16:\n                out = layers.cast(x=out, dtype=\"float32\")\n            out = layer_norm(\n                out,\n                begin_norm_axis=len(out.shape) - 1,\n                param_attr=fluid.ParamAttr(\n                    name=name + '_layer_norm_scale',\n                    initializer=fluid.initializer.Constant(1.)),\n                bias_attr=fluid.ParamAttr(\n                    name=name + '_layer_norm_bias',\n                    initializer=fluid.initializer.Constant(0.)))\n            if out_dtype == fluid.core.VarDesc.VarType.FP16:\n                out = layers.cast(x=out, dtype=\"float16\")\n        elif cmd == \"d\":  # add dropout\n            if dropout_rate:\n                out = layers.dropout(\n                    out,\n                    dropout_prob=dropout_rate,\n                    dropout_implementation=\"upscale_in_train\",\n                    is_test=False)\n    return out\n\n\npre_process_layer = partial(pre_post_process_layer, None)\npost_process_layer = pre_post_process_layer\n\n\ndef encoder_layer(enc_input,\n                  attn_bias,\n                  n_head,\n                  d_key,\n                  d_value,\n                  d_model,\n                  d_inner_hid,\n                  prepostprocess_dropout,\n                  attention_dropout,\n                  relu_dropout,\n                  hidden_act,\n                  preprocess_cmd=\"n\",\n                  postprocess_cmd=\"da\",\n                  param_initializer=None,\n                  name=''):\n    \"\"\"The encoder layers that can be stacked to form a deep encoder.\n    This module consists of a multi-head (self) attention followed by\n    position-wise feed-forward networks, both components accompanied\n    by post_process_layer to add residual connection, layer normalization\n    and dropout.\n    \"\"\"\n    attn_output = multi_head_attention(\n     
   pre_process_layer(\n            enc_input,\n            preprocess_cmd,\n            prepostprocess_dropout,\n            name=name + '_pre_att'),\n        None,\n        None,\n        attn_bias,\n        d_key,\n        d_value,\n        d_model,\n        n_head,\n        attention_dropout,\n        param_initializer=param_initializer,\n        name=name + '_multi_head_att')\n    attn_output = post_process_layer(\n        enc_input,\n        attn_output,\n        postprocess_cmd,\n        prepostprocess_dropout,\n        name=name + '_post_att')\n    ffd_output = positionwise_feed_forward(\n        pre_process_layer(\n            attn_output,\n            preprocess_cmd,\n            prepostprocess_dropout,\n            name=name + '_pre_ffn'),\n        d_inner_hid,\n        d_model,\n        relu_dropout,\n        hidden_act,\n        param_initializer=param_initializer,\n        name=name + '_ffn')\n    return post_process_layer(\n        attn_output,\n        ffd_output,\n        postprocess_cmd,\n        prepostprocess_dropout,\n        name=name + '_post_ffn')\n\n\ndef encoder(enc_input,\n            attn_bias,\n            n_layer,\n            n_head,\n            d_key,\n            d_value,\n            d_model,\n            d_inner_hid,\n            prepostprocess_dropout,\n            attention_dropout,\n            relu_dropout,\n            hidden_act,\n            preprocess_cmd=\"n\",\n            postprocess_cmd=\"da\",\n            param_initializer=None,\n            name=''):\n    \"\"\"\n    The encoder is composed of a stack of identical layers returned by calling\n    encoder_layer.\n    \"\"\"\n    for i in range(n_layer):\n        enc_output = encoder_layer(\n            enc_input,\n            attn_bias,\n            n_head,\n            d_key,\n            d_value,\n            d_model,\n            d_inner_hid,\n            prepostprocess_dropout,\n            attention_dropout,\n            relu_dropout,\n            hidden_act,\n   
         preprocess_cmd,\n            postprocess_cmd,\n            param_initializer=param_initializer,\n            name=name + '_layer_' + str(i))\n        enc_input = enc_output\n    enc_output = pre_process_layer(\n        enc_output, preprocess_cmd, prepostprocess_dropout, name=\"post_encoder\")\n\n    return enc_output\n"
  },
  {
    "path": "paddlepalm/distribute/__init__.py",
    "content": "from paddle import fluid\nimport os\nimport multiprocessing\n\ngpu_dev_count = int(fluid.core.get_cuda_device_count())\ncpu_dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))\n\nfrom .reader import yield_pieces, data_feeder, decode_fake\n\n"
  },
  {
    "path": "paddlepalm/distribute/reader.py",
"content": "\nfrom . import gpu_dev_count, cpu_dev_count\ntry:\n    import queue as Queue\nexcept ImportError:\n    import Queue\nfrom threading import Thread\n\ndev_count = gpu_dev_count if gpu_dev_count > 0 else cpu_dev_count\n\ndef yield_pieces(data, distribute_strategy, batch_size):\n    \"\"\"\n    Args:\n        distribute_strategy: supported strategies are s (split), c (copy) and u (unstack).\n    \"\"\"\n    assert batch_size % dev_count == 0, \"batch_size must be an integer multiple of dev_count.\"\n\n    assert type(data) == type(distribute_strategy), [type(data), type(distribute_strategy)]\n    assert len(data) == len(distribute_strategy), [len(data), len(distribute_strategy)]\n    if isinstance(data, dict):\n        keys = list(data.keys())\n        data_list = [data[i] for i in keys]\n        ds_list = [distribute_strategy[i] for i in keys]\n    else:\n        assert isinstance(data, list), \"the input data must be a list or dict containing multiple tensors.\"\n        data_list = data\n        ds_list = distribute_strategy\n    stride = batch_size // dev_count\n    p = stride\n    while p <= batch_size:\n        temp = []\n        for d, s in zip(data_list, ds_list):\n            s = s.strip().lower()\n            if s == 's' or s == 'split':\n                if p - stride >= len(d):\n                    # no more examples to feed the empty devices\n                    temp = []\n                    return\n                temp.append(d[p-stride:p])\n            elif s == 'u' or s == 'unstack':\n                assert len(d) <= dev_count, 'Tensor size on dim 0 must be less than or equal to dev_count when unstack is applied.'\n                if p//stride > len(d):\n                    # no more examples to feed the empty devices\n                    return\n                temp.append(d[p//stride-1])\n            elif s == 'c' or s 
== 'copy':\n                temp.append(d)\n            else:\n                raise NotImplementedError()\n            \n        p += stride\n        if type(data) == dict:\n            yield dict(zip(*[keys, temp]))\n        else:\n            yield temp\n\n\ndef data_feeder(reader, postprocess_fn=None, prefetch_steps=2, phase='train', is_multi=False):\n    if postprocess_fn is None:\n        def postprocess_fn(batch, id=-1, phase='train', is_multi=False):\n            return batch\n\n    def worker(reader, dev_count, queue):\n        dev_batches = []\n        for index, data in enumerate(reader()):\n            if len(dev_batches) < dev_count:\n                dev_batches.append(data)\n            if len(dev_batches) == dev_count:\n                queue.put((dev_batches, 0))\n                dev_batches = []\n        # For predicting the remaining batches, pad more batches up to\n        # the number of devices; the padded samples are removed from the\n        # prediction outputs.
\n        if len(dev_batches) > 0:\n            num_pad = dev_count - len(dev_batches)\n            for i in range(len(dev_batches), dev_count):\n                dev_batches.append(dev_batches[-1])\n            queue.put((dev_batches, num_pad))\n        queue.put(None)\n\n    queue = Queue.Queue(dev_count*prefetch_steps)\n    p = Thread(\n        target=worker, args=(reader, dev_count, queue))\n    p.daemon = True\n    p.start()\n    while True:\n        ret = queue.get()\n        queue.task_done()\n        if ret is not None:\n            batches, num_pad = ret\n            if dev_count > 1 and phase == 'train' and is_multi:\n                id = batches[0]['__task_id'][0]\n            else:\n                id = -1\n            batch_buf = []\n            flag_buf = []\n            for idx, batch in enumerate(batches):\n                # mark the trailing padded batches with flag=False\n                flag = idx-len(batches) < -num_pad\n                batch = postprocess_fn(batch, id, phase, is_multi=is_multi)\n                batch_buf.append(batch)\n                flag_buf.append(flag)\n            yield batch_buf, flag_buf\n        else:\n            break\n    queue.join()\n\n\n\ndef decode_fake(nums, mask, bs):\n    # Infers how many of the nums prediction outputs come from padded (fake)\n    # samples, given the real/fake flags of each device batch (mask) and the\n    # global batch size (bs).\n    bs //= dev_count\n    n_t = 0\n    for flag in mask:\n        if not flag:\n            break\n        n_t = n_t + 1\n\n    n_f = len(mask) - n_t\n    p1 = nums - (n_t-1) * bs\n    assert p1 % (n_f+1) == 0\n    each_f = p1 // (n_f+1)\n    return each_f * n_f\n\n"
  },
  {
    "path": "paddlepalm/downloader.py",
    "content": "from ._downloader import *\n"
  },
  {
    "path": "paddlepalm/head/__init__.py",
    "content": "\nfrom .cls import Classify\nfrom .match import Match\nfrom .ner import SequenceLabel\nfrom .mrc import MRC\nfrom .mlm import MaskLM\n"
  },
  {
    "path": "paddlepalm/head/base_head.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport os\nimport json\nimport copy\n\nclass Head(object):\n\n    def __init__(self, phase='train'):\n        \"\"\"Constructs a task head; at least a phase argument is required.\n        Note: implementations of this constructor must call the base-class constructor, which creates the framework's built-in member variables.\n        Args:\n            phase: str. The running stage in which the task head is invoked; the training stage train and the prediction stage predict are currently supported.\n        \"\"\"\n        self._stop_gradient = {}\n        self._phase = phase\n        self._prog = None\n        self._results_buffer = []\n\n    @property\n    def inputs_attrs(self):\n        \"\"\"Declaration of the task's step-level inputs.\n\n        Describes the outputs of the reader, the backbone and other task heads that this head depends on (fetched once per step). Expressed as a dict whose keys are the components producing the outputs (e.g. 'reader', 'backbone') and whose values are the sets of outputs this head requires from each component. Each output set is itself a dict that maps an output's name (which must exist in the producing component's output set) to its shape and dtype. When a dimension of an output has variable length, set the corresponding entry of shape to -1.\n\n        Return:\n            dict. Describes the step-level inputs this head depends on, i.e. the outputs of the other components.\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"Declaration of the task's step-level outputs.\n\n        Describes the outputs of this head (produced once per step), including each output's name, shape and dtype. The outputs are added to the fetch_list, so their concrete values are available at every training/inference step and can be passed to the batch_postprocess method for per-step postprocessing. For a scalar output (e.g. str, int, float), set shape to the empty list []; when a dimension of an output has variable length, set the corresponding entry of shape to -1.\n\n        Return:\n            dict. Describes the outputs produced by this head. Note that during training an output named loss must be included.\n        \"\"\"\n\n        raise NotImplementedError()\n\n    @property\n    def epoch_inputs_attrs(self):\n        \"\"\"Declaration of the task's epoch-level inputs.\n\n        Describes the outputs of the reader, the backbone and other task heads that this head depends on (produced once at the end of each epoch), such as the complete sample set or the number of valid samples. Expressed as a dict whose keys are the components producing the outputs (e.g. 'reader', 'backbone') and whose values are the sets of outputs this head requires from each component. Each output set is itself a dict that maps an output's name (which must exist in the producing component's output set) to its shape and dtype. When a dimension of an output has variable length, set the corresponding entry of shape to -1.\n\n        Return:\n            dict. Describes the epoch-level inputs this head depends on.\n        \"\"\"\n        return {}\n\n    def build(self, inputs, scope_name=\"\"):\n        \"\"\"Builds the computation graph of the task head.\n\n        Maps the static-graph Variables from the input sets described by inputs_attrs to static-graph Variable outputs conforming to outputs_attr.\n\n        Args:\n            inputs: dict. Maps the object names of inputs_attrs to computation-graph Variables; inputs contains at least the objects defined in inputs_attrs.\n        Return:\n           The graph Variables to output. They are added to the fetch_list, so their runtime values are available at every training/inference step and are passed to the postprocess method for user-defined handling.\n        \"\"\"\n        raise NotImplementedError()\n\n    def batch_postprocess(self, rt_outputs):\n        \"\"\"Batch/step-level postprocessing.\n\n        After each training or inference step, postprocesses the concrete values of this head's outputs for the current batch.\n        By default, the results are stored in the buffer self._results_buffer.\"\"\"\n        if isinstance(rt_outputs, dict):\n            keys = rt_outputs.keys()\n            vals = [rt_outputs[k] for k in keys]\n            lens = [len(v) for v in vals]\n            if len(set(lens)) == 1:\n                results = [dict(zip(*[keys, i])) for i in zip(*vals)]\n                self._results_buffer.extend(results)\n                return results\n            else:\n                print('WARNING: irregular output results; visualization failed.')\n                self._results_buffer.append(rt_outputs)\n        return None\n\n    def reset(self):\n        \"\"\"Clears this head's buffer (the results accumulated during training or inference).\"\"\"\n        self._results_buffer = []\n\n    def get_results(self):\n        \"\"\"Returns the results accumulated by this head so far.\"\"\"\n        return copy.deepcopy(self._results_buffer)\n        \n    def epoch_postprocess(self, post_inputs=None, output_dir=None):\n        \"\"\"Epoch-level postprocessing.\n\n        At the end of each training or inference epoch, postprocesses the accumulated per-sample results. By default, when output_dir is None the accumulated results are returned directly; when output_dir is given, the results are written into that directory, with the head's phase as the file name.\n\n        Args:\n            post_inputs: when the declared epoch_inputs_attrs is non-empty, this argument carries the contents of the corresponding input variables.\n            output_dir: the directory in which to save the accumulated results.\n        \"\"\"\n        if output_dir is not None:\n            if not os.path.exists(output_dir):\n                os.makedirs(output_dir)\n            with open(os.path.join(output_dir, self._phase), 'w') as writer:\n                for i in self._results_buffer:\n                    writer.write(json.dumps(i)+'\\n')\n        else:\n            return self._results_buffer\n\n"
  },
  {
    "path": "paddlepalm/head/cls.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport paddle.fluid as fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.head.base_head import Head\nimport numpy as np\nimport os\nimport json\n\n\nclass Classify(Head):\n    \"\"\"\n    classification\n    \"\"\"\n    def __init__(self, num_classes, input_dim, dropout_prob=0.0, \\\n                 param_initializer_range=0.02, phase='train'):\n\n        self._is_training = phase == 'train'\n        self._hidden_size = input_dim\n\n        self.num_classes = num_classes\n    \n        self._dropout_prob = dropout_prob if phase == 'train' else 0.0\n        self._param_initializer = fluid.initializer.TruncatedNormal(\n            scale=param_initializer_range)\n        self._preds = []\n        self._probs = []\n\n    @property\n    def inputs_attrs(self):\n        reader = {}\n        bb = {\"sentence_embedding\": [[-1, self._hidden_size], 'float32']}\n        if self._is_training:\n            reader[\"label_ids\"] = [[-1], 'int64']\n        return {'reader': reader, 'backbone': bb}\n\n    @property\n    def outputs_attrs(self):\n        if self._is_training:\n            return {'loss': [[1], 'float32']}\n        else:\n            return {'logits': [[-1, self.num_classes], 'float32'],\n                    'probs': [[-1, self.num_classes], 'float32']}\n            \n\n    def build(self, inputs, 
scope_name=''):\n        sent_emb = inputs['backbone']['sentence_embedding']\n        if self._is_training:\n            label_ids = inputs['reader']['label_ids']\n            # apply dropout during training so the classifier consumes the regularized features\n            sent_emb = fluid.layers.dropout(\n                x=sent_emb,\n                dropout_prob=self._dropout_prob,\n                dropout_implementation=\"upscale_in_train\")\n\n        logits = fluid.layers.fc(\n            input=sent_emb,\n            size=self.num_classes,\n            param_attr=fluid.ParamAttr(\n                name=scope_name+\"cls_out_w\",\n                initializer=self._param_initializer),\n            bias_attr=fluid.ParamAttr(\n                name=scope_name+\"cls_out_b\", initializer=fluid.initializer.Constant(0.)))\n        probs = fluid.layers.softmax(logits)\n        if self._is_training:\n            loss = fluid.layers.cross_entropy(\n                input=probs, label=label_ids)\n            loss = layers.mean(loss)\n            return {\"loss\": loss}\n        else:\n            return {\"logits\":logits,\n                    \"probs\":probs}\n\n    def batch_postprocess(self, rt_outputs):\n        if not self._is_training:\n            logits = rt_outputs['logits']\n            probs = rt_outputs['probs']\n            self._preds.extend(logits.tolist())\n            self._probs.extend(probs.tolist())\n\n\n    def epoch_postprocess(self, post_inputs, output_dir=None):\n        # no post_inputs are declared in epoch_inputs_attrs, so none are passed in here\n        if not self._is_training:\n            results = []\n            for i in range(len(self._preds)):\n                label = int(np.argmax(np.array(self._preds[i])))\n                result = {'index': i, 'label': label, 'logits': self._preds[i], 'probs': self._probs[i]}\n                results.append(result)\n            if output_dir is not None:\n                with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:\n                    for result in results:\n                        result = json.dumps(result)\n                        writer.write(result+'\\n')\n                print('Predictions saved at '+os.path.join(output_dir, 'predictions.json'))\n            return results\n"
  },
  {
    "path": "paddlepalm/head/match.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport paddle.fluid as fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.head.base_head import Head\nimport numpy as np\nimport os\nimport json\n\n\ndef computeHingeLoss(pos, neg, margin):\n    \"\"\"Pairwise hinge loss: max(0, margin - pos + neg).\"\"\"\n    loss_part1 = fluid.layers.elementwise_sub(\n        fluid.layers.fill_constant_batch_size_like(\n            input=pos, shape=[-1, 1], value=margin, dtype='float32'), pos)\n    loss_part2 = fluid.layers.elementwise_add(loss_part1, neg)\n    loss_part3 = fluid.layers.elementwise_max(\n        fluid.layers.fill_constant_batch_size_like(\n            input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'), loss_part2)\n    return loss_part3\n\n\nclass Match(Head):\n    '''\n    Matching head for sentence pair tasks.\n    '''\n\n    def __init__(self, num_classes, input_dim, dropout_prob=0.0, param_initializer_range=0.02, \\\n        learning_strategy='pointwise', margin=0.5, phase='train'):\n\n        \"\"\"\n        Args:\n            phase: train, eval, pred\n            learning_strategy: pointwise, pairwise\n            margin: margin of the hinge loss (pairwise strategy only)\n        \"\"\"\n\n        self._is_training = phase == 'train'\n        self._hidden_size = input_dim\n\n        self._num_classes = num_classes\n\n        self._dropout_prob = dropout_prob if phase == 'train' else 0.0\n        self._param_initializer = 
fluid.initializer.TruncatedNormal(\n            scale=param_initializer_range)\n        self._learning_strategy = learning_strategy\n        self._margin = margin\n\n        self._preds = []\n        self._preds_logits = []\n\n    @property\n    def inputs_attrs(self):\n        reader = {}\n        bb = {\"sentence_pair_embedding\": [[-1, self._hidden_size], 'float32']}\n        if self._is_training:\n            if self._learning_strategy == 'pointwise':\n                reader[\"label_ids\"] = [[-1], 'int64']\n            elif self._learning_strategy == 'pairwise':\n                bb[\"sentence_pair_embedding_neg\"] = [[-1, self._hidden_size], 'float32']\n\n        return {'reader': reader, 'backbone': bb}\n\n    @property\n    def outputs_attrs(self):\n        if self._is_training:\n            return {\"loss\": [[1], 'float32']}\n        else:\n            if self._learning_strategy == 'pairwise':\n                return {\"probs\": [[-1, 1], 'float32']}\n            else:\n                return {\"logits\": [[-1, self._num_classes], 'float32'],\n                        \"probs\": [[-1, self._num_classes], 'float32']}\n\n    def build(self, inputs, scope_name=\"\"):\n\n        # inputs\n        cls_feats = inputs[\"backbone\"][\"sentence_pair_embedding\"]\n        if self._is_training:\n            cls_feats = fluid.layers.dropout(\n                x=cls_feats,\n                dropout_prob=self._dropout_prob,\n                dropout_implementation=\"upscale_in_train\")\n            if self._learning_strategy == 'pairwise':\n                cls_feats_neg = inputs[\"backbone\"][\"sentence_pair_embedding_neg\"]\n                cls_feats_neg = fluid.layers.dropout(\n                    x=cls_feats_neg,\n                    dropout_prob=self._dropout_prob,\n                    dropout_implementation=\"upscale_in_train\")\n            elif self._learning_strategy == 'pointwise':\n                labels = inputs[\"reader\"][\"label_ids\"]\n\n      
  # loss\n        # for pointwise\n        if self._learning_strategy == 'pointwise':\n            logits = fluid.layers.fc(\n                input=cls_feats,\n                size=self._num_classes,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+\"cls_out_w\",\n                    initializer=self._param_initializer),\n                bias_attr=fluid.ParamAttr(\n                    name=scope_name+\"cls_out_b\",\n                    initializer=fluid.initializer.Constant(0.)))\n            probs = fluid.layers.softmax(logits)\n            if self._is_training:\n                ce_loss = fluid.layers.cross_entropy(\n                    input=probs, label=labels)\n                loss = fluid.layers.mean(x=ce_loss)\n                return {'loss': loss}\n            # for pred\n            else:\n                return {'logits': logits,\n                        'probs': probs}\n        # for pairwise\n        elif self._learning_strategy == 'pairwise':\n            pos_score = fluid.layers.fc(\n                input=cls_feats,\n                size=1,\n                act = \"sigmoid\",\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+\"cls_out_w_pr\",\n                    initializer=self._param_initializer),\n                bias_attr=fluid.ParamAttr(\n                    name=scope_name+\"cls_out_b_pr\",\n                    initializer=fluid.initializer.Constant(0.)))\n            pos_score = fluid.layers.reshape(x=pos_score, shape=[-1, 1], inplace=True)\n\n            if self._is_training:\n                neg_score = fluid.layers.fc(\n                    input=cls_feats_neg,\n                    size=1,\n                    act = \"sigmoid\",\n                    param_attr=fluid.ParamAttr(\n                        name=scope_name+\"cls_out_w_pr\",\n                        initializer=self._param_initializer),\n                    bias_attr=fluid.ParamAttr(\n                        
name=scope_name+\"cls_out_b_pr\",\n                        initializer=fluid.initializer.Constant(0.)))        \n                neg_score = fluid.layers.reshape(x=neg_score, shape=[-1, 1], inplace=True)\n        \n                loss = fluid.layers.mean(computeHingeLoss(pos_score, neg_score, self._margin))\n                return {'loss': loss}\n            # for pred\n            else:\n                return {'probs': pos_score}\n        \n    def batch_postprocess(self, rt_outputs):\n        if not self._is_training:\n            probs = []\n            logits = []\n            probs = rt_outputs['probs']\n            self._preds.extend(probs.tolist())\n            if self._learning_strategy == 'pointwise':\n                logits = rt_outputs['logits']\n                self._preds_logits.extend(logits.tolist())\n\n    def reset(self):\n        self._preds_logits = []\n        self._preds = []\n        \n    def epoch_postprocess(self, post_inputs, output_dir=None):\n        # there is no post_inputs needed and not declared in epoch_inputs_attrs, hence no elements exist in post_inputs\n        if not self._is_training:\n            results = []\n            for i in range(len(self._preds)):\n                if self._learning_strategy == 'pointwise':\n                    label = int(np.argmax(np.array(self._preds[i])))\n                    result = {'index': i, 'label': label, 'logits': self._preds_logits[i], 'probs': self._preds[i]}\n                elif self._learning_strategy == 'pairwise':\n                    result = {'index': i, 'probs': self._preds[i][0]}\n                results.append(result)\n            if output_dir is not None:\n                with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:\n                    for result in results:\n                        result = json.dumps(result, ensure_ascii=False)\n                        writer.write(result+'\\n')\n                print('Predictions saved at 
'+os.path.join(output_dir, 'predictions.json'))\n            return results\n"
  },
  {
    "path": "paddlepalm/head/mlm.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport paddle.fluid as fluid\nfrom paddlepalm.head.base_head import Head\nfrom paddle.fluid import layers\nimport numpy as np\nimport os\nimport json\nfrom paddlepalm.backbone.utils.transformer import pre_process_layer\n\nclass MaskLM(Head):\n    '''\n    mlm\n    '''\n    def __init__(self, input_dim, vocab_size, hidden_act, dropout_prob=0.0, \\\n                 param_initializer_range=0.02, phase='train'):\n        self._is_training = phase == 'train'\n        self._emb_size = input_dim\n        self._hidden_size = input_dim\n        self._dropout_prob = dropout_prob if phase == 'train' else 0.0\n        self._preds = []\n\n        self._vocab_size = vocab_size\n        self._hidden_act = hidden_act\n        self._initializer_range = param_initializer_range\n\n    @property\n    def inputs_attrs(self):\n        reader = {\n            \"mask_label\": [[-1], 'int64'],\n            \"mask_pos\": [[-1], 'int64'],\n            }\n        if not self._is_training:\n            del reader['mask_label']\n        bb = {\n            \"encoder_outputs\": [[-1, -1, self._hidden_size], 'float32'],\n            \"embedding_table\": [[-1, self._vocab_size, self._emb_size], 'float32']}\n        return {'reader': reader, 'backbone': bb}\n\n    @property\n    def outputs_attrs(self):\n        if self._is_training:\n            
return {\"loss\": [[1], 'float32']}\n        else:\n            return {\"logits\": [[-1], 'float32']}\n\n    def build(self, inputs, scope_name=\"\"):\n        mask_pos = inputs[\"reader\"][\"mask_pos\"]\n        \n        word_emb = inputs[\"backbone\"][\"embedding_table\"]\n        enc_out = inputs[\"backbone\"][\"encoder_outputs\"]\n\n        if self._is_training:\n            mask_label = inputs[\"reader\"][\"mask_label\"]\n            l1 = enc_out.shape[0] \n            l2 = enc_out.shape[1]\n            bxs = fluid.layers.fill_constant(shape=[1], value=l1*l2, dtype='int64')\n            max_position = bxs - 1\n            mask_pos = fluid.layers.elementwise_min(mask_pos, max_position)\n            mask_pos.stop_gradient = True\n\n        emb_size = word_emb.shape[-1]\n\n        _param_initializer = fluid.initializer.TruncatedNormal(\n            scale=self._initializer_range)\n\n        reshaped_emb_out = fluid.layers.reshape(\n            x=enc_out, shape=[-1, emb_size])\n\n        # extract masked tokens' feature\n        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)\n\n        # transform: fc\n        mask_trans_feat = fluid.layers.fc(\n            input=mask_feat,\n            size=emb_size,\n            act=self._hidden_act,\n            param_attr=fluid.ParamAttr(\n                name=scope_name+'mask_lm_trans_fc.w_0',\n                initializer=_param_initializer),\n                bias_attr=fluid.ParamAttr(name=scope_name+'mask_lm_trans_fc.b_0'))\n        # transform: layer norm\n        mask_trans_feat = pre_process_layer(\n            mask_trans_feat, 'n', name=scope_name+'mask_lm_trans')\n\n        mask_lm_out_bias_attr = fluid.ParamAttr(\n            name=scope_name+\"mask_lm_out_fc.b_0\",\n            initializer=fluid.initializer.Constant(value=0.0))\n\n        fc_out = fluid.layers.matmul(\n            x=mask_trans_feat,\n            y=word_emb,\n            transpose_y=True)\n        fc_out += 
fluid.layers.create_parameter(\n            shape=[self._vocab_size],\n            dtype='float32',\n            attr=mask_lm_out_bias_attr,\n            is_bias=True)\n\n        if self._is_training:\n            probs = fluid.layers.softmax(fc_out)\n            mask_lm_loss = fluid.layers.cross_entropy(\n                input=probs, label=mask_label)\n            loss = fluid.layers.mean(mask_lm_loss)\n            return {'loss': loss}\n        else:\n            return {'logits': fc_out}\n\n    def batch_postprocess(self, rt_outputs):\n        if not self._is_training:\n            logits = rt_outputs['logits']\n            preds = np.argmax(logits, -1)\n            self._preds.extend(preds.tolist())\n            return preds\n\n    def epoch_postprocess(self, post_inputs, output_dir=None):\n        # no post_inputs are needed (none are declared in epoch_inputs_attrs),\n        # so post_inputs holds no elements here\n        if not self._is_training:\n            results = []\n            for i in range(len(self._preds)):\n                result = {'index': i, 'word_id': self._preds[i]}\n                results.append(result)\n            if output_dir is not None:\n                with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:\n                    for result in results:\n                        result = json.dumps(result)\n                        writer.write(result+'\\n')\n                print('Predictions saved at '+os.path.join(output_dir, 'predictions.json'))\n            return results\n"
  },
  {
    "path": "paddlepalm/head/mrc.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport paddle.fluid as fluid\nfrom paddlepalm.head.base_head import Head\nimport collections\nimport numpy as np\nimport os\nimport math\nimport six\nimport paddlepalm.tokenizer.ernie_tokenizer as tokenization\nimport json\nimport io\n\nRawResult = collections.namedtuple(\"RawResult\",\n                                   [\"unique_id\", \"start_logits\", \"end_logits\"])\n\nclass MRC(Head):\n    \"\"\"\n    Machine Reading Comprehension\n    \"\"\"\n\n    def __init__(self, max_query_len, input_dim, pred_output_path=None, verbose=False, with_negative=False, do_lower_case=False, max_ans_len=None, null_score_diff_threshold=0.0, n_best_size=20, phase='train'):\n\n        self._is_training = phase == 'train'\n        self._hidden_size = input_dim\n        self._max_sequence_length = max_query_len\n\n        self._pred_results = []\n\n        self._pred_output_path = pred_output_path\n        self._max_answer_length = max_ans_len\n        self._null_score_diff_threshold = null_score_diff_threshold\n        self._n_best_size = n_best_size\n        self._verbose = verbose\n        self._with_negative = with_negative\n        self._do_lower_case = do_lower_case\n\n    @property\n    def inputs_attrs(self):\n        if self._is_training:\n            reader = 
{\"start_positions\": [[-1], 'int64'],\n                      \"end_positions\": [[-1], 'int64'],\n                      }\n        else:\n            reader = {'unique_ids': [[-1], 'int64']}\n        bb = {\"encoder_outputs\": [[-1, -1, self._hidden_size], 'float32']}\n        return {'reader': reader, 'backbone': bb}\n\n    @property\n    def epoch_inputs_attrs(self):\n        if not self._is_training:\n            from_reader = {'examples': None, 'features': None}\n            return {'reader': from_reader}\n\n    @property\n    def outputs_attrs(self):\n        if self._is_training:\n            return {'loss': [[1], 'float32']}\n        else:\n            return {'start_logits': [[-1, -1, 1], 'float32'],\n                    'end_logits': [[-1, -1, 1], 'float32'],\n                    'unique_ids': [[-1], 'int64']}\n\n    def build(self, inputs, scope_name=\"\"):\n        if self._is_training:\n            start_positions = inputs['reader']['start_positions']\n            end_positions = inputs['reader']['end_positions']\n            # max_position = inputs[\"reader\"][\"seqlen\"] - 1\n            # start_positions = fluid.layers.elementwise_min(start_positions, max_position)\n            # end_positions = fluid.layers.elementwise_min(end_positions, max_position)\n            start_positions.stop_gradient = True\n            end_positions.stop_gradient = True\n        else:\n            unique_id = inputs['reader']['unique_ids']\n\n            # helper op to keep the 'unique_ids' variable fetchable in the graph\n            helper_constant = fluid.layers.fill_constant(shape=[1], value=1, dtype='int64')\n            fluid.layers.elementwise_mul(unique_id, helper_constant)\n\n        enc_out = inputs['backbone']['encoder_outputs']\n        logits = fluid.layers.fc(\n            input=enc_out,\n            size=2,\n            num_flatten_dims=2,\n            param_attr=fluid.ParamAttr(\n                
name=scope_name+\"cls_squad_out_w\",\n                initializer=fluid.initializer.TruncatedNormal(scale=0.02)),\n            bias_attr=fluid.ParamAttr(\n                name=scope_name+\"cls_squad_out_b\", initializer=fluid.initializer.Constant(0.)))\n\n        logits = fluid.layers.transpose(x=logits, perm=[2, 0, 1])\n        start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)\n\n        def _compute_single_loss(logits, positions):\n            \"\"\"Compute start/end loss for the mrc model.\"\"\"\n            inputs = fluid.layers.softmax(logits)\n            loss = fluid.layers.cross_entropy(\n                input=inputs, label=positions)\n            loss = fluid.layers.mean(x=loss)\n            return loss\n\n        if self._is_training:\n            start_loss = _compute_single_loss(start_logits, start_positions)\n            end_loss = _compute_single_loss(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2.0\n            return {'loss': total_loss}\n        else:\n            return {'start_logits': start_logits,\n                    'end_logits': end_logits,\n                    'unique_ids': unique_id}\n\n    def batch_postprocess(self, rt_outputs):\n        \"\"\"this func will be called after each step(batch) of training/evaluating/predicting process.\"\"\"\n        if not self._is_training:\n            unique_ids = rt_outputs['unique_ids']\n            start_logits = rt_outputs['start_logits']\n            end_logits = rt_outputs['end_logits']\n            for idx in range(len(unique_ids)):\n                if unique_ids[idx] < 0:\n                    continue\n                if len(self._pred_results) % 1000 == 0:\n                    print(\"Predicting example: {}\".format(len(self._pred_results)))\n                uid = int(unique_ids[idx])\n\n                s = [float(x) for x in start_logits[idx].flat]\n                e = [float(x) for x in end_logits[idx].flat]\n      
          self._pred_results.append(\n                    RawResult(\n                        unique_id=uid,\n                        start_logits=s,\n                        end_logits=e))\n\n    def epoch_postprocess(self, post_inputs, output_dir=None):\n        \"\"\"(optional interface) this func will be called after evaluation/predicting process and each epoch during training process.\"\"\"\n\n        if not self._is_training:\n            if output_dir is not None:\n                examples = post_inputs['reader']['examples']\n                features = post_inputs['reader']['features']\n                if not os.path.exists(output_dir):\n                    os.makedirs(output_dir)\n                output_prediction_file = os.path.join(output_dir, \"predictions.json\")\n                output_nbest_file = os.path.join(output_dir, \"nbest_predictions.json\")\n                output_null_log_odds_file = os.path.join(output_dir, \"null_odds.json\")\n                _write_predictions(examples, features, self._pred_results,\n                                  self._n_best_size, self._max_answer_length,\n                                  self._do_lower_case, output_prediction_file,\n                                  output_nbest_file, output_null_log_odds_file,\n                                  self._with_negative,\n                                  self._null_score_diff_threshold, self._verbose)\n            return self._pred_results\n\n\ndef _write_predictions(all_examples, all_features, all_results, n_best_size,\n                      max_answer_length, do_lower_case, output_prediction_file,\n                      output_nbest_file, output_null_log_odds_file,\n                      with_negative, null_score_diff_threshold,\n                      verbose):\n    \"\"\"Write final predictions to the json file and log-odds of null if needed.\"\"\"\n    print(\"Writing predictions to: %s\" % (output_prediction_file))\n    print(\"Writing nbest to: %s\" % 
(output_nbest_file))\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\n            \"feature_index\", \"start_index\", \"end_index\", \"start_logit\",\n            \"end_logit\"\n        ])\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n        min_null_feature_index = 0  # the paragraph slice with min null score\n        null_start_logit = 0  # the start logit at the slice with min null score\n        null_end_logit = 0  # the end logit at the slice with min null score\n\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n            start_indexes = _get_best_indexes(result.start_logits, n_best_size)\n            end_indexes = _get_best_indexes(result.end_logits, n_best_size)\n            # if we could have irrelevant answers, get the min score of irrelevant\n            if with_negative:\n                feature_null_score = result.start_logits[0] + result.end_logits[\n                    0]\n                if feature_null_score < score_null:\n                    score_null = feature_null_score\n                    min_null_feature_index = feature_index\n                    null_start_logit = result.start_logits[0]\n                    null_end_logit = 
result.end_logits[0]\n            for start_index in start_indexes:\n                for end_index in end_indexes:\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. We throw out all\n                    # invalid predictions.\n                    if start_index >= len(feature.tokens):\n                        continue\n                    if end_index >= len(feature.tokens):\n                        continue\n                    if start_index not in feature.token_to_orig_map:\n                        continue\n                    if end_index not in feature.token_to_orig_map:\n                        continue\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_logit=result.start_logits[start_index],\n                            end_logit=result.end_logits[end_index]))\n\n        if with_negative:\n            prelim_predictions.append(\n                _PrelimPrediction(\n                    feature_index=min_null_feature_index,\n                    start_index=0,\n                    end_index=0,\n                    start_logit=null_start_logit,\n                    end_logit=null_end_logit))\n        prelim_predictions = sorted(\n            prelim_predictions,\n            key=lambda x: (x.start_logit + x.end_logit),\n            reverse=True)\n\n        _NbestPrediction = 
collections.namedtuple(  # pylint: disable=invalid-name\n            \"NbestPrediction\", [\"text\", \"start_logit\", \"end_logit\"])\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n            if pred.start_index > 0:  # this is a non-null prediction\n                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1\n                                                              )]\n                orig_doc_start = feature.token_to_orig_map[pred.start_index]\n                orig_doc_end = feature.token_to_orig_map[pred.end_index]\n                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end +\n                                                                 1)]\n                tok_text = \" \".join(tok_tokens)\n\n                # De-tokenize WordPieces that have been split off.\n                tok_text = tok_text.replace(\" ##\", \"\")\n                tok_text = tok_text.replace(\"##\", \"\")\n\n                # Clean whitespace\n                tok_text = tok_text.strip()\n                tok_text = \" \".join(tok_text.split())\n                orig_text = \" \".join(orig_tokens)\n\n                final_text = _get_final_text(tok_text, orig_text, do_lower_case,\n                                            verbose)\n                if final_text in seen_predictions:\n                    continue\n\n                seen_predictions[final_text] = True\n            else:\n                final_text = \"\"\n                seen_predictions[final_text] = True\n\n            nbest.append(\n                _NbestPrediction(\n                    text=final_text,\n                    start_logit=pred.start_logit,\n                    end_logit=pred.end_logit))\n\n        # if we didn't include the empty option in the n-best, include it\n        if with_negative:\n            
if \"\" not in seen_predictions:\n                nbest.append(\n                    _NbestPrediction(\n                        text=\"\",\n                        start_logit=null_start_logit,\n                        end_logit=null_end_logit))\n        # In very rare edge cases we could have no valid predictions. So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(\n                _NbestPrediction(\n                    text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        assert len(nbest) >= 1\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_logit + entry.end_logit)\n            if not best_non_null_entry:\n                if entry.text:\n                    best_non_null_entry = entry\n        if best_non_null_entry is None:\n            print(\"Warning: no non-null prediction was found in nbest\")\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text.encode('utf-8').decode('utf-8')\n            output[\"probability\"] = probs[i]\n            output[\"start_logit\"] = entry.start_logit\n            output[\"end_logit\"] = entry.end_logit\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n\n        if not with_negative:\n            all_predictions[example.qas_id] = nbest_json[0][\"text\"]\n        else:\n            # predict \"\" iff the null score - the score of best non-null > threshold\n            score_diff = score_null - best_non_null_entry.start_logit - (\n                best_non_null_entry.end_logit)\n            scores_diff_json[example.qas_id] = score_diff\n            if score_diff > null_score_diff_threshold:\n                all_predictions[example.qas_id] = \"\"\n            else:\n                
all_predictions[example.qas_id] = best_non_null_entry.text\n\n        all_nbest_json[example.qas_id] = nbest_json\n\n    with io.open(output_prediction_file, \"w\", encoding='utf-8') as writer:\n        writer.write(json.dumps(all_predictions, indent=4, ensure_ascii=False) + \"\\n\")\n\n    with io.open(output_nbest_file, \"w\", encoding='utf-8') as writer:\n        writer.write(json.dumps(all_nbest_json, indent=4, ensure_ascii=False) + \"\\n\")\n\n    if with_negative:\n        with io.open(output_null_log_odds_file, \"w\", encoding='utf-8') as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4, ensure_ascii=False) + \"\\n\")\n\n\ndef _get_final_text(pred_text, orig_text, do_lower_case, verbose):\n    \"\"\"Project the tokenized prediction back to the original text.\"\"\"\n\n    # When we created the data, we kept track of the alignment between original\n    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So\n    # now `orig_text` contains the span of our original text corresponding to the\n    # span that we predicted.\n    #\n    # However, `orig_text` may contain extra characters that we don't want in\n    # our prediction.\n    #\n    # For example, let's say:\n    #   pred_text = steve smith\n    #   orig_text = Steve Smith's\n    #\n    # We don't want to return `orig_text` because it contains the extra \"'s\".\n    #\n    # We don't want to return `pred_text` because it's already been normalized\n    # (the MRQA eval script also does punctuation stripping/lower casing but\n    # our tokenizer does additional normalization like stripping accent\n    # characters).\n    #\n    # What we really want to return is \"Steve Smith\".\n    #\n    # Therefore, we have to apply a semi-complicated alignment heuristic between\n    # `pred_text` and `orig_text` to get a character-to-character alignment. 
This\n    # can fail in certain cases in which case we just return `orig_text`.\n\n    def _strip_spaces(text):\n        ns_chars = []\n        ns_to_s_map = collections.OrderedDict()\n        for (i, c) in enumerate(text):\n            if c == \" \":\n                continue\n            ns_to_s_map[len(ns_chars)] = i\n            ns_chars.append(c)\n        ns_text = \"\".join(ns_chars)\n        return (ns_text, ns_to_s_map)\n\n    # We first tokenize `orig_text`, strip whitespace from the result\n    # and `pred_text`, and check if they are the same length. If they are\n    # NOT the same length, the heuristic has failed. If they are the same\n    # length, we assume the characters are one-to-one aligned.\n    tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case)\n\n    tok_text = \" \".join(tokenizer.tokenize(orig_text))\n\n    start_position = tok_text.find(pred_text)\n    if start_position == -1:\n        if verbose:\n            print(\"Unable to find text: '%s' in '%s'\" % (pred_text, orig_text))\n        return orig_text\n    end_position = start_position + len(pred_text) - 1\n\n    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)\n    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)\n\n    if len(orig_ns_text) != len(tok_ns_text):\n        if verbose:\n            print(\"Length not equal after stripping spaces: '%s' vs '%s'\" %\n                  (orig_ns_text, tok_ns_text))\n        return orig_text\n\n    # We then project the characters in `pred_text` back to `orig_text` using\n    # the character-to-character alignment.\n    tok_s_to_ns_map = {}\n    for (i, tok_index) in six.iteritems(tok_ns_to_s_map):\n        tok_s_to_ns_map[tok_index] = i\n\n    orig_start_position = None\n    if start_position in tok_s_to_ns_map:\n        ns_start_position = tok_s_to_ns_map[start_position]\n        if ns_start_position in orig_ns_to_s_map:\n            orig_start_position = orig_ns_to_s_map[ns_start_position]\n\n    if orig_start_position is None:\n        if verbose:\n            print(\"Couldn't map start position\")\n        return orig_text\n\n    orig_end_position = None\n    if end_position in tok_s_to_ns_map:\n        ns_end_position = tok_s_to_ns_map[end_position]\n        if ns_end_position in orig_ns_to_s_map:\n            orig_end_position = orig_ns_to_s_map[ns_end_position]\n\n    if orig_end_position is None:\n        if verbose:\n            print(\"Couldn't map end position\")\n        return orig_text\n\n    output_text = orig_text[orig_start_position:(orig_end_position + 1)]\n    return output_text\n\n\ndef _get_best_indexes(logits, n_best_size):\n    \"\"\"Get the n-best logits from a list.\"\"\"\n    index_and_score = sorted(\n        enumerate(logits), key=lambda x: x[1], reverse=True)\n\n    best_indexes = []\n    for i in range(len(index_and_score)):\n        if i >= n_best_size:\n            break\n        best_indexes.append(index_and_score[i][0])\n    return best_indexes\n\n\ndef _compute_softmax(scores):\n    \"\"\"Compute softmax probability over raw logits.\"\"\"\n    if not scores:\n        return []\n\n    max_score = None\n    for score in scores:\n        if max_score is None or score > max_score:\n            max_score = score\n\n    exp_scores = []\n    total_sum = 0.0\n    for score in scores:\n        x = math.exp(score - max_score)\n        exp_scores.append(x)\n        total_sum += x\n\n    probs = []\n    for score in exp_scores:\n        probs.append(score / total_sum)\n    return probs\n"
  },
  {
    "path": "paddlepalm/head/ner.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport paddle.fluid as fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.head.base_head import Head\nimport numpy as np\nimport os\nimport math\n\nclass SequenceLabel(Head):\n    '''\n    Sequence label\n    '''\n    def __init__(self, num_classes, input_dim, dropout_prob=0.0, learning_rate=1e-3,  \\\n                 param_initializer_range=0.02, phase='train'):\n        \n        \"\"\"  \n        Args:\n            phase: train, eval, pred\n            lang: en, ch, ...\n        \"\"\"\n\n        self._is_training = phase == 'train'\n        self._hidden_size = input_dim\n\n        self.num_classes = num_classes\n    \n        self._dropout_prob = dropout_prob if phase == 'train' else 0.0\n        self._param_initializer = fluid.initializer.TruncatedNormal(\n            scale=param_initializer_range)\n\n        self.learning_rate = learning_rate\n        self._preds = []\n\n\n    @property\n    def inputs_attrs(self):\n        reader = {}\n        bb = {\"encoder_outputs\": [[-1, -1, -1], 'float32']}\n        if self._is_training:\n            reader[\"label_ids\"] = [[-1, -1], 'int64']\n            reader[\"seq_lens\"] = [[-1], 'int64']\n        return {'reader': reader, 'backbone': bb}\n\n    @property\n    def outputs_attrs(self):\n        if self._is_training:\n            return {'loss': 
[[1], 'float32']}\n        else:\n            return {'logits': [[-1, -1, self.num_classes], 'float32']}\n\n    def build(self, inputs, scope_name=''):\n        token_emb = inputs['backbone']['encoder_outputs']\n        if self._is_training:\n            label_ids = inputs['reader']['label_ids']\n            seq_lens = inputs['reader']['seq_lens']\n\n        emission = fluid.layers.fc(\n            size=self.num_classes,\n            input=token_emb,\n            param_attr=fluid.ParamAttr(\n                initializer=self._param_initializer,\n                regularizer=fluid.regularizer.L2DecayRegularizer(\n                    regularization_coeff=1e-4)),\n            bias_attr=fluid.ParamAttr(\n                name=scope_name+\"cls_out_b\", initializer=fluid.initializer.Constant(0.)),\n            num_flatten_dims=2)\n\n        if self._is_training:\n\n            # compute loss\n            crf_cost = fluid.layers.linear_chain_crf(  \n                input=emission,\n                label=label_ids,\n                param_attr=fluid.ParamAttr(\n                    name=scope_name+'crfw', learning_rate=self.learning_rate),\n                length=seq_lens)\n\n            avg_cost = fluid.layers.mean(x=crf_cost)\n            crf_decode = fluid.layers.crf_decoding(\n                input=emission,\n                param_attr=fluid.ParamAttr(name=scope_name+'crfw'),\n                length=seq_lens)\n\n            (precision, recall, f1_score, num_infer_chunks, num_label_chunks,\n            num_correct_chunks) = fluid.layers.chunk_eval(\n                input=crf_decode,\n                label=label_ids,\n                chunk_scheme=\"IOB\",\n                num_chunk_types=int(math.ceil((self.num_classes - 1) / 2.0)),\n                seq_length=seq_lens)\n            chunk_evaluator = fluid.metrics.ChunkEvaluator()\n            chunk_evaluator.reset()\n\n            return {\"loss\": avg_cost}\n        else:\n            return {\"logits\": emission} \n\n    
def batch_postprocess(self, rt_outputs):\n        if not self._is_training:\n            # at predict time `build` returns the emission scores under the key 'logits'\n            logits = rt_outputs['logits']\n            preds = np.argmax(logits, -1)\n            self._preds.extend(preds.tolist())\n\n    def epoch_postprocess(self, post_inputs, output_dir=None):\n        # post_inputs is not needed here and is not declared in epoch_inputs_attrs, hence it is empty\n        if not self._is_training:\n            if output_dir is not None:\n                with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:\n                    for p in self._preds:\n                        writer.write(str(p)+'\\n')\n                print('Predictions saved at '+os.path.join(output_dir, 'predictions.json'))\n            return self._preds\n"
  },
  {
    "path": "paddlepalm/lr_sched/__init__.py",
    "content": "\nfrom .slanted_triangular_schedualer import TriangularSchedualer\nfrom .warmup_schedualer import WarmupSchedualer\n\n"
  },
  {
    "path": "paddlepalm/lr_sched/base_schedualer.py",
    "content": "\nclass Schedualer():\n\n    def __init__(self):\n        self._prog = None\n    \n    def _set_prog(self, prog):\n        self._prog = prog\n\n    def _build(self, learning_rate):\n        raise NotImplementedError()\n\n"
  },
  {
    "path": "paddlepalm/lr_sched/slanted_triangular_schedualer.py",
    "content": "from paddlepalm.lr_sched.base_schedualer import Schedualer\nfrom paddle import fluid\n\nclass TriangularSchedualer(Schedualer):\n\n    \"\"\" Implementation of Slanted Triangular learning rate schedual method, more details refer to https://arxiv.org/pdf/1801.06146.pdf . Apply linear warmup of learning rate from 0 to learning_rate until warmup_steps, and then decay to 0 linearly until num_train_steps.\"\"\"\n\n    def __init__(self, warmup_steps, num_train_steps):\n        \"\"\"Create a new TriangularSchedualer object.\n\n        Args:\n            warmup_steps: the learning rate will grow from 0 to max_learning_rate over `warmup_steps` steps.\n            num_train_steps: the number of train steps.\n\n        \"\"\"\n        Schedualer.__init__(self)\n        assert num_train_steps > warmup_steps > 0\n        self.warmup_steps = warmup_steps\n        self.num_train_steps = num_train_steps\n        \n\n    def _build(self, learning_rate):\n        with self._prog._lr_schedule_guard():\n            lr = fluid.layers.tensor.create_global_var(\n                shape=[1],\n                value=0.0,\n                dtype='float32',\n                persistable=True,\n                name=\"scheduled_learning_rate\")\n\n            global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()\n\n            with fluid.layers.control_flow.Switch() as switch:\n                with switch.case(global_step < self.warmup_steps):\n                    warmup_lr = learning_rate * (global_step / self.warmup_steps)\n                    fluid.layers.tensor.assign(warmup_lr, lr)\n                with switch.default():\n                    decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(\n                        learning_rate=learning_rate,\n                        decay_steps=self.num_train_steps,\n                        end_learning_rate=0.0,\n                        power=1.0,\n                        cycle=False)\n                  
  fluid.layers.tensor.assign(decayed_lr, lr)\n\n            return lr\n\n\n"
  },
  {
    "path": "paddlepalm/lr_sched/warmup_schedualer.py",
    "content": "\nfrom paddlepalm.lr_sched.base_schedualer import Schedualer\nimport paddle.fluid as fluid\n\ndef WarmupSchedualer(Schedualer):\n    \"\"\" Applies linear warmup of learning rate from 0 to learning_rate until warmup_steps, and then decay to 0 linearly until num_train_steps.\"\"\"\n\n    def __init__(self, warmup_steps):\n        schedualer.__init__(self)\n        self.warmup_steps = warmup_steps\n\n    def _build(self, learning_rate):\n\n        with self._prog._lr_schedule_guard():\n            lr = fluid.layers.tensor.create_global_var(\n                shape=[1],\n                value=0.0,\n                dtype='float32',\n                persistable=True,\n                name=\"scheduled_learning_rate\")\n\n            global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()\n\n            with fluid.layers.control_flow.Switch() as switch:\n                with switch.case(global_step < self.warmup_steps):\n                    warmup_lr = learning_rate * (global_step / self.warmup_steps)\n                    fluid.layers.tensor.assign(warmup_lr, lr)\n                with switch.default():\n                    fluid.layers.tensor.assign(learning_rate, lr)\n\n            return lr\n\n"
  },
  {
    "path": "paddlepalm/multihead_trainer.py",
    "content": "\nfrom paddle import fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.distribute import gpu_dev_count, cpu_dev_count, data_feeder, decode_fake\nfrom paddlepalm import Trainer\nfrom paddlepalm.utils import reader_helper\nimport numpy as np\nimport time\nimport sys\n\ndev_count = 1 if gpu_dev_count <= 1 else gpu_dev_count\nVERBOSE=False\n\n\nclass MultiHeadTrainer(Trainer):\n    \"\"\"\n    The core unit to start a multi-task training/predicting session. A MultiHeadTrainer is built based on several Trainers. Beyond the inheritance of Trainer, it additionally achieves model backbone reuse across tasks, trainer sampling for multi-task learning, and multi-head inference for effective evaluation and prediction. \n    \"\"\"\n    \n    def __init__(self, trainers):\n        \"\"\"Create a new multi_head_trainer.\n\n        Args:\n            trainers: a list of Trainer objects.\n\n        \"\"\"\n        Trainer.__init__(self, '')\n\n        self._trainers = trainers\n\n        name_maxlen = max([len(i.name) for i in self._trainers])\n        self._name_pads = {i.name: name_maxlen-len(i.name) for i in self._trainers}\n\n        self._train_init = False\n        self._dist_train_init = False\n        self._predict_init = False\n        self._feeded_var_names = None\n        self._cur_train_step = 0\n        self._target_vars = None\n\n        self._inputname_to_varname = {}\n        self._pred_input_name_list = []\n        self._pred_input_varname_list = []\n        self._pred_fetch_name_list = []\n        self._pred_fetch_var_list = []\n\n        self._exe = None\n\n        self._save_protocol = {\n            'input_names': 'self._pred_input_name_list',\n            'input_varnames': 'self._pred_input_varname_list',\n            'fetch_list': 'self._pred_fetch_name_list'}\n\n        self._check_save = lambda: False\n        for t in self._trainers:\n            t._set_multitask()\n\n    def build_forward(self):\n        \"\"\"\n        Build 
forward computation graph for training, which usually built from input layer to loss node.\n\n        Return:\n            - loss_var: a Variable object. The computational graph variable(node) of loss.\n        \"\"\"\n        head_dict = {}\n        backbone = self._trainers[0]._backbone\n        for i in self._trainers:\n            assert i._task_head is not None and i._backbone is not None, \"You should build forward for the {} task\".format(i._name)\n            assert i._backbone == backbone, \"The backbone for each task must be the same\"\n            head_dict[i._name] = i._task_head\n            \n        train_prog = fluid.Program()\n        train_init_prog = fluid.Program()\n        self._train_prog = train_prog\n        self._train_init_prog = train_init_prog\n\n        def get_loss(i):\n            head = head_dict[self._trainers[i].name]\n            self._trainers[i]._lock_prog = True\n            loss_var = self._trainers[i].build_forward(backbone, head)\n            self._trainers[i]._lock_prog = False\n            return loss_var\n      \n        task_fns = {i: lambda i=i: get_loss(i) for i in range(len(self._trainers))}\n\n        with fluid.program_guard(train_prog, train_init_prog):\n            task_id_var = fluid.data(name=\"__task_id\",shape=[1],dtype='int64')\n\n            loss_var = layers.switch_case(\n                branch_index=task_id_var,\n                branch_fns=task_fns\n            )\n        self._task_id_var = task_id_var\n        self._loss_var = loss_var\n        self._fetch_list = [loss_var.name]\n        if not self._multi_task:\n            self._init_exe_prog(for_train=True)\n        return loss_var\n        \n    def build_predict_forward(self):\n        head_dict = {}\n        backbone = self._trainers[0]._pred_backbone\n        for i in self._trainers:\n            assert i._pred_head is not None and i._pred_backbone is not None, \"You should build_predict_forward for the {} task\".format(i._name)\n            
assert i._pred_backbone == backbone, \"The backbone for each task must be the same\"\n            head_dict[i._name] = i._pred_head\n            \n        pred_prog = fluid.Program()\n        pred_init_prog = fluid.Program()\n        self._pred_prog = pred_prog\n        self._pred_init_prog = pred_init_prog\n\n        def get_loss(i):\n            head = head_dict[self._trainers[i].name]\n            self._trainers[i]._lock_prog = True\n            pred_vars = self._trainers[i].build_predict_forward(backbone, head)\n            self._trainers[i]._lock_prog = False\n            # return loss_var\n      \n        task_fns = {i: lambda i=i: get_loss(i) for i in range(len(self._trainers))}\n\n        with fluid.program_guard(pred_prog, pred_init_prog):\n            task_id_var = fluid.data(name=\"__task_id\",shape=[1],dtype='int64')\n\n            loss_var = layers.switch_case(\n                branch_index=task_id_var,\n                branch_fns=task_fns\n            )\n        if not self._multi_task:\n            self._init_exe_prog(for_train=False)\n\n    def merge_inference_readers(self, readers):\n\n        for r in readers:\n            assert r._phase == 'predict'\n\n        if isinstance(readers, list):\n            reader_dict = {k.name: v for k,v in zip(self._trainers, readers)}\n        elif isinstance(readers, dict):\n            reader_dict = readers\n        else:\n            raise ValueError()\n        \n        num_heads = len(self._trainers)\n        assert len(reader_dict) == num_heads, \"received number of readers is not consistent with trainers.\"\n\n        trainer_dict = {t.name: t for t in self._trainers}\n        task_name2id = {t.name: idx for idx, t in enumerate(self._trainers)}\n        self._task_name2id = task_name2id\n\n        self._finish_steps = {}\n        self._finish = {}\n        input_names = []\n        name_to_pos = []\n        joint_shape_and_dtypes = []\n        iterators = []\n        prefixes = []\n        mrs = []\n       
 net_inputs = []\n        global_steps = 0\n        for t in self._trainers:\n            assert t.name in reader_dict\n            assert reader_dict[t.name].num_epochs is None, \"{}: num_epochs is not None. \\\n                To run with multi-head mode, num_epochs of each Trainer should be set as None.\".format(t.name)\n            # print(num_epochs, t.mix_ratio, base_steps_pur_epoch)\n            self._finish_steps[t.name] = 9999999999\n            self._finish[t.name] = True\n\n            # t._set_task_id(self._task_id_var)\n            t.fit_reader(reader_dict[t.name], phase='predict')\n            net_inputs.append(t._pred_net_inputs)\n            prefixes.append(t.name)\n            iterators.append(t._raw_iterator_fn())\n            input_names.append(t._pred_input_names)\n            name_to_pos.append(t._pred_name_to_position)\n            joint_shape_and_dtypes.append(t._pred_shape_and_dtypes)\n\n        iterator_fn = reader_helper.create_multihead_inference_fn(iterators, prefixes, joint_shape_and_dtypes, \\\n            input_names, name_to_pos, task_name2id, dev_count=dev_count)\n        feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs)\n\n        if gpu_dev_count > 1:\n            raise NotImplementedError('currently only single-gpu mode has been supported running with multi-task mode.')\n            # distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase, is_multi=True, with_arg=True)\n        else:\n            distribute_feeder_fn = iterator_fn\n\n        self._predict_iterator_fn = distribute_feeder_fn\n        self._pred_feed_batch_process_fn = feed_batch_process_fn\n        return distribute_feeder_fn\n\n    def fit_readers_with_mixratio(self, readers, sampling_reference, num_epochs, phase='train'):\n        \"\"\"\n        Bind readers and loaded train/predict data to trainers. 
The `num_epochs` argument only \n            applies to the `sampling_reference` task (trainer); the num_epochs of the other tasks are inferred from \n            their `mix_ratio`.\n\n        Args:\n            readers: a dict or list of Reader objects. For dict case, each key is a trainer's name, and the mapped value is the reader object to bind to the trainer. For list case, each reader is bound to the trainer at the same position in the trainers list.\n            sampling_reference: a trainer name. The task (trainer) selected as the baseline for task sampling. \n            num_epochs: training epochs of the sampling_reference task (trainer). \n        \"\"\"\n        self._check_phase(phase)\n\n        if isinstance(readers, list):\n            reader_dict = {k.name: v for k,v in zip(self._trainers, readers)}\n        elif isinstance(readers, dict):\n            reader_dict = readers\n        else:\n            raise ValueError()\n        \n        num_heads = len(self._trainers)\n        assert len(reader_dict) == num_heads, \"received number of readers is not consistent with trainers.\"\n\n        trainer_dict = {t.name: t for t in self._trainers}\n        assert sampling_reference in trainer_dict\n\n        trainer_dict[sampling_reference]._set_task_id(self._task_id_var)\n        trainer_dict[sampling_reference].fit_reader(reader_dict[sampling_reference])\n        base_steps_pur_epoch = trainer_dict[sampling_reference]._steps_pur_epoch\n\n        self._finish_steps = {}\n        self._finish = {}\n        input_names = []\n        name_to_pos = []\n        joint_shape_and_dtypes = []\n        iterators = []\n        prefixes = []\n        mrs = []\n        net_inputs = []\n        global_steps = 0\n        for t in self._trainers:\n            assert t.name in reader_dict\n            assert reader_dict[t.name].num_epochs is None, \"{}: num_epochs is not None. 
\\\n                To run with multi-head mode, num_epochs of each Trainer should be set as None.\".format(t.name)\n            # print(num_epochs, t.mix_ratio, base_steps_pur_epoch)\n            max_train_steps = int(num_epochs * t.mix_ratio * base_steps_pur_epoch)\n            if not t._as_auxilary:\n                print('{}: expected train steps {}.'.format(t.name, max_train_steps))\n                sys.stdout.flush()\n                self._finish_steps[t.name] = max_train_steps\n                self._finish[t.name] = False\n            else:\n                self._finish_steps[t.name] = 9999999999\n                self._finish[t.name] = True\n\n            global_steps += max_train_steps\n            if t.name != sampling_reference:\n                t._set_task_id(self._task_id_var)\n                t.fit_reader(reader_dict[t.name])\n            net_inputs.append(t._net_inputs)\n            prefixes.append(t.name)\n            mrs.append(t.mix_ratio)\n            iterators.append(t._raw_iterator_fn())\n            input_names.append(t._input_names)\n            name_to_pos.append(t._name_to_position)\n            joint_shape_and_dtypes.append(t._shape_and_dtypes)\n\n        print('Estimated overall train steps {}.'.format(global_steps))\n        sys.stdout.flush()\n        self._overall_train_steps = global_steps\n\n        iterator_fn = reader_helper.create_multihead_iterator_fn(iterators, prefixes, joint_shape_and_dtypes, \\\n            mrs, input_names, name_to_pos, dev_count=dev_count)\n        feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs)\n\n        if gpu_dev_count > 1:\n            distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase, is_multi=True)\n        else:\n            distribute_feeder_fn = iterator_fn()\n\n        if phase == 'train':\n            self._train_reader = distribute_feeder_fn\n            self._feed_batch_process_fn = feed_batch_process_fn\n        elif phase == 
'predict':\n            self._predict_reader = distribute_feeder_fn\n            self._pred_feed_batch_process_fn = feed_batch_process_fn\n        return distribute_feeder_fn\n\n    def _check_finish(self, task_name, silent=False):\n        trainers = {t.name:t for t in self._trainers}\n        if trainers[task_name]._cur_train_step == self._finish_steps[task_name]:\n            if not silent:\n                print(task_name+' train finish!')\n                sys.stdout.flush()\n            self._finish[task_name]=True\n        flags = list(set(self._finish.values()))\n        return len(flags) == 1 and flags[0] == True\n        \n    def train(self, print_steps=5):\n        \"\"\"\n        start training.\n\n        Args:\n            print_steps: int. Logging frequency of training message, e.g., current step, loss and speed.\n        \"\"\"\n        iterator = self._train_reader\n        self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name)\n        for t in self._trainers:\n            t._dist_train_init = True\n            t._set_exe(self._exe)\n            t._set_dist_train(self._distribute_train_prog)\n            t._set_fetch_list(self._fetch_list)\n\n        time_begin = time.time()\n        for feed in iterator:\n            # batch, task_id = feed\n            rt_outputs, task_id = self.train_one_step(feed)\n\n            task_rt_outputs = {k[len(self._trainers[task_id].name+'.'):]: v for k,v in rt_outputs.items() if k.startswith(self._trainers[task_id].name+'.')}\n            self._trainers[task_id]._task_head.batch_postprocess(task_rt_outputs)\n            if print_steps > 0 and self._cur_train_step % print_steps == 0:\n                loss = rt_outputs[self._trainers[task_id].name+'.loss']\n                loss = np.mean(np.squeeze(loss)).tolist()\n\n                time_end = time.time()\n                time_cost = time_end - time_begin\n\n                print(\"global step: {}, {}: 
step {}/{} (epoch {}), loss: {:.3f}, speed: {:.2f} steps/s\".format(\n                       self._cur_train_step, ' '*self._name_pads[self._trainers[task_id].name]+self._trainers[task_id].name, \\\n                       (self._trainers[task_id]._cur_train_step-1) % self._trainers[task_id]._steps_pur_epoch + 1, \\\n                       self._trainers[task_id]._steps_pur_epoch, self._trainers[task_id]._cur_train_epoch, \\\n                       loss, print_steps / time_cost))\n                sys.stdout.flush()\n                time_begin = time.time()\n\n            self._check_save()\n            finish = self._check_finish(self._trainers[task_id].name)\n            if finish:\n                break\n\n    def train_one_step(self, batch):\n        if not self._dist_train_init:\n            self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name)\n            for t in self._trainers:\n                t._dist_train_init = True\n                t._set_exe(self._exe)\n                t._set_dist_train(self._distribute_train_prog)\n                t._set_fetch_list(self._fetch_list)\n            self._dist_train_init = True\n\n        if dev_count > 1:\n            assert isinstance(batch, tuple)\n            task_id = batch[0][0]['__task_id'][0]\n        else:\n            assert isinstance(batch, dict)\n            task_id = batch['__task_id'][0]\n            \n        rt_outputs = self._trainers[task_id].train_one_step(batch)\n\n        self._cur_train_step += 1\n        self._check_save()\n        return rt_outputs, task_id\n        \n    def predict_one_batch(self, task_name, batch):\n        if dev_count > 1:\n            raise NotImplementedError()\n\n        # batch = next(self._predict_iterator_fn(task_name))\n        t = self._trainers[self._task_name2id[task_name]]\n        # t._set_exe(self._exe)\n        t._set_dist_pred(self._trainers[self._task_name2id[task_name]]._pred_prog)\n        
rt_outputs = t.predict_one_batch(batch)\n        return rt_outputs\n\n    def predict(self, output_dir=None, print_steps=1000):\n        raise NotImplementedError()\n        # iterator = self._predict_iterator\n        # self._distribute_pred_prog = fluid.CompiledProgram(self._pred_prog).with_data_parallel()\n\n    @property\n    def overall_train_steps(self):\n        return self._overall_train_steps\n\n"
  },
  {
    "path": "paddlepalm/optimizer/__init__.py",
    "content": "\nfrom .adam import Adam\n"
  },
  {
    "path": "paddlepalm/optimizer/adam.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Optimization and learning rate scheduling.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport numpy as np\nimport paddle.fluid as fluid\nfrom paddlepalm.optimizer.base_optimizer import Optimizer\n\nclass Adam(Optimizer):\n\n    def __init__(self, loss_var, lr, lr_schedualer=None):\n\n        Optimizer.__init__(self, loss_var, lr, lr_schedualer=None)\n\n        self._loss = loss_var\n        self._lr = lr\n        self._lr_schedualer = lr_schedualer\n    \n    def _build(self, grad_clip=None):\n\n        if self._lr_schedualer is not None:\n            self._lr = self._lr_schedualer._build(self._lr)\n\n        optimizer = fluid.optimizer.Adam(learning_rate=self._lr)\n\n        if grad_clip is not None:\n            clip_norm_thres = grad_clip\n            # When using mixed precision training, scale the gradient clip threshold\n            # by loss_scaling\n            fluid.clip.set_gradient_clip(\n                clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres))\n\n        _, param_grads = optimizer.minimize(self._loss)\n        return param_grads\n\n    def get_cur_learning_rate(self):\n        return self._lr\n\n\n"
  },
  {
    "path": "paddlepalm/optimizer/base_optimizer.py",
    "content": "\nclass Optimizer(object):\n\n    def __init__(self, loss_var, lr, lr_schedualer=None):\n        self._prog = None\n        self._lr_schedualer = lr_schedualer\n\n    def _build(self, grad_clip=None):\n        raise NotImplementedError()\n\n    def _set_prog(self, prog, init_prog):\n        self._prog = prog\n        self._init_prog = prog\n        if self._lr_schedualer is not None:\n            self._lr_schedualer._set_prog(prog)\n\n    def get_cur_learning_rate(self):\n        pass\n\n\n"
  },
  {
    "path": "paddlepalm/reader/__init__.py",
    "content": "\nfrom .cls import ClassifyReader\nfrom .match import MatchReader\nfrom .seq_label import SequenceLabelReader\nfrom .mrc import MRCReader\nfrom .mlm import MaskLMReader\n"
  },
  {
    "path": "paddlepalm/reader/base_reader.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom copy import copy\nclass Reader(object):\n    \"\"\"interface of data reader.\"\"\"\n\n    def __init__(self, phase='train'):\n        \"\"\"该函数完成一个Reader的构造，至少需要包含一个phase参数。\n        注意：实现该构造函数时，必须保证对基类构造函数的调用，以创建必要的框架内建的成员变量。\n        Args:\n            phase: str类型。用于区分主干网络被调用时所处的运行阶段，目前支持训练阶段train和预测阶段predict\n            \"\"\"\n        \n        self._phase = phase\n        self._batch_size = None\n        self._num_epochs = 1\n        self._register = set()\n        self._registered_backbone = None\n\n    @classmethod\n    def create_register(self):\n        return set()\n        \n    def clone(self, phase='train'):\n        \"\"\"拷贝一个新的reader对象。\"\"\"\n        if phase == self._phase:\n            return copy(self)\n        else:\n            ret = copy(self)\n            ret._phase = phase\n            return ret\n\n    def require_attr(self, attr_name):\n        \"\"\"在注册器中新增一个需要产生的对象。\n\n        Args:\n            attr_name: 需要产出的对象的对象名，例如’segment_ids‘。\n            \"\"\"\n        self._register.add(attr_name)\n            \n    def register_with(self, backbone):\n        \"\"\"根据backbone对输入对象的依赖，在注册器中对每个依赖的输入对象进行注册。\n\n        Args:\n            backbone: 需要对接的主干网络。\n        \"\"\"\n        for attr in backbone.inputs_attr:\n            self.require_attr(attr)\n        
self._registered_backbone = backbone\n\n    def get_registered_backbone(self):\n        \"\"\"Return the backbone registered with this reader.\"\"\"\n        return self._registered_backbone\n\n    def _get_registed_attrs(self, attrs):\n        ret = {}\n        for i in self._register:\n            if i not in attrs:\n                raise NotImplementedError('output attr {} is not found in this reader.'.format(i))\n            ret[i] = attrs[i]\n        return ret\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='tsv', shuffle_train=True):\n        \"\"\"Load data from disk into the reader.\n\n        Note: implementations of this method must also set self._batch_size and self._num_epochs.\n\n        Args:\n            input_file: the dataset file path. The file format should be consistent with the `file_format` argument.\n            batch_size: the number of examples yielded per iteration. Note: when multiple GPUs exist in the environment, batch_size must be divisible by the number of GPU cards.\n            num_epochs: the number of traversals over the dataset. Default is None, which means one pass in single-task mode; in multi-task mode this argument is assigned automatically by the upper-level Trainer. This argument only works on the train phase.\n            file_format: the file format of the input file. Supported format: tsv. Default is tsv.\n            shuffle_train: whether to shuffle the examples of the training set. Default is True. This argument only works on the train phase.\n        \"\"\"\n        raise NotImplementedError()\n\n    @property\n    def outputs_attr(self):\n        \"\"\"Describe the attributes of the reader's outputs (the objects it yields), including each object's name, shape and data type. For a scalar object (e.g., str, int, float), set shape to an empty list []; when some dimension of an object has variable length, set the corresponding entry of shape to -1.\n        Note: when the mini-batch gradient descent strategy is used, regular input objects should carry a batch_size dimension (usually -1).\n        Return:\n            A dict describing the attributes of each output object. For example, for text classification and matching tasks, the yielded outputs may contain the following objects (downstream backbones and tasks can access them on demand):\n                {\"token_ids\": ([-1, max_len], 'int64'),\n                 \"input_ids\": ([-1, max_len], 'int64'),\n                 \"segment_ids\": ([-1, max_len], 'int64'),\n                 \"input_mask\": ([-1, max_len], 'float32'),\n                 \"label\": ([-1], 'int')}\n        \"\"\"\n        raise NotImplementedError()\n    \n    def _iterator(self):\n        \"\"\"The dataset traversal interface. Note that when the traversal reaches the end of the dataset, this interface should automatically reset the pointer, i.e., start a new traversal from the head of the dataset.\n        Yield:\n            A dict. The outputs of the current step, conforming to the description of outputs_attr.\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_epoch_outputs(self):\n        \"\"\"Return the outputs collected after each epoch of dataset traversal.\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def num_examples(self):\n        \"\"\"The number of examples in the dataset, i.e., the number of examples generated by the iterator per epoch. Note that when strategies such as sliding windows change the number of examples, this interface should return the actual number at runtime.\"\"\"\n        raise NotImplementedError()\n\n    @property\n    def num_epochs(self):\n        \"\"\"The number of traversals over the dataset.\"\"\"\n        return self._num_epochs\n"
  },
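The `_iterator` contract described in `base_reader.py` (automatic pointer reset when the traversal reaches the end of the dataset) can be sketched without any paddlepalm dependency. The toy dataset and the `token_ids` attribute name below are illustrative only; a real reader yields dicts conforming to `outputs_attr`.

```python
def cycling_iterator(examples):
    """Yield one dict per example forever, restarting from the head after each pass."""
    while True:
        for ex in examples:
            yield {"token_ids": ex}

it = cycling_iterator([[1, 2], [3, 4]])
first_pass = [next(it) for _ in range(2)]   # one full traversal
restarted = next(it)                        # pointer has reset to the head
```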
  {
    "path": "paddlepalm/reader/cls.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom paddlepalm.reader.base_reader import Reader\nfrom paddlepalm.reader.utils.reader4ernie import ClassifyReader as CLSReader\n\n\nclass ClassifyReader(Reader):\n    \"\"\"\n    The reader completes the loading and processing of text classification datasets. Supported file format: tsv. \n    \n    For the tsv format, the training dataset file should contain two columns, i.e., `label` and `text`, while the test set only requires the `text` column. For example,\n\n    ```\n    label [TAB] text\n    1 [TAB] Today is a good day.\n    0 [TAB] Such a terrible day!\n    1 [TAB] I feel lucky to meet you, dear.\n    1 [TAB] He likes sunshine and I like him :).\n    0 [TAB] JUST! GO! OUT!\n    ```\n\n    CAUTION: the first line of the file must be the header, and columns are split by tabs (\\\\t).\n\n    \"\"\"\n    \n    def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \\\n             lang='en', seed=None, do_lower_case=False, phase='train'):\n        \"\"\"Create a new Reader for loading and processing classification task data.\n\n        Args:\n          vocab_path: the vocab file path to do tokenization and token_ids generation.\n          max_len: the maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from the right.\n          tokenizer: string type. 
The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Available tokenizers: wordpiece.\n          lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). \n          seed: int type. The random seed to shuffle dataset. Default is None, which means no random seed is used.\n          do_lower_case: bool type. Whether to do lowercase on English text. Default is False. This argument only works on English text.\n          phase: the running phase of this reader. Supported phase: train, predict. Default is train.\n\n        Return:\n            a Reader object for the classification task.\n        \"\"\"\n\n        Reader.__init__(self, phase)\n\n        assert lang.lower() in ['en', 'cn', 'english', 'chinese'], \"supported language: en (English), cn (Chinese).\"\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n\n        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'\n\n        self._register.add('token_ids')\n        if phase == 'train':\n            self._register.add('label_ids')\n\n        self._is_training = phase == 'train'\n\n        cls_reader = CLSReader(vocab_path,\n                                max_seq_len=max_len,\n                                do_lower_case=do_lower_case,\n                                for_cn=for_cn,\n                                random_seed=seed)\n        self._reader = cls_reader\n\n        self._phase = phase\n        # self._batch_size = \n        # self._print_first_n = config.get('print_first_n', 0)\n\n\n    @property\n    def outputs_attr(self):\n        \"\"\"The contained output items (input features) of this reader.\"\"\"\n        attrs = {\"token_ids\": [[-1, -1], 'int64'],\n                \"position_ids\": [[-1, -1], 'int64'],\n                \"segment_ids\": [[-1, -1], 'int64'],\n                \"input_mask\": [[-1, -1, 1], 'float32'],\n                \"label_ids\": [[-1], 'int64'],\n                
\"task_ids\": [[-1, -1], 'int64']\n                }\n        return self._get_registed_attrs(attrs)\n\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='tsv', shuffle_train=True):\n        \"\"\"Load classification data into reader. \n\n        Args:\n            input_file: the dataset file path. The file format should be consistent with the `file_format` argument.\n            batch_size: the number of examples yielded at once. CAUTION: if multiple GPU devices exist in your environment (their count marked as dev_count), batch_size must be divisible by dev_count with no remainder!\n            num_epochs: the number of traversals over the input examples. Default is None, which means one pass for single-task learning; it is calculated automatically for multi-task learning. This argument only works on the train phase.\n            file_format: the file format of input file. Supported format: tsv. Default is tsv.\n            shuffle_train: whether to shuffle training dataset. Default is True. 
This argument only works on training phase.\n\n        \"\"\"\n        self._batch_size = batch_size\n        self._num_epochs = num_epochs\n        self._data_generator = self._reader.data_generator( \\\n            input_file, batch_size, num_epochs if self._phase == 'train' else 1, \\\n            shuffle=shuffle_train if self._phase == 'train' else False, \\\n            phase=self._phase)\n\n    def _iterator(self): \n\n        names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', \n            'label_ids', 'unique_ids']\n        for batch in self._data_generator():\n            outputs = {n: i for n,i in zip(names, batch)}\n            ret = {}\n            # TODO: move runtime shape check here\n            for attr in self.outputs_attr.keys():\n                ret[attr] = outputs[attr]\n            yield ret\n\n    def get_epoch_outputs(self):\n        return {'examples': self._reader.get_examples(self._phase),\n                'features': self._reader.get_features(self._phase)}\n\n    @property\n    def num_examples(self):\n        return self._reader.get_num_examples(phase=self._phase)\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n\n\n"
  },
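The tsv layout that `ClassifyReader` expects (a tab-separated header row with `label` and `text` columns) can be parsed with a few lines of plain Python. This is only a format sketch under the layout described in the class docstring; `parse_cls_tsv` is a hypothetical helper, and the real reader additionally tokenizes the text into token_ids.

```python
def parse_cls_tsv(lines):
    """Parse tsv lines with a header row into a list of dicts."""
    header = lines[0].rstrip("\n").split("\t")
    return [dict(zip(header, l.rstrip("\n").split("\t"))) for l in lines[1:]]

sample = ["label\ttext\n", "1\tToday is a good day.\n", "0\tJUST! GO! OUT!\n"]
rows = parse_cls_tsv(sample)
```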
  {
    "path": "paddlepalm/reader/match.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom paddlepalm.reader.base_reader import Reader\nfrom paddlepalm.reader.utils.reader4ernie import ClassifyReader as CLSReader\n\n\nclass MatchReader(Reader):\n    \"\"\"\n    The reader completes the loading and processing of matching-like task (e.g., query-query, question-answer, text similarity, natural language inference) datasets. Supported file format: tsv. \n    \n    For the pointwise learning strategy, the training dataset file should contain three fields, i.e., `text_a`, `text_b` and `label`. For pairwise learning, it should contain the three fields `text_a`, `text_b` and `text_b_neg`. For predicting, only `text_a` and `text_b` are required.\n    \n    A pointwise learning case is shown as follows:\n    ```\n    label [TAB] text_a [TAB] text_b\n    1 [TAB] Today is a good day. [TAB] what a nice day!\n    0 [TAB] Such a terrible day! [TAB] There is a dog.\n    1 [TAB] I feel lucky to meet you, dear. [TAB] You are my lucky, darling.\n    1 [TAB] He likes sunshine and I like him :). [TAB] I like him. He likes sunshine.\n    0 [TAB] JUST! GO! OUT! [TAB] Come in please.\n    ```\n    A pairwise learning case is shown as follows:\n    ```\n    text_a [TAB] text_b [TAB] text_b_neg\n    Today is a good day. [TAB] what a nice day! [TAB] terrible day!\n    Such a terrible day! [TAB] So terrible today! [TAB] There is a dog.\n    I feel lucky to meet you, dear. [TAB] You are my lucky, darling. [TAB] Buy some bananas, okay?\n    He likes sunshine and I like him :). [TAB] I like him. He likes sunshine. [TAB] He has a dog.\n    JUST! GO! OUT! [TAB] go out now! [TAB] Come in please.\n    ```\n\n    CAUTION: the header is required in each dataset file, and fields (columns) are split by tabs (\\\\t).\n\n    \"\"\"\n    \n    def __init__(self, vocab_path, max_len, tokenizer='wordpiece', lang='en', seed=None, \\\n        do_lower_case=False, learning_strategy='pointwise', phase='train', dev_count=1, print_prefix=''): \n        \"\"\"Create a new Reader for matching task data.\n\n        Args:\n          vocab_path: the vocab file path to do tokenization and token_ids generation.\n          max_len: the maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from the right.\n          tokenizer: string type. The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Available tokenizers: wordpiece.\n          lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). \n          seed: int type. The random seed to shuffle dataset. Default is None, which means no random seed is used.\n          do_lower_case: bool type. Whether to do lowercase on English text. Default is False. This argument only works on English text.\n          learning_strategy: string type. This only works for training phase. Available strategies: pointwise, pairwise.\n          phase: the running phase of this reader. Supported phase: train, predict. 
Default is train.\n\n        Return:\n            a Reader object for matching-like task.\n        \"\"\"\n\n        Reader.__init__(self, phase)\n\n        assert lang.lower() in ['en', 'cn', 'english', 'chinese'], \"supported language: en (English), cn (Chinese).\"\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n\n        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'\n\n        self._register.add('token_ids')\n        if phase == 'train':\n            if learning_strategy == 'pointwise':\n                self._register.add('label_ids')\n            if learning_strategy == 'pairwise':\n                self._register.add('token_ids_neg')\n                self._register.add('position_ids_neg')\n                self._register.add('segment_ids_neg')\n                self._register.add('input_mask_neg')\n                self._register.add('task_ids_neg')\n\n        self._is_training = phase == 'train'\n        self._learning_strategy = learning_strategy\n\n\n        match_reader = CLSReader(vocab_path,\n                                max_seq_len=max_len,\n                                do_lower_case=do_lower_case,\n                                for_cn=for_cn,\n                                random_seed=seed,\n                                learning_strategy = learning_strategy)\n            \n        self._reader = match_reader\n        self._dev_count = dev_count\n        self._phase = phase\n\n\n    @property\n    def outputs_attr(self):\n        attrs = {\"token_ids\": [[-1, -1], 'int64'],\n                \"position_ids\": [[-1, -1], 'int64'],\n                \"segment_ids\": [[-1, -1], 'int64'],\n                \"input_mask\": [[-1, -1, 1], 'float32'],\n                \"task_ids\": [[-1, -1], 'int64'],\n                \"label_ids\": [[-1], 'int64'],\n                \"token_ids_neg\": [[-1, -1], 'int64'],\n                \"position_ids_neg\": [[-1, -1], 'int64'],\n                \"segment_ids_neg\": 
[[-1, -1], 'int64'],\n                \"input_mask_neg\": [[-1, -1, 1], 'float32'],\n                \"task_ids_neg\": [[-1, -1], 'int64']\n                }\n        return self._get_registed_attrs(attrs)\n\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='tsv', shuffle_train=True):\n        \"\"\"Load matching data into reader. \n\n        Args:\n            input_file: the dataset file path. The file format should be consistent with the `file_format` argument.\n            batch_size: the number of examples yielded at once. CAUTION: if multiple GPU devices exist in your environment (their count marked as dev_count), batch_size must be divisible by dev_count with no remainder!\n            num_epochs: the number of traversals over the input examples. Default is None, which means one pass for single-task learning; it is calculated automatically for multi-task learning. This argument only works on the train phase.\n            file_format: the file format of input file. Supported format: tsv. Default is tsv.\n            shuffle_train: whether to shuffle training dataset. Default is True. 
This argument only works on training phase.\n\n        \"\"\"\n        self._batch_size = batch_size\n        self._num_epochs = num_epochs\n        self._data_generator = self._reader.data_generator( \\\n            input_file, batch_size, num_epochs if self._phase == 'train' else 1, \\\n            shuffle=shuffle_train if self._phase == 'train' else False, \\\n            phase=self._phase)\n\n    def _iterator(self): \n\n        \n        names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', 'label_ids', \\\n            'token_ids_neg', 'segment_ids_neg', 'position_ids_neg', 'task_ids_neg', 'input_mask_neg']\n        \n        if self._learning_strategy == 'pairwise':\n            names.remove('label_ids')\n\n\n        for batch in self._data_generator():\n            outputs = {n: i for n,i in zip(names, batch)}\n            ret = {}\n            # TODO: move runtime shape check here\n            for attr in self.outputs_attr.keys():\n                ret[attr] = outputs[attr]\n            yield ret\n\n    @property\n    def num_examples(self):\n        return self._reader.get_num_examples(phase=self._phase)\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n\n"
  },
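The `_register.add(...)` calls in `MatchReader.__init__` determine which outputs the reader exposes under each learning strategy. A minimal stand-alone sketch of that selection logic (plain Python; `registered_outputs` is a hypothetical helper name mirroring the calls above):

```python
def registered_outputs(phase, learning_strategy):
    """Mirror MatchReader's registration: base outputs plus strategy-specific ones."""
    register = {"token_ids"}
    if phase == "train":
        if learning_strategy == "pointwise":
            register.add("label_ids")          # pointwise needs a label
        if learning_strategy == "pairwise":
            register |= {"token_ids_neg", "position_ids_neg", "segment_ids_neg",
                         "input_mask_neg", "task_ids_neg"}  # negative sample inputs
    return register
```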
  {
    "path": "paddlepalm/reader/mlm.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom paddlepalm.reader.base_reader import Reader\nfrom paddlepalm.reader.utils.reader4ernie import MaskLMReader as MLMReader\nimport numpy as np\n\nclass MaskLMReader(Reader):\n    \n    def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \\\n             lang='en', seed=None, do_lower_case=False, phase='train', dev_count=1, print_prefix=''):\n        \"\"\"Create a new Reader for loading and processing masked language model (MLM) task data.\n\n        Args:\n            phase: the running phase of this reader. Supported phase: train, predict.\n        \"\"\"\n\n\n        Reader.__init__(self, phase)\n\n        assert lang.lower() in ['en', 'cn', 'english', 'chinese'], \"supported language: en (English), cn (Chinese).\"\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n\n        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'\n\n        self._register.add('mask_pos')\n        if phase == 'train':\n            self._register.add('mask_label')\n        self._is_training = phase == 'train'\n\n        mlm_reader = MLMReader(vocab_path,\n                                max_seq_len=max_len,\n                                do_lower_case=do_lower_case,\n                                for_cn=for_cn,\n                                random_seed=seed)\n        self._reader = mlm_reader\n\n        self._phase = phase\n        self._dev_count = dev_count\n\n\n    @property\n    def 
outputs_attr(self):\n        attrs = {\"token_ids\": [[-1, -1], 'int64'],\n                \"position_ids\": [[-1, -1], 'int64'],\n                \"segment_ids\": [[-1, -1], 'int64'],\n                \"input_mask\": [[-1, -1, 1], 'float32'],\n                \"task_ids\": [[-1, -1], 'int64'],\n                \"mask_label\": [[-1], 'int64'],\n                \"mask_pos\": [[-1], 'int64']\n                }\n\n        return self._get_registed_attrs(attrs)\n\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='csv', shuffle_train=True):\n        self._batch_size = batch_size\n        self._num_epochs = num_epochs\n        self._data_generator = self._reader.data_generator( \\\n            input_file, batch_size, num_epochs if self._phase == 'train' else 1, \\\n            shuffle=shuffle_train if self._phase == 'train' else False, \\\n            phase=self._phase)\n\n    def _iterator(self): \n\n        names = ['token_ids', 'position_ids', 'segment_ids', 'input_mask', \n            'task_ids', 'mask_label', 'mask_pos']\n        for batch in self._data_generator():\n            outputs = {n: i for n,i in zip(names, batch)}\n            ret = {}\n            # TODO: move runtime shape check here\n            for attr in self.outputs_attr.keys():\n                ret[attr] = outputs[attr]\n\n            yield ret\n\n    def get_epoch_outputs(self):\n        return {'examples': self._reader.get_examples(self._phase),\n                'features': self._reader.get_features(self._phase)}\n\n    @property\n    def num_examples(self):\n        return self._reader.get_num_examples(phase=self._phase)\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n\n"
  },
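The mask/replace/keep split used for BERT-style MLM pre-training (the thresholds appear in `paddlepalm/reader/utils/batching4bert.py`) can be summarized as a pure function: tokens drawing a probability <= 0.15 are selected, and of those roughly 80% are masked, 10% randomly replaced and 10% kept unchanged. `mlm_action` is a hypothetical name for illustration.

```python
def mlm_action(prob):
    """Map a uniform random draw to the MLM action for one token."""
    if prob > 0.15:
        return "skip"               # 85% of tokens are untouched
    elif 0.03 < prob <= 0.15:
        return "mask"               # 0.12 / 0.15 = 80% of selected tokens
    elif 0.015 < prob <= 0.03:
        return "replace"            # 0.015 / 0.15 = 10%: random token
    else:
        return "keep"               # remaining 10%: unchanged but predicted
```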
  {
    "path": "paddlepalm/reader/mrc.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom paddlepalm.reader.base_reader import Reader\nfrom paddlepalm.reader.utils.reader4ernie import MRCReader as MRCReader_t\nimport numpy as np\n\nclass MRCReader(Reader):\n    \"\"\"\n    The reader completes the loading and processing of SQuAD-like machine reading comprehension datasets. Supported file format: json. \n    \n    The outermost data structure of a dataset is a dictionary, which contains a dataset version field and a data field. In the data field, each example contains the title of the article and several paragraphs. Each paragraph contains a context and its corresponding question-answer pairs. Each q-a pair contains a question with a globally unique ID, as well as (several) answers. Each answer item contains the text of the answer itself and its starting position in the context. Note that the starting position is at the character level. 
In addition, for the test set, the answers field is not required.\n\n    A typical case is shown as follows.\n    {\"version\": \"1.0\",\n     \"data\": [\n         {\"title\": \"...\",\n          \"paragraphs\": [\n             {\"context\": \"...\",\n              \"qas\": [\n                 {\"question\": \"...\",\n                  \"id\": \"...\",\n                  \"answers\": [\n                     {\"text\": \"...\",\n                      \"answer_start\": ...},\n                     {...},\n                     ...\n                     ]\n                  },\n                  {...},\n                  ...\n                  ]\n              },\n              {...},\n              ...\n              ]\n          },\n          {...},\n          ...\n      ]\n     }\n    \n    \"\"\"\n\n    def __init__(self, vocab_path, max_len, max_query_len, doc_stride, \\\n                 tokenizer='wordpiece', lang='en', seed=None, do_lower_case=False, \\\n                 remove_noanswer=True, phase='train'):\n        \"\"\"Create a new Reader for loading and processing machine reading comprehension task data.\n\n        Args:\n          vocab_path: the vocab file path to do tokenization and token_ids generation.\n          max_len: the maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from the right.\n          max_query_len: the maximum length of query/question (after word segmentation).\n          doc_stride: the slice stride of context window.\n          tokenizer: string type. The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Available tokenizers: wordpiece.\n          lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). \n          seed: int type. The random seed to shuffle dataset. Default is None, which means no random seed is used.\n          do_lower_case: bool type. Whether to do lowercase on English text. Default is False. 
This argument only works on English text.\n          remove_noanswer: bool type. Whether to remove unanswerable questions and invalid answers.\n          phase: the running phase of this reader. Supported phase: train, predict. Default is train.\n\n        Return:\n            a Reader object for the machine reading comprehension task.\n        \"\"\"\n\n        Reader.__init__(self, phase)\n\n\n        assert lang.lower() in ['en', 'cn', 'english', 'chinese'], \"supported language: en (English), cn (Chinese).\"\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n\n        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'\n\n\n        self._register.add('token_ids')\n        if phase == 'train':\n            self._register.add(\"start_positions\")\n            self._register.add(\"end_positions\")\n        else:\n            self._register.add(\"unique_ids\")\n            \n\n        self._is_training = phase == 'train'\n\n        mrc_reader = MRCReader_t(vocab_path,\n                                 max_seq_len=max_len,\n                                 do_lower_case=do_lower_case,\n                                 tokenizer=tokenizer,\n                                 doc_stride=doc_stride,\n                                 remove_noanswer=remove_noanswer,\n                                 max_query_length=max_query_len,\n                                 for_cn=for_cn,\n                                 random_seed=seed)\n        self._reader = mrc_reader\n\n        self._phase = phase\n \n\n    @property\n    def outputs_attr(self):\n        attrs = {\"token_ids\": [[-1, -1], 'int64'],\n                \"position_ids\": [[-1, -1], 'int64'],\n                \"segment_ids\": [[-1, -1], 'int64'],\n                \"input_mask\": [[-1, -1, 1], 'float32'],\n                \"start_positions\": [[-1], 'int64'],\n                \"end_positions\": [[-1], 'int64'],\n                \"task_ids\": [[-1, -1], 'int64'],\n                \"unique_ids\": 
[[-1], 'int64']\n                }\n        return self._get_registed_attrs(attrs)\n\n    @property\n    def epoch_outputs_attr(self):\n        if not self._is_training:\n            return {\"examples\": None,\n                    \"features\": None}\n\n    def load_data(self, input_file, batch_size, num_epochs=None, file_format='json', shuffle_train=True):\n        \"\"\"Load mrc data into reader. \n\n        Args:\n            input_file: the dataset file path. The file format should be consistent with the `file_format` argument.\n            batch_size: the number of examples yielded at once. CAUTION: if multiple GPU devices exist in your environment (their count marked as dev_count), batch_size must be divisible by dev_count with no remainder!\n            num_epochs: the number of traversals over the input examples. Default is None, which means one pass for single-task learning; it is calculated automatically for multi-task learning. This argument only works on the train phase.\n            file_format: the file format of input file. Supported format: json. Default is json.\n            shuffle_train: whether to shuffle training dataset. Default is True. 
This argument only works on training phase.\n\n        \"\"\"\n        self._batch_size = batch_size\n        self._num_epochs = num_epochs\n        self._data_generator = self._reader.data_generator( \\\n            input_file, batch_size, num_epochs if self._phase == 'train' else 1, \\\n            shuffle=shuffle_train if self._phase == 'train' else False, \\\n            phase=self._phase)\n\n    def _iterator(self): \n\n        names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', \n            'start_positions', 'end_positions', 'unique_ids']\n        \n        if self._is_training:\n            names.remove('unique_ids')\n        \n        for batch in self._data_generator():\n            outputs = {n: i for n,i in zip(names, batch)}\n            ret = {}\n            # TODO: move runtime shape check here\n            for attr in self.outputs_attr.keys():\n                ret[attr] = outputs[attr]\n            if not self._is_training:\n                assert 'unique_ids' in ret, ret\n            yield ret\n    \n\n    def get_epoch_outputs(self):\n\n        return {'examples': self._reader.get_examples(self._phase),\n                'features': self._reader.get_features(self._phase)}\n\n    @property\n    def num_examples(self):\n        return self._reader.get_num_examples(phase=self._phase)\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n\n"
  },
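Walking the SQuAD-like structure that `MRCReader` consumes can be sketched in a few lines of plain Python. `collect_answers` is a hypothetical helper for illustration; note that `answer_start` is a character-level offset into the context, so the answer text can be recovered by slicing.

```python
def collect_answers(dataset):
    """Traverse data -> paragraphs -> qas -> answers and slice each answer out of its context."""
    pairs = []
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            ctx = para["context"]
            for qa in para["qas"]:
                for ans in qa.get("answers", []):   # test sets may omit answers
                    start = ans["answer_start"]
                    pairs.append((qa["id"], ctx[start:start + len(ans["text"])]))
    return pairs

toy = {"version": "1.0", "data": [{"title": "t", "paragraphs": [
    {"context": "Paris is the capital of France.",
     "qas": [{"question": "What is the capital of France?", "id": "q1",
              "answers": [{"text": "Paris", "answer_start": 0}]}]}]}]}
```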
  {
    "path": "paddlepalm/reader/seq_label.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom paddlepalm.reader.base_reader import Reader\nfrom paddlepalm.reader.utils.reader4ernie import SequenceLabelReader as SLReader\n\nclass SequenceLabelReader(Reader):\n    \"\"\"\n    The reader completes the loading and processing of sequence labeling type task (e.g, pos tagging, named entity recognition) dataset. Supported file format: tsv. 
\n    \"\"\"\n    \n    def __init__(self, vocab_path, max_len, label_map_config, tokenizer='wordpiece', \\\n             lang='en', seed=None, do_lower_case=False, phase='train', dev_count=1, print_prefix=''):\n        \"\"\"Create a new Reader for loading and processing sequence labeling task data.\n\n        Args:\n            phase: the running phase of this reader. Supported phase: train, predict.\n            lang: the language of dataset. Supported language: en (English), cn (Chinese).\n        \"\"\"\n        \n        Reader.__init__(self, phase)\n\n        assert lang.lower() in ['en', 'cn', 'english', 'chinese'], \"supported language: en (English), cn (Chinese).\"\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n\n        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'\n\n        self._register.add('token_ids')\n        self._register.add('seq_lens')\n        if phase == 'train':\n            self._register.add('label_ids')\n\n        self._is_training = phase == 'train'\n\n        ner_reader = SLReader(vocab_path,\n                                max_seq_len=max_len,\n                                do_lower_case=do_lower_case,\n                                for_cn=for_cn,\n                                random_seed=seed,\n                                label_map_config=label_map_config)\n        self._reader = ner_reader\n        self._phase = phase\n        self._dev_count = dev_count\n\n \n    @property\n    def outputs_attr(self):\n        attrs = {\"token_ids\": [[-1, -1], 'int64'],\n                \"position_ids\": [[-1, -1], 'int64'],\n                \"segment_ids\": [[-1, -1], 'int64'],\n                \"task_ids\": [[-1, -1], 'int64'],\n                \"input_mask\": [[-1, -1, 1], 'float32'],\n                \"seq_lens\": [[-1], 'int64'],\n                \"label_ids\": [[-1, -1], 'int64']}\n        return self._get_registed_attrs(attrs)\n\n\n    def load_data(self, input_file, batch_size, num_epochs=None, \\\n                  file_format='tsv', shuffle_train=True):\n        \"\"\"Load sequence labeling data into reader. 
\n\n        Args:\n            input_file: the dataset file path. The file format should be consistent with the `file_format` argument.\n            batch_size: the number of examples yielded at once. CAUTION: if multiple GPU devices exist in your environment (their count marked as dev_count), batch_size must be divisible by dev_count with no remainder!\n            num_epochs: the number of traversals over the input examples. Default is None, which means one pass for single-task learning; it is calculated automatically for multi-task learning. This argument only works on the train phase.\n            file_format: the file format of input file. Supported format: tsv. Default is tsv.\n            shuffle_train: whether to shuffle training dataset. Default is True. This argument only works on training phase.\n\n        \"\"\"\n        self._batch_size = batch_size\n        self._num_epochs = num_epochs\n        self._data_generator = self._reader.data_generator( \\\n            input_file, batch_size, num_epochs if self._phase == 'train' else 1, \\\n            shuffle=shuffle_train if self._phase == 'train' else False, \\\n            phase=self._phase)\n\n    def _iterator(self): \n\n        names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', \n            'label_ids', 'seq_lens']\n        for batch in self._data_generator():\n            outputs = {n: i for n,i in zip(names, batch)}\n            ret = {}\n            # TODO: move runtime shape check here\n            for attr in self.outputs_attr.keys():\n                ret[attr] = outputs[attr]\n            yield ret\n\n    def get_epoch_outputs(self):\n        return {'examples': self._reader.get_examples(self._phase),\n                'features': self._reader.get_features(self._phase)}\n\n    @property\n    def num_examples(self):\n        return self._reader.get_num_examples(phase=self._phase)\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n"
  },
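All readers above share the same `_iterator` pattern: zip a positional batch (a list of numpy arrays from the underlying data generator) with a list of names, then keep only the attributes the task registered. A minimal stand-alone sketch with toy lists standing in for the arrays:

```python
names = ["token_ids", "segment_ids", "position_ids"]
batch = [[1, 2], [0, 0], [0, 1]]            # toy stand-ins for numpy arrays
outputs = {n: b for n, b in zip(names, batch)}

registered = {"token_ids"}                   # what the task head asked for
ret = {attr: outputs[attr] for attr in registered}
```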
  {
    "path": "paddlepalm/reader/utils/__init__.py",
    "content": ""
  },
  {
    "path": "paddlepalm/reader/utils/batching4bert.py",
"content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Mask, padding and batching.\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nimport numpy as np\n\n\ndef mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):\n    \"\"\"\n    Apply masking to batch_tokens and return out, mask_label and mask_pos.\n    Note: mask_pos refers to positions in batch_tokens after padding.\n    \"\"\"\n    max_len = max([len(sent) for sent in batch_tokens])\n    mask_label = []\n    mask_pos = []\n    prob_mask = np.random.rand(total_token_num)\n    # Note: the first token is [CLS], so [low=1]\n    replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)\n    pre_sent_len = 0\n    prob_index = 0\n    for sent_index, sent in enumerate(batch_tokens):\n        mask_flag = False\n        prob_index += pre_sent_len\n        for token_index, token in enumerate(sent):\n            prob = prob_mask[prob_index + token_index]\n            if prob > 0.15:\n                continue\n            elif 0.03 < prob <= 0.15:\n                # mask\n                if token != SEP and token != CLS:\n                    mask_label.append(sent[token_index])\n                    sent[token_index] = MASK\n                    mask_flag = True\n                    mask_pos.append(sent_index * max_len + 
token_index)\n            elif 0.015 < prob <= 0.03:\n                # random replace\n                if token != SEP and token != CLS:\n                    mask_label.append(sent[token_index])\n                    sent[token_index] = replace_ids[prob_index + token_index]\n                    mask_flag = True\n                    mask_pos.append(sent_index * max_len + token_index)\n            else:\n                # keep the original token\n                if token != SEP and token != CLS:\n                    mask_label.append(sent[token_index])\n                    mask_pos.append(sent_index * max_len + token_index)\n        pre_sent_len = len(sent)\n        # ensure at least mask one word in a sentence\n        while not mask_flag:\n            token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))\n            if sent[token_index] != SEP and sent[token_index] != CLS:\n                mask_label.append(sent[token_index])\n                sent[token_index] = MASK\n                mask_flag = True\n                mask_pos.append(sent_index * max_len + token_index)\n    mask_label = np.array(mask_label).astype(\"int64\").reshape([-1])\n    mask_pos = np.array(mask_pos).astype(\"int64\").reshape([-1])\n    return batch_tokens, mask_label, mask_pos\n\n\ndef prepare_batch_data(insts,\n                       total_token_num,\n                       max_len=None,\n                       voc_size=0,\n                       pad_id=None,\n                       cls_id=None,\n                       sep_id=None,\n                       mask_id=None,\n                       return_input_mask=True,\n                       return_max_len=True,\n                       return_num_token=False):\n    \"\"\"\n    1. generate Tensor of data\n    2. generate Tensor of position\n    3. 
generate self attention mask, [shape: batch_size *  max_len * max_len]\n    \"\"\"\n    batch_src_ids = [inst[0] for inst in insts]\n    batch_sent_ids = [inst[1] for inst in insts]\n    batch_pos_ids = [inst[2] for inst in insts]\n    labels_list = []\n    # compatible with mrqa, whose example includes start/end positions, \n    # or unique id\n    for i in range(3, len(insts[0]), 1):\n        labels = [inst[i] for inst in insts]\n        labels = np.array(labels).astype(\"int64\").reshape([-1])\n        labels_list.append(labels)\n    # First step: do mask without padding\n    if mask_id >= 0:\n        out, mask_label, mask_pos = mask(\n            batch_src_ids,\n            total_token_num,\n            vocab_size=voc_size,\n            CLS=cls_id,\n            SEP=sep_id,\n            MASK=mask_id)\n    else:\n        out = batch_src_ids\n    # Second step: padding\n    src_id, self_input_mask = pad_batch_data(\n        out, \n        max_len=max_len,\n        pad_idx=pad_id, return_input_mask=True)\n    pos_id = pad_batch_data(\n        batch_pos_ids,\n        max_len=max_len,\n        pad_idx=pad_id,\n        return_pos=False,\n        return_input_mask=False)\n    sent_id = pad_batch_data(\n        batch_sent_ids,\n        max_len=max_len,\n        pad_idx=pad_id,\n        return_pos=False,\n        return_input_mask=False)\n    if mask_id >= 0:\n        return_list = [\n            src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos\n        ] + labels_list\n    else:\n        return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list\n    return return_list if len(return_list) > 1 else return_list[0]\n\n\ndef pad_batch_data(insts,\n                   max_len=None,\n                   pad_idx=0,\n                   return_pos=False,\n                   return_input_mask=False,\n                   return_max_len=False,\n                   return_num_token=False):\n    \"\"\"\n    Pad the instances to the max sequence length in batch, 
and generate the\n    corresponding position data and input mask.\n    \"\"\"\n    return_list = []\n    if max_len is None:\n        max_len = max(len(inst) for inst in insts)\n    # Any token included in dict can be used to pad, since the paddings' loss\n    # will be masked out by weights and make no effect on parameter gradients.\n    inst_data = np.array([\n        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts\n    ])\n    return_list += [inst_data.astype(\"int64\").reshape([-1, max_len])]\n    # position data\n    if return_pos:\n        inst_pos = np.array([\n            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))\n            for inst in insts\n        ])\n        return_list += [inst_pos.astype(\"int64\").reshape([-1, max_len])]\n    if return_input_mask:\n        # This is used to avoid attention on paddings.\n        input_mask_data = np.array([[1] * len(inst) + [0] *\n                                    (max_len - len(inst)) for inst in insts])\n        input_mask_data = np.expand_dims(input_mask_data, axis=-1)\n        return_list += [input_mask_data.astype(\"float32\")]\n    if return_max_len:\n        return_list += [max_len]\n    if return_num_token:\n        num_token = 0\n        for inst in insts:\n            num_token += len(inst)\n        return_list += [num_token]\n    return return_list if len(return_list) > 1 else return_list[0]\n\n\nif __name__ == \"__main__\":\n    pass\n\n\n"
  },
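The `mask()` routine above implements the BERT masked-LM recipe with a single uniform draw per token: tokens with prob > 0.15 are left untouched, and the selected 15% are split 80/10/10 between `[MASK]`, a random vocabulary id, and the unchanged original (bands 0.03–0.15, 0.015–0.03, and ≤ 0.015). A minimal sketch of just that banding, one token at a time (the names `MASK_ID` and `apply_mlm_bands` are illustrative, not from the library; the real code additionally skips `[CLS]`/`[SEP]` and records flattened `sent_index * max_len + token_index` positions):

```python
import numpy as np

MASK_ID = 3  # illustrative [MASK] id, matching the function's default

def apply_mlm_bands(token_id, prob, vocab_size, rng):
    """Return (new_token_id, is_prediction_target) for one token."""
    if prob > 0.15:
        return token_id, False                          # untouched, not predicted
    if prob > 0.03:
        return MASK_ID, True                            # 80% of targets -> [MASK]
    if prob > 0.015:
        return int(rng.integers(1, vocab_size)), True   # 10% -> random id (low=1 skips [CLS])
    return token_id, True                               # 10% -> keep original, still predicted

rng = np.random.default_rng(0)
sent = [101, 7, 8, 9, 102]
probs = [0.5, 0.10, 0.02, 0.001, 0.9]   # fixed draws, for illustration
out = [apply_mlm_bands(t, p, vocab_size=1000, rng=rng) for t, p in zip(sent, probs)]
```

Only tokens with `is_prediction_target=True` contribute to `mask_label`/`mask_pos` and hence to the MLM loss.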
  {
    "path": "paddlepalm/reader/utils/batching4ernie.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Mask, padding and batching.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport numpy as np\n\nfrom six.moves import xrange\n\n\ndef mask(batch_tokens,\n         seg_labels,\n         mask_word_tags,\n         total_token_num,\n         vocab_size,\n         CLS=1,\n         SEP=2,\n         MASK=3):\n    \"\"\"\n    Add mask for batch_tokens, return out, mask_label, mask_pos;\n    Note: mask_pos responding the batch_tokens after padded;\n    \"\"\"\n    max_len = max([len(sent) for sent in batch_tokens])\n    mask_label = []\n    mask_pos = []\n    prob_mask = np.random.rand(total_token_num)\n    # Note: the first token is [CLS], so [low=1]\n    replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)\n    pre_sent_len = 0\n    prob_index = 0\n    for sent_index, sent in enumerate(batch_tokens):\n        mask_flag = False\n        mask_word = mask_word_tags[sent_index]\n        prob_index += pre_sent_len\n        if mask_word:\n            beg = 0\n            for token_index, token in enumerate(sent):\n                seg_label = seg_labels[sent_index][token_index]\n                if seg_label == 1:\n                    continue\n                if beg == 0:\n                    if seg_label != -1:\n       
                 beg = token_index\n                    continue\n\n                prob = prob_mask[prob_index + beg]\n                if prob > 0.15:\n                    pass\n                else:\n                    for index in xrange(beg, token_index):\n                        prob = prob_mask[prob_index + index]\n                        base_prob = 1.0\n                        if index == beg:\n                            base_prob = 0.15\n                        if base_prob * 0.2 < prob <= base_prob:\n                            mask_label.append(sent[index])\n                            sent[index] = MASK\n                            mask_flag = True\n                            mask_pos.append(sent_index * max_len + index)\n                        elif base_prob * 0.1 < prob <= base_prob * 0.2:\n                            mask_label.append(sent[index])\n                            sent[index] = replace_ids[prob_index + index]\n                            mask_flag = True\n                            mask_pos.append(sent_index * max_len + index)\n                        else:\n                            mask_label.append(sent[index])\n                            mask_pos.append(sent_index * max_len + index)\n\n                if seg_label == -1:\n                    beg = 0\n                else:\n                    beg = token_index\n        else:\n            for token_index, token in enumerate(sent):\n                prob = prob_mask[prob_index + token_index]\n                if prob > 0.15:\n                    continue\n                elif 0.03 < prob <= 0.15:\n                    # mask\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        sent[token_index] = MASK\n                        mask_flag = True\n                        mask_pos.append(sent_index * max_len + token_index)\n                elif 0.015 < prob <= 0.03:\n                    # random 
replace\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        sent[token_index] = replace_ids[prob_index +\n                                                        token_index]\n                        mask_flag = True\n                        mask_pos.append(sent_index * max_len + token_index)\n                else:\n                    # keep the original token\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        mask_pos.append(sent_index * max_len + token_index)\n\n        pre_sent_len = len(sent)\n\n    mask_label = np.array(mask_label).astype(\"int64\").reshape([-1])\n    mask_pos = np.array(mask_pos).astype(\"int64\").reshape([-1])\n    return batch_tokens, mask_label, mask_pos\n\n\ndef pad_batch_data(insts,\n                   pad_idx=0,\n                   return_pos=False,\n                   return_input_mask=False,\n                   return_max_len=False,\n                   return_num_token=False,\n                   return_seq_lens=False):\n    \"\"\"\n    Pad the instances to the max sequence length in batch, and generate the\n    corresponding position data and attention bias.\n    \"\"\"\n    return_list = []\n    max_len = max(len(inst) for inst in insts)\n    # Any token included in dict can be used to pad, since the paddings' loss\n    # will be masked out by weights and make no effect on parameter gradients.\n\n    inst_data = np.array(\n        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])\n    return_list += [inst_data.astype(\"int64\").reshape([-1, max_len])]\n\n    # position data\n    if return_pos:\n        inst_pos = np.array([\n            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))\n            for inst in insts\n        ])\n\n        return_list += [inst_pos.astype(\"int64\").reshape([-1, max_len])]\n\n    if 
return_input_mask:\n        # This is used to avoid attention on paddings.\n        input_mask_data = np.array([[1] * len(inst) + [0] *\n                                    (max_len - len(inst)) for inst in insts])\n        input_mask_data = np.expand_dims(input_mask_data, axis=-1)\n        return_list += [input_mask_data.astype(\"float32\")]\n\n    if return_max_len:\n        return_list += [max_len]\n\n    if return_num_token:\n        num_token = 0\n        for inst in insts:\n            num_token += len(inst)\n        return_list += [num_token]\n\n    if return_seq_lens:\n        seq_lens = np.array([len(inst) for inst in insts])\n        return_list += [seq_lens.astype(\"int64\").reshape([-1])]\n\n    return return_list if len(return_list) > 1 else return_list[0]\n\n\nif __name__ == \"__main__\":\n\n    pass\n"
  },
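Both batching modules share the same `pad_batch_data()` shape contract: padded ids come back as int64 `[batch, max_len]`, the input mask as float32 `[batch, max_len, 1]`, and (in the ERNIE variant, via `return_seq_lens`) sequence lengths as int64 `[batch]`. A self-contained sketch of that contract, assuming `pad_idx=0` (hypothetical `pad_batch` helper, not the library function):

```python
import numpy as np

def pad_batch(insts, pad_idx=0):
    """Pad to the batch max length; return ids, attention mask, and lengths."""
    max_len = max(len(inst) for inst in insts)
    # any in-vocab id works as padding: its loss is masked out downstream
    data = np.array([inst + [pad_idx] * (max_len - len(inst)) for inst in insts],
                    dtype="int64")
    # 1 over real tokens, 0 over padding, expanded to [batch, max_len, 1]
    mask = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts],
                    dtype="float32")[:, :, None]
    seq_lens = np.array([len(inst) for inst in insts], dtype="int64")
    return data, mask, seq_lens

data, mask, seq_lens = pad_batch([[1, 2, 3], [4, 5]])
```

The zero entries of the mask are what keep attention (and the padded positions' loss) from leaking into the gradients.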
  {
    "path": "paddlepalm/reader/utils/mlm_batching.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Mask, padding and batching.\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nimport numpy as np\n\n\ndef mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3, dev_count=1):\n    \"\"\"\n    Add mask for batch_tokens, return out, mask_label, mask_pos;\n    Note: mask_pos responding the batch_tokens after padded;\n    \"\"\"\n    max_len = max([len(sent) for sent in batch_tokens])\n\n    multidev_batch_tokens = []\n    multidev_mask_label = []\n    multidev_mask_pos = []\n\n    big_batch_tokens = batch_tokens\n    stride = len(batch_tokens) // dev_count\n    if stride == 0:\n        return None, None, None\n    p = stride\n\n    for i in range(dev_count):\n        batch_tokens = big_batch_tokens[p-stride:p]\n        p += stride\n        mask_label = []\n        mask_pos = []\n        prob_mask = np.random.rand(total_token_num)\n        # Note: the first token is [CLS], so [low=1]\n        replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num)\n        pre_sent_len = 0\n        prob_index = 0\n        for sent_index, sent in enumerate(batch_tokens):\n            mask_flag = False\n            prob_index += pre_sent_len\n            for token_index, token in enumerate(sent):\n                prob = 
prob_mask[prob_index + token_index]\n                if prob > 0.15:\n                    continue\n                elif 0.03 < prob <= 0.15:\n                    # mask\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        sent[token_index] = MASK\n                        mask_flag = True\n                        mask_pos.append(sent_index * max_len + token_index)\n                elif 0.015 < prob <= 0.03:\n                    # random replace\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        sent[token_index] = replace_ids[prob_index + token_index]\n                        mask_flag = True\n                        mask_pos.append(sent_index * max_len + token_index)\n                else:\n                    # keep the original token\n                    if token != SEP and token != CLS:\n                        mask_label.append(sent[token_index])\n                        mask_pos.append(sent_index * max_len + token_index)\n            pre_sent_len = len(sent)\n            # ensure at least mask one word in a sentence\n            while not mask_flag:\n                token_index = int(np.random.randint(1, high=len(sent) - 1, size=1))\n                if sent[token_index] != SEP and sent[token_index] != CLS:\n                    mask_label.append(sent[token_index])\n                    sent[token_index] = MASK\n                    mask_flag = True\n                    mask_pos.append(sent_index * max_len + token_index)\n        mask_label = np.array(mask_label).astype(\"int64\").reshape([-1])\n        mask_pos = np.array(mask_pos).astype(\"int64\").reshape([-1])\n\n        multidev_batch_tokens.extend(batch_tokens)\n        multidev_mask_label.append(mask_label)\n        multidev_mask_pos.append(mask_pos)\n    \n    return multidev_batch_tokens, multidev_mask_label, 
multidev_mask_pos\n\n\ndef prepare_batch_data(insts,\n                       total_token_num,\n                       max_len=None,\n                       voc_size=0,\n                       pad_id=None,\n                       cls_id=None,\n                       sep_id=None,\n                       mask_id=None,\n                       task_id=0,\n                       return_input_mask=True,\n                       return_max_len=True,\n                       return_num_token=False, \n                       dev_count=1):\n    \"\"\"\n    1. generate Tensor of data\n    2. generate Tensor of position\n    3. generate self attention mask, [shape: batch_size *  max_len * max_len]\n    \"\"\"\n    batch_src_ids = [inst[0] for inst in insts]\n    batch_sent_ids = [inst[1] for inst in insts]\n    batch_pos_ids = [inst[2] for inst in insts]\n\n    # TODO: should these two steps be swapped??? Otherwise the word embeddings\n    # unrolled in the task layer come from the padded batch, and the word\n    # indices no longer match the indices before padding.\n    # First step: do mask without padding\n    out, mask_label, mask_pos = mask(\n        batch_src_ids,\n        total_token_num,\n        vocab_size=voc_size,\n        CLS=cls_id,\n        SEP=sep_id,\n        MASK=mask_id,\n        dev_count=dev_count)\n    # Second step: padding\n    src_id, self_input_mask = pad_batch_data(\n        out, \n        max_len=max_len,\n        pad_idx=pad_id, return_input_mask=True)\n\n    pos_id = pad_batch_data(\n        batch_pos_ids,\n        max_len=max_len,\n        pad_idx=pad_id,\n        return_pos=False,\n        return_input_mask=False)\n    sent_id = pad_batch_data(\n        batch_sent_ids,\n        max_len=max_len,\n        pad_idx=pad_id,\n        return_pos=False,\n        return_input_mask=False)\n    task_ids = np.ones_like(\n        src_id, dtype=\"int64\") * task_id\n    return_list = [\n        src_id, pos_id, sent_id, self_input_mask, task_ids, mask_label, mask_pos\n    ]\n    return return_list\n\n\ndef pad_batch_data(insts,\n                   max_len=None,\n                
   pad_idx=0,\n                   return_pos=False,\n                   return_input_mask=False,\n                   return_max_len=False,\n                   return_num_token=False):\n    \"\"\"\n    Pad the instances to the max sequence length in batch, and generate the\n    corresponding position data and input mask.\n    \"\"\"\n    return_list = []\n    if max_len is None:\n        max_len = max(len(inst) for inst in insts)\n    # Any token included in dict can be used to pad, since the paddings' loss\n    # will be masked out by weights and make no effect on parameter gradients.\n    inst_data = np.array([\n        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts\n    ])\n    return_list += [inst_data.astype(\"int64\").reshape([-1, max_len])]\n    # position data\n    if return_pos:\n        inst_pos = np.array([\n            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))\n            for inst in insts\n        ])\n        return_list += [inst_pos.astype(\"int64\").reshape([-1, max_len])]\n    if return_input_mask:\n        # This is used to avoid attention on paddings.\n        input_mask_data = np.array([[1] * len(inst) + [0] *\n                                    (max_len - len(inst)) for inst in insts])\n        input_mask_data = np.expand_dims(input_mask_data, axis=-1)\n        return_list += [input_mask_data.astype(\"float32\")]\n    if return_max_len:\n        return_list += [max_len]\n    if return_num_token:\n        num_token = 0\n        for inst in insts:\n            num_token += len(inst)\n        return_list += [num_token]\n    return return_list if len(return_list) > 1 else return_list[0]\n\n\nif __name__ == \"__main__\":\n    pass\n\n\n"
  },
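In `mlm_batching.mask()`, the incoming batch is first sliced into `dev_count` equal strides, one per device, before masking; instances beyond `dev_count * stride` are dropped, and a batch smaller than `dev_count` gives `stride == 0`, which the function signals by returning `None`s. The slicing itself can be sketched as follows (hypothetical `split_for_devices`, illustration only):

```python
def split_for_devices(batch, dev_count):
    """Cut a batch into dev_count equal shards, or None if it is too small."""
    stride = len(batch) // dev_count
    if stride == 0:
        # fewer instances than devices: caller must skip this batch
        return None
    # note: instances past dev_count * stride are silently dropped
    return [batch[i * stride:(i + 1) * stride] for i in range(dev_count)]

shards = split_for_devices(list(range(7)), 3)   # 7 // 3 == 2, so element 6 is dropped
```

Each shard then gets its own `mask_label`/`mask_pos` arrays, which is why the function returns lists per device rather than one flat array.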
  {
    "path": "paddlepalm/reader/utils/mrqa_helper.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nclass MRQAExample(object):\n    \"\"\"A single training/test example for simple sequence classification.\n\n     For examples without an answer, the start and end position are -1.\n  \"\"\"\n\n    def __init__(self,\n                 qas_id,\n                 question_text,\n                 doc_tokens,\n                 orig_answer_text=None,\n                 start_position=None,\n                 end_position=None,\n                 is_impossible=False):\n        self.qas_id = qas_id\n        self.question_text = question_text\n        self.doc_tokens = doc_tokens\n        self.orig_answer_text = orig_answer_text\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n\n    def __str__(self):\n        return self.__repr__()\n\n    def __repr__(self):\n        s = \"\"\n        s += \"qas_id: %s\" % (tokenization.printable_text(self.qas_id))\n        s += \", question_text: %s\" % (\n            tokenization.printable_text(self.question_text))\n        s += \", doc_tokens: [%s]\" % (\" \".join(self.doc_tokens))\n        if self.start_position:\n            s += \", start_position: %d\" % (self.start_position)\n        if self.start_position:\n            s += \", end_position: %d\" % (self.end_position)\n        if 
self.start_position:\n            s += \", is_impossible: %r\" % (self.is_impossible)\n        return s\n\n\nclass MRQAFeature(object):\n    \"\"\"A single set of features of data.\"\"\"\n\n    def __init__(self,\n                 unique_id,\n                 example_index,\n                 doc_span_index,\n                 tokens,\n                 token_to_orig_map,\n                 token_is_max_context,\n                 input_ids,\n                 input_mask,\n                 segment_ids,\n                 start_position=None,\n                 end_position=None,\n                 is_impossible=None):\n        self.unique_id = unique_id\n        self.example_index = example_index\n        self.doc_span_index = doc_span_index\n        self.tokens = tokens\n        self.token_to_orig_map = token_to_orig_map\n        self.token_is_max_context = token_is_max_context\n        self.input_ids = input_ids\n        self.input_mask = input_mask\n        self.segment_ids = segment_ids\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n\n"
  },
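In `MRQAExample`, unanswerable examples carry `-1` span positions, and the optional answer fields are only meant to show up in `__repr__` when they are present. A trimmed-down, hypothetical `MiniExample` showing that guarded-repr pattern (illustration only, not the library class):

```python
class MiniExample:
    """Simplified stand-in for MRQAExample: only the repr-relevant fields."""

    def __init__(self, qas_id, question_text, start_position=None,
                 end_position=None, is_impossible=False):
        self.qas_id = qas_id
        self.question_text = question_text
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

    def __repr__(self):
        s = "qas_id: %s, question_text: %s" % (self.qas_id, self.question_text)
        # guard each optional field with its own presence check, so that a
        # valid position 0 is still printed and missing fields are skipped
        if self.start_position is not None:
            s += ", start_position: %d" % self.start_position
        if self.end_position is not None:
            s += ", end_position: %d" % self.end_position
        s += ", is_impossible: %r" % self.is_impossible
        return s

r = repr(MiniExample("q1", "who?", start_position=3, end_position=5))
```

Using `is not None` rather than truthiness matters here because an answer starting at token 0 is falsy but perfectly valid.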
  {
    "path": "paddlepalm/reader/utils/reader4ernie.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\nfrom __future__ import absolute_import\n\nimport sys\nimport os\nimport json\nimport random\nimport logging\nimport numpy as np\nimport six\nfrom io import open\nfrom collections import namedtuple\n\nimport paddlepalm as palm\nimport paddlepalm.tokenizer.ernie_tokenizer as tokenization\nfrom paddlepalm.reader.utils.batching4ernie import pad_batch_data\nfrom paddlepalm.reader.utils.mlm_batching import prepare_batch_data\n\n\nlog = logging.getLogger(__name__)\n\nif six.PY3 and hasattr(sys.stdout, 'buffer'):\n    import io\n    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')\n    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')\n\nif sys.version[0] == '2':\n    reload(sys)\n    sys.setdefaultencoding('utf-8')\nelse:\n    import importlib\n    importlib.reload(sys)\n\ndef csv_reader(fd, delimiter='\\t'):\n    def gen():\n        for i in fd:\n            yield i.rstrip('\\n').split(delimiter)\n    return gen()\n\n\nclass Reader(object):\n    def __init__(self,\n                 vocab_path,\n                 label_map_config=None,\n                 max_seq_len=512,\n                 do_lower_case=True,\n                 
in_tokens=False,\n                 is_inference=False,\n                 learning_strategy='pointwise',\n                 random_seed=None,\n                 tokenizer=\"FullTokenizer\",\n                 phase='train',\n                 is_classify=True,\n                 is_regression=False,\n                 for_cn=True,\n                 task_id=0):\n        assert phase in ['train', 'predict'], \"supported phase: train, predict.\"\n        self.max_seq_len = max_seq_len\n        self.tokenizer = tokenization.FullTokenizer(\n            vocab_file=vocab_path, do_lower_case=do_lower_case)\n        self.vocab = self.tokenizer.vocab\n        self.pad_id = self.vocab[\"[PAD]\"]\n        self.cls_id = self.vocab[\"[CLS]\"]\n        self.sep_id = self.vocab[\"[SEP]\"]\n        self.mask_id = self.vocab[\"[MASK]\"]\n        self.in_tokens = in_tokens\n        self.phase = phase\n        self.is_inference = is_inference\n        self.learning_strategy = learning_strategy\n        self.for_cn = for_cn\n        self.task_id = task_id\n\n        np.random.seed(random_seed)\n\n        self.is_classify = is_classify\n        self.is_regression = is_regression\n        self.current_example = 0\n        self.current_epoch = 0\n        self.num_examples = 0\n        self.examples = {}\n\n        if label_map_config:\n            with open(label_map_config, encoding='utf8') as f: \n                self.label_map = json.load(f)\n        else:\n            self.label_map = None\n\n    def get_train_progress(self):\n        \"\"\"Gets progress for training phase.\"\"\"\n        return self.current_example, self.current_epoch\n\n    def _read_tsv(self, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, 'r', encoding='utf8') as f:\n            reader = csv_reader(f)\n            headers = next(reader)\n            Example = namedtuple('Example', headers)\n\n            examples = []\n            for line in reader:\n    
            example = Example(*line)\n                examples.append(example)\n            return examples\n\n    def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):\n        \"\"\"Truncates a sequence pair in place to the maximum length.\"\"\"\n\n        # This is a simple heuristic which will always truncate the longer sequence\n        # one token at a time. This makes more sense than truncating an equal percent\n        # of tokens from each, since if one sequence is very short then each token\n        # that's truncated likely contains more information than a longer sequence.\n        while True:\n            total_length = len(tokens_a) + len(tokens_b)\n            if total_length <= max_length:\n                break\n            if len(tokens_a) > len(tokens_b):\n                tokens_a.pop()\n            else:\n                tokens_b.pop()\n    \n\n    def _convert_example_to_record(self, example, max_seq_length, tokenizer):\n        \"\"\"Converts a single `Example` into a single `Record`.\"\"\"\n\n        text_a = tokenization.convert_to_unicode(example.text_a)\n        tokens_a = tokenizer.tokenize(text_a)\n        tokens_b = None\n        has_text_b = False\n        has_text_b_neg = False\n        if isinstance(example, dict):\n            has_text_b = \"text_b\" in example.keys()\n            has_text_b_neg = \"text_b_neg\" in example.keys()\n        else:\n            has_text_b = \"text_b\" in example._fields\n            has_text_b_neg = \"text_b_neg\" in example._fields\n\n        if has_text_b:\n            text_b = tokenization.convert_to_unicode(example.text_b)\n            tokens_b = tokenizer.tokenize(text_b)\n            # Modifies `tokens_a` and `tokens_b` in place so that the total\n            # length is less than the specified length.\n            # Account for [CLS], [SEP], [SEP] with \"- 3\"\n            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)\n           \n            if has_text_b_neg and 
self.phase == 'train':\n                tokens_a_neg = tokenizer.tokenize(text_a)\n                text_b_neg = tokenization.convert_to_unicode(example.text_b_neg)\n                tokens_b_neg = tokenizer.tokenize(text_b_neg)\n                self._truncate_seq_pair(tokens_a_neg, tokens_b_neg, max_seq_length - 3)\n        else:\n            # Account for [CLS] and [SEP] with \"- 2\"\n            if len(tokens_a) > max_seq_length - 2:\n                tokens_a = tokens_a[0:(max_seq_length - 2)]\n        \n\n        # The convention in BERT/ERNIE is:\n        # (a) For sequence pairs:\n        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]\n        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1\n        # (b) For single sequences:\n        #  tokens:   [CLS] the dog is hairy . [SEP]\n        #  type_ids: 0     0   0   0  0     0 0\n        #\n        # Where \"type_ids\" are used to indicate whether this is the first\n        # sequence or the second sequence. The embedding vectors for `type=0` and\n        # `type=1` were learned during pre-training and are added to the wordpiece\n        # embedding vector (and position vector). This is not *strictly* necessary\n        # since the [SEP] token unambiguously separates the sequences, but it makes\n        # it easier for the model to learn the concept of sequences.\n        #\n        # For classification tasks, the first vector (corresponding to [CLS]) is\n        # used as as the \"sentence vector\". 
Note that this only makes sense because\n        # the entire model is fine-tuned.\n        tokens = []\n        text_type_ids = []\n        tokens.append(\"[CLS]\")\n        \n        text_type_ids.append(0)\n        for token in tokens_a:\n            tokens.append(token)\n            text_type_ids.append(0)\n        tokens.append(\"[SEP]\")\n        text_type_ids.append(0)\n\n        if tokens_b:\n            for token in tokens_b:\n                tokens.append(token)\n                text_type_ids.append(1)\n            tokens.append(\"[SEP]\")\n            text_type_ids.append(1)\n\n        token_ids = tokenizer.convert_tokens_to_ids(tokens)\n        position_ids = list(range(len(token_ids)))\n\n\n        if has_text_b_neg and self.phase == 'train':\n            tokens_neg = []\n            text_type_ids_neg = []\n            tokens_neg.append(\"[CLS]\")\n            text_type_ids_neg.append(0)\n            for token in tokens_a_neg:\n                tokens_neg.append(token)\n                text_type_ids_neg.append(0)\n            tokens_neg.append(\"[SEP]\")\n            text_type_ids_neg.append(0)\n\n            if tokens_b_neg:\n                for token in tokens_b_neg:\n                    tokens_neg.append(token)\n                    text_type_ids_neg.append(1)\n                tokens_neg.append(\"[SEP]\")\n                text_type_ids_neg.append(1)\n\n            token_ids_neg = tokenizer.convert_tokens_to_ids(tokens_neg)\n            position_ids_neg = list(range(len(token_ids_neg)))\n\n\n        if self.is_inference:\n            Record = namedtuple('Record',\n                                ['token_ids', 'text_type_ids', 'position_ids'])\n            record = Record(\n                token_ids=token_ids,\n                text_type_ids=text_type_ids,\n                position_ids=position_ids)\n        else:\n            qid = None\n            if \"qid\" in example._fields:\n                qid = example.qid\n            if self.learning_strategy 
== 'pairwise' and self.phase == 'train':\n                Record = namedtuple('Record',\n                                    ['token_ids', 'text_type_ids', 'position_ids', 'token_ids_neg', 'text_type_ids_neg', 'position_ids_neg', 'qid'])\n                \n                record = Record(\n                    token_ids=token_ids,\n                    text_type_ids=text_type_ids,\n                    position_ids=position_ids,\n                    token_ids_neg=token_ids_neg,\n                    text_type_ids_neg=text_type_ids_neg,\n                    position_ids_neg=position_ids_neg,\n                    qid=qid)\n \n            else:\n                if self.label_map:\n                    label_id = self.label_map[example.label]\n                else:\n                    label_id = example.label\n\n                Record = namedtuple('Record', [\n                    'token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'\n                ])\n\n                record = Record(\n                    token_ids=token_ids,\n                    text_type_ids=text_type_ids,\n                    position_ids=position_ids,\n                    label_id=label_id,\n                    qid=qid)\n        return record\n\n    def _prepare_batch_data(self, examples, batch_size, phase='train'):\n        \"\"\"generate batch records\"\"\"\n        batch_records, max_len = [], 0\n        if len(examples) < batch_size:\n            raise Exception('CLS dataset contains too few samples. 
Expect more than '+str(batch_size))\n        for index, example in enumerate(examples):\n            if phase == \"train\":\n                self.current_example = index\n            record = self._convert_example_to_record(example, self.max_seq_len,\n                                                     self.tokenizer)\n            max_len = max(max_len, len(record.token_ids))\n            if self.in_tokens:\n                to_append = (len(batch_records) + 1) * max_len <= batch_size\n            else:\n                to_append = len(batch_records) < batch_size\n            if to_append:\n                batch_records.append(record)\n            else:\n                batch_pad_records = self._pad_batch_records(batch_records)\n                ds = ['s'] * len(batch_pad_records)\n                for piece in palm.distribute.yield_pieces(batch_pad_records, ds, batch_size):\n                    yield piece\n                batch_records, max_len = [record], len(record.token_ids)\n\n        if phase == 'predict' and batch_records:\n            # `ds` must be rebuilt here from the padded records: the branch above\n            # may never have run, in which case `ds` would be undefined.\n            batch_pad_records = self._pad_batch_records(batch_records)\n            ds = ['s'] * len(batch_pad_records)\n            for piece in palm.distribute.yield_pieces(batch_pad_records, ds, batch_size):\n                yield piece\n\n    def get_num_examples(self, input_file=None, phase='train'):\n        if input_file is None:\n            return len(self.examples.get(phase, []))\n        else:\n            examples = self._read_tsv(input_file)\n            return len(examples)\n\n    def data_generator(self,\n                       input_file,\n                       batch_size,\n                       epoch,\n                       dev_count=1,\n                       shuffle=True,\n                       phase=None):\n        examples = self._read_tsv(input_file)\n        if 
phase is None:\n            phase = 'all'\n        self.examples[phase] = examples\n\n        def wrapper():\n            all_dev_batches = []\n            if epoch is None:\n                num_epochs = 99999999\n            else:\n                num_epochs = epoch\n            for epoch_index in range(num_epochs):\n                if phase == \"train\":\n                    self.current_example = 0\n                    self.current_epoch = epoch_index\n                if shuffle:\n                    np.random.shuffle(examples)\n\n                for batch_data in self._prepare_batch_data(\n                        examples, batch_size, phase=phase):\n                    if len(all_dev_batches) < dev_count:\n                        all_dev_batches.append(batch_data)\n                    if len(all_dev_batches) == dev_count:\n                        for batch in all_dev_batches:\n                            yield batch\n                        all_dev_batches = []\n\n        return wrapper\n\n\nclass MaskLMReader(Reader):\n\n    def _convert_example_to_record(self, example, max_seq_length, tokenizer):\n        \"\"\"Converts a single `Example` into a single `Record`.\"\"\"\n\n        text_a = tokenization.convert_to_unicode(example.text_a)\n        tokens_a = tokenizer.tokenize(text_a)\n        tokens_b = None\n\n        has_text_b = False\n        if isinstance(example, dict):\n            has_text_b = \"text_b\" in example.keys()\n        else:\n            has_text_b = \"text_b\" in example._fields\n\n        if has_text_b:\n            text_b = tokenization.convert_to_unicode(example.text_b)\n            tokens_b = tokenizer.tokenize(text_b)\n\n        if tokens_b:\n            # Modifies `tokens_a` and `tokens_b` in place so that the total\n            # length is less than the specified length.\n            # Account for [CLS], [SEP], 
[SEP] with \"- 3\"\n            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)\n        else:\n            # Account for [CLS] and [SEP] with \"- 2\"\n            if len(tokens_a) > max_seq_length - 2:\n                tokens_a = tokens_a[0:(max_seq_length - 2)]\n\n        # The convention in BERT/ERNIE is:\n        # (a) For sequence pairs:\n        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]\n        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1\n        # (b) For single sequences:\n        #  tokens:   [CLS] the dog is hairy . [SEP]\n        #  type_ids: 0     0   0   0  0     0 0\n        #\n        # Where \"type_ids\" are used to indicate whether this is the first\n        # sequence or the second sequence. The embedding vectors for `type=0` and\n        # `type=1` were learned during pre-training and are added to the wordpiece\n        # embedding vector (and position vector). This is not *strictly* necessary\n        # since the [SEP] token unambiguously separates the sequences, but it makes\n        # it easier for the model to learn the concept of sequences.\n        #\n        # For classification tasks, the first vector (corresponding to [CLS]) is\n        # used as the \"sentence vector\". 
Note that this only makes sense because\n        # the entire model is fine-tuned.\n        tokens = []\n        text_type_ids = []\n        tokens.append(\"[CLS]\")\n        text_type_ids.append(0)\n        for token in tokens_a:\n            tokens.append(token)\n            text_type_ids.append(0)\n        tokens.append(\"[SEP]\")\n        text_type_ids.append(0)\n\n        if tokens_b:\n            for token in tokens_b:\n                tokens.append(token)\n                text_type_ids.append(1)\n            tokens.append(\"[SEP]\")\n            text_type_ids.append(1)\n\n        token_ids = tokenizer.convert_tokens_to_ids(tokens)\n        position_ids = list(range(len(token_ids)))\n\n        return [token_ids, text_type_ids, position_ids]\n\n    def batch_reader(self, examples, batch_size, in_tokens, phase):\n        batch = []\n        total_token_num = 0\n        if len(examples) < batch_size:\n            raise Exception('MaskLM dataset contains too few samples. Expect more than '+str(batch_size))\n        for e in examples:\n            parsed_line = self._convert_example_to_record(e, self.max_seq_len, self.tokenizer)\n            to_append = len(batch) < batch_size\n            if to_append:\n                batch.append(parsed_line)\n                total_token_num += len(parsed_line[0])\n            else:\n                yield batch, total_token_num\n                batch = [parsed_line]\n                total_token_num = len(parsed_line[0])\n\n        if len(batch) > 0 and phase == 'predict':\n            yield batch, total_token_num\n\n    def data_generator(self,\n                       input_file,\n                       batch_size,\n                       epoch,\n                       dev_count=1,\n                       shuffle=True,\n                       phase=None):\n        examples = self._read_tsv(input_file)\n        if phase is None:\n            phase = 'all'\n        self.examples[phase] = examples\n\n        def wrapper():\n       
\n            if epoch is None:\n                num_epochs = 99999999\n            else:\n                num_epochs = epoch\n            for epoch_index in range(num_epochs):\n                if phase == \"train\":\n                    self.current_example = 0\n                    self.current_epoch = epoch_index\n                if shuffle:\n                    np.random.shuffle(examples)\n\n                for batch_data, num_tokens in self.batch_reader(examples,\n                                                    batch_size, self.in_tokens, phase=phase):\n                    batch_data = prepare_batch_data(\n                        batch_data,\n                        num_tokens,\n                        voc_size=len(self.vocab),\n                        pad_id=self.pad_id,\n                        cls_id=self.cls_id,\n                        sep_id=self.sep_id,\n                        mask_id=self.mask_id,\n                        # max_len=self.max_seq_len is deliberately not passed: mask_pos is\n                        # computed w.r.t. the longest sequence within the batch, so padding\n                        # every batch to max_seq_len would misalign the mask positions.\n                        return_input_mask=True,\n                        return_max_len=False,\n                        return_num_token=False,\n                        dev_count=dev_count)\n\n                    for piece in palm.distribute.yield_pieces(batch_data, ['s', 's', 's', 's', 's', 'u', 'u'], batch_size):\n                        yield piece\n\n        return wrapper\n\n\nclass ClassifyReader(Reader):\n    def _read_tsv(self, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, 'r', encoding='utf8') as f:\n            reader = csv_reader(f)\n            headers = 
next(reader)\n            text_indices = [\n                index for index, h in enumerate(headers) if h != \"label\"\n            ]\n            Example = namedtuple('Example', headers)\n            examples = []\n            for line in reader:\n                for index, text in enumerate(line):\n                    if index in text_indices:\n                        if self.for_cn:\n                            line[index] = text.replace(' ', '')\n                        else:\n                            line[index] = text\n                example = Example(*line)\n                examples.append(example)\n            return examples\n\n    def _pad_batch_records(self, batch_records):\n        batch_token_ids = [record.token_ids for record in batch_records]\n        batch_text_type_ids = [record.text_type_ids for record in batch_records]\n        batch_position_ids = [record.position_ids for record in batch_records]\n        if self.phase == 'train' and self.learning_strategy == 'pairwise':\n            batch_token_ids_neg = [record.token_ids_neg for record in batch_records]\n            batch_text_type_ids_neg = [record.text_type_ids_neg for record in batch_records]\n            batch_position_ids_neg = [record.position_ids_neg for record in batch_records]\n\n        if not self.is_inference:\n            if self.learning_strategy != 'pairwise':\n                batch_labels = [record.label_id for record in batch_records]\n                if self.is_classify:\n                    batch_labels = np.array(batch_labels).astype(\"int64\").reshape(\n                        [-1])\n                elif self.is_regression:\n                    batch_labels = np.array(batch_labels).astype(\"float32\").reshape(\n                        [-1])\n\n            # compare against None explicitly so that a legitimate qid of 0 is kept\n            if batch_records[0].qid is not None:\n                batch_qids = [record.qid for record in batch_records]\n                batch_qids = np.array(batch_qids).astype(\"int64\").reshape(\n                    [-1])\n            
else:\n                batch_qids = np.array([]).astype(\"int64\").reshape([-1])\n\n        # padding\n        padded_token_ids, input_mask = pad_batch_data(\n            batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)\n        padded_text_type_ids = pad_batch_data(\n            batch_text_type_ids, pad_idx=self.pad_id)\n        padded_position_ids = pad_batch_data(\n            batch_position_ids, pad_idx=self.pad_id)\n        padded_task_ids = np.ones_like(\n            padded_token_ids, dtype=\"int64\") * self.task_id\n\n        return_list = [\n            padded_token_ids, padded_text_type_ids, padded_position_ids,\n            padded_task_ids, input_mask\n        ]\n\n        if self.phase=='train':\n            if self.learning_strategy == 'pairwise':\n                padded_token_ids_neg, input_mask_neg = pad_batch_data(\n                    batch_token_ids_neg, pad_idx=self.pad_id, return_input_mask=True)\n                padded_text_type_ids_neg = pad_batch_data(\n                    batch_text_type_ids_neg, pad_idx=self.pad_id)\n                padded_position_ids_neg = pad_batch_data(\n                    batch_position_ids_neg, pad_idx=self.pad_id)\n                padded_task_ids_neg = np.ones_like(\n                    padded_token_ids_neg, dtype=\"int64\") * self.task_id\n\n                return_list += [padded_token_ids_neg, padded_text_type_ids_neg, \\\n                                padded_position_ids_neg, padded_task_ids_neg, input_mask_neg]\n\n            elif self.learning_strategy == 'pointwise':\n                return_list += [batch_labels]\n\n        return return_list\n\n\nclass SequenceLabelReader(Reader):\n    def _pad_batch_records(self, batch_records):\n        batch_token_ids = [record.token_ids for record in batch_records]\n        batch_text_type_ids = [record.text_type_ids for record in batch_records]\n        batch_position_ids = [record.position_ids for record in batch_records]\n        batch_label_ids = 
[record.label_ids for record in batch_records]\n\n        # padding\n        padded_token_ids, input_mask, batch_seq_lens = pad_batch_data(\n            batch_token_ids,\n            pad_idx=self.pad_id,\n            return_input_mask=True,\n            return_seq_lens=True)\n        padded_text_type_ids = pad_batch_data(\n            batch_text_type_ids, pad_idx=self.pad_id)\n        padded_position_ids = pad_batch_data(\n            batch_position_ids, pad_idx=self.pad_id)\n        padded_label_ids = pad_batch_data(\n            batch_label_ids, pad_idx=len(self.label_map) - 1)\n        padded_task_ids = np.ones_like(\n            padded_token_ids, dtype=\"int64\") * self.task_id\n\n        return_list = [\n            padded_token_ids, padded_text_type_ids, padded_position_ids,\n            padded_task_ids, input_mask, padded_label_ids, batch_seq_lens\n        ]\n        return return_list\n\n    def _reseg_token_label(self, tokens, labels, tokenizer):\n        assert len(tokens) == len(labels)\n        ret_tokens = []\n        ret_labels = []\n        for token, label in zip(tokens, labels):\n            sub_token = tokenizer.tokenize(token)\n            if len(sub_token) == 0:\n                continue\n            ret_tokens.extend(sub_token)\n            if len(sub_token) == 1:\n                ret_labels.append(label)\n                continue\n\n            ret_labels.extend([label] * len(sub_token))\n\n        assert len(ret_tokens) == len(ret_labels)\n        return ret_tokens, ret_labels\n\n    def _convert_example_to_record(self, example, max_seq_length, tokenizer):\n        tokens = tokenization.convert_to_unicode(example.text_a).split(u\"\u0002\")\n        labels = tokenization.convert_to_unicode(example.label).split(u\"\u0002\")\n        tokens, labels = self._reseg_token_label(tokens, labels, tokenizer)\n\n        if len(tokens) > max_seq_length - 2:\n            tokens = tokens[0:(max_seq_length - 2)]\n            labels = labels[0:(max_seq_length 
- 2)]\n\n        tokens = [\"[CLS]\"] + tokens + [\"[SEP]\"]\n        token_ids = tokenizer.convert_tokens_to_ids(tokens)\n        position_ids = list(range(len(token_ids)))\n        text_type_ids = [0] * len(token_ids)\n        no_entity_id = len(self.label_map) - 1\n        labels = [\n            label if label in self.label_map else u\"O\" for label in labels\n        ]\n        label_ids = [no_entity_id] + [\n            self.label_map[label] for label in labels\n        ] + [no_entity_id]\n\n        Record = namedtuple(\n            'Record',\n            ['token_ids', 'text_type_ids', 'position_ids', 'label_ids'])\n        record = Record(\n            token_ids=token_ids,\n            text_type_ids=text_type_ids,\n            position_ids=position_ids,\n            label_ids=label_ids)\n        return record\n\n\nclass ExtractEmbeddingReader(Reader):\n    def _pad_batch_records(self, batch_records):\n        batch_token_ids = [record.token_ids for record in batch_records]\n        batch_text_type_ids = [record.text_type_ids for record in batch_records]\n        batch_position_ids = [record.position_ids for record in batch_records]\n\n        # padding\n        padded_token_ids, input_mask, seq_lens = pad_batch_data(\n            batch_token_ids,\n            pad_idx=self.pad_id,\n            return_input_mask=True,\n            return_seq_lens=True)\n        padded_text_type_ids = pad_batch_data(\n            batch_text_type_ids, pad_idx=self.pad_id)\n        padded_position_ids = pad_batch_data(\n            batch_position_ids, pad_idx=self.pad_id)\n        padded_task_ids = np.ones_like(\n            padded_token_ids, dtype=\"int64\") * self.task_id\n\n        return_list = [\n            padded_token_ids, padded_text_type_ids, padded_position_ids,\n            padded_task_ids, input_mask, seq_lens\n        ]\n\n        return return_list\n\n\nclass MRCReader(Reader):\n    def __init__(self,\n                 vocab_path,\n                 
label_map_config=None,\n                 max_seq_len=512,\n                 do_lower_case=True,\n                 in_tokens=False,\n                 random_seed=None,\n                 tokenizer=\"FullTokenizer\",\n                 is_classify=True,\n                 is_regression=False,\n                 for_cn=True,\n                 task_id=0,\n                 doc_stride=128,\n                 max_query_length=64,\n                 remove_noanswer=True):\n        self.max_seq_len = max_seq_len\n        self.tokenizer = tokenization.FullTokenizer(\n            vocab_file=vocab_path, do_lower_case=do_lower_case)\n        self.vocab = self.tokenizer.vocab\n        self.pad_id = self.vocab[\"[PAD]\"]\n        self.cls_id = self.vocab[\"[CLS]\"]\n        self.sep_id = self.vocab[\"[SEP]\"]\n        self.in_tokens = in_tokens\n        self.for_cn = for_cn\n        self.task_id = task_id\n        self.doc_stride = doc_stride\n        self.max_query_length = max_query_length\n        self.examples = {}\n        self.features = {}\n        self.remove_noanswer = remove_noanswer\n\n        if random_seed is not None:\n            np.random.seed(random_seed)\n\n        self.current_example = 0\n        self.current_epoch = 0\n        self.num_examples = 0\n\n        self.Example = namedtuple('Example',\n                ['qas_id', 'question_text', 'doc_tokens', 'orig_answer_text',\n                'start_position', 'end_position'])\n        self.Feature = namedtuple(\"Feature\", [\"unique_id\", \"example_index\", \"doc_span_index\",\n                \"tokens\", \"token_to_orig_map\", \"token_is_max_context\",\n                \"token_ids\", \"position_ids\", \"text_type_ids\",\n                \"start_position\", \"end_position\"])\n        self.DocSpan = namedtuple(\"DocSpan\", [\"start\", \"length\"])\n\n    def _read_json(self, input_file, is_training):\n        examples = []\n        with open(input_file, \"r\", encoding='utf-8') as f:\n           # f = 
f.read().decode(encoding='gbk').encode(encoding='utf-8')\n            input_data = json.load(f)[\"data\"]\n            for entry in input_data:\n                for paragraph in entry[\"paragraphs\"]:\n                    paragraph_text = paragraph[\"context\"]\n                    for qa in paragraph[\"qas\"]:\n                        qas_id = qa[\"id\"]\n                        question_text = qa[\"question\"]\n                        start_pos = None\n                        end_pos = None\n                        orig_answer_text = None\n\n                        if is_training:\n                            if len(qa[\"answers\"]) != 1:\n                                raise ValueError(\n                                    \"For training, each question should have exactly 1 answer.\"\n                                )\n\n                            answer = qa[\"answers\"][0]\n                            orig_answer_text = answer[\"text\"]\n                            answer_offset = answer[\"answer_start\"]\n                            answer_length = len(orig_answer_text)\n                            doc_tokens = [\n                                paragraph_text[:answer_offset],\n                                paragraph_text[answer_offset:answer_offset +\n                                               answer_length],\n                                paragraph_text[answer_offset + answer_length:]\n                            ]\n\n                            start_pos = 1\n                            end_pos = 1\n\n                            actual_text = \" \".join(doc_tokens[start_pos:(end_pos\n                                                                         + 1)])\n                            if actual_text.find(orig_answer_text) == -1:\n                                log.info(\"Could not find answer: '%s' vs. 
'%s'\",\n                                      actual_text, orig_answer_text)\n                                continue\n                        else:\n                            doc_tokens = tokenization.tokenize_chinese_chars(\n                                paragraph_text)\n\n                        example = self.Example(\n                            qas_id=qas_id,\n                            question_text=question_text,\n                            doc_tokens=doc_tokens,\n                            orig_answer_text=orig_answer_text,\n                            start_position=start_pos,\n                            end_position=end_pos)\n                        examples.append(example)\n\n        return examples\n\n    def _improve_answer_span(self, doc_tokens, input_start, input_end,\n                             tokenizer, orig_answer_text):\n        tok_answer_text = \" \".join(tokenizer.tokenize(orig_answer_text))\n\n        for new_start in range(input_start, input_end + 1):\n            for new_end in range(input_end, new_start - 1, -1):\n                text_span = \" \".join(doc_tokens[new_start:(new_end + 1)])\n                if text_span == tok_answer_text:\n                    return (new_start, new_end)\n\n        return (input_start, input_end)\n\n    def _check_is_max_context(self, doc_spans, cur_span_index, position):\n        best_score = None\n        best_span_index = None\n        for (span_index, doc_span) in enumerate(doc_spans):\n            end = doc_span.start + doc_span.length - 1\n            if position < doc_span.start:\n                continue\n            if position > end:\n                continue\n            num_left_context = position - doc_span.start\n            num_right_context = end - position\n            score = min(num_left_context,\n                        num_right_context) + 0.01 * doc_span.length\n            if best_score is None or score > best_score:\n                best_score = score\n                
best_span_index = span_index\n\n        return cur_span_index == best_span_index\n\n    def _convert_example_to_feature(self, examples, max_seq_length, tokenizer,\n                                    is_training, remove_noanswer=True):\n        features = []\n        unique_id = 1000000000\n\n        print('converting examples to features...')\n        for (example_index, example) in enumerate(examples):\n            if example_index % 1000 == 0:\n                print('processing {}th example...'.format(example_index))\n            query_tokens = tokenizer.tokenize(example.question_text)\n            if len(query_tokens) > self.max_query_length:\n                query_tokens = query_tokens[0:self.max_query_length]\n            tok_to_orig_index = []\n            orig_to_tok_index = []\n            all_doc_tokens = []\n            for (i, token) in enumerate(example.doc_tokens):\n                orig_to_tok_index.append(len(all_doc_tokens))\n                sub_tokens = tokenizer.tokenize(token)\n                for sub_token in sub_tokens:\n                    tok_to_orig_index.append(i)\n                    all_doc_tokens.append(sub_token)\n\n            tok_start_position = None\n            tok_end_position = None\n            if is_training:\n                tok_start_position = orig_to_tok_index[example.start_position]\n                if example.end_position < len(example.doc_tokens) - 1:\n                    tok_end_position = orig_to_tok_index[example.end_position +\n                                                         1] - 1\n                else:\n                    tok_end_position = len(all_doc_tokens) - 1\n                (tok_start_position,\n                 tok_end_position) = self._improve_answer_span(\n                     all_doc_tokens, tok_start_position, tok_end_position,\n                     tokenizer, example.orig_answer_text)\n\n            max_tokens_for_doc = max_seq_length - len(query_tokens) - 3\n            doc_spans = []\n      
      start_offset = 0\n            while start_offset < len(all_doc_tokens):\n                length = len(all_doc_tokens) - start_offset\n                if length > max_tokens_for_doc:\n                    length = max_tokens_for_doc\n                doc_spans.append(self.DocSpan(start=start_offset, length=length))\n                if start_offset + length == len(all_doc_tokens):\n                    break\n                start_offset += min(length, self.doc_stride)\n           \n            for (doc_span_index, doc_span) in enumerate(doc_spans):\n                tokens = []\n                token_to_orig_map = {}\n                token_is_max_context = {}\n                text_type_ids = []\n                tokens.append(\"[CLS]\")\n                text_type_ids.append(0)\n                for token in query_tokens:\n                    tokens.append(token)\n                    text_type_ids.append(0)\n                tokens.append(\"[SEP]\")\n                text_type_ids.append(0)\n\n                for i in range(doc_span.length):\n                    split_token_index = doc_span.start + i\n                    token_to_orig_map[len(tokens)] = tok_to_orig_index[\n                        split_token_index]\n\n                    is_max_context = self._check_is_max_context(\n                        doc_spans, doc_span_index, split_token_index)\n                    token_is_max_context[len(tokens)] = is_max_context\n                    tokens.append(all_doc_tokens[split_token_index])\n                    text_type_ids.append(1)\n                tokens.append(\"[SEP]\")\n                text_type_ids.append(1)\n\n                token_ids = tokenizer.convert_tokens_to_ids(tokens)\n                position_ids = list(range(len(token_ids)))\n                start_position = None\n                end_position = None\n                if is_training:\n                    doc_start = doc_span.start\n                    doc_end = doc_span.start + doc_span.length - 1\n   
                 out_of_span = False\n                    if not (tok_start_position >= doc_start and\n                            tok_end_position <= doc_end):\n                        out_of_span = True\n                    if out_of_span:\n                        start_position = 0\n                        end_position = 0\n                        if remove_noanswer:\n                            continue\n                    else:\n                        doc_offset = len(query_tokens) + 2\n                        start_position = tok_start_position - doc_start + doc_offset\n                        end_position = tok_end_position - doc_start + doc_offset\n\n                feature = self.Feature(\n                    unique_id=unique_id,\n                    example_index=example_index,\n                    doc_span_index=doc_span_index,\n                    tokens=tokens,\n                    token_to_orig_map=token_to_orig_map,\n                    token_is_max_context=token_is_max_context,\n                    token_ids=token_ids,\n                    position_ids=position_ids,\n                    text_type_ids=text_type_ids,\n                    start_position=start_position,\n                    end_position=end_position)\n                features.append(feature)\n\n                unique_id += 1\n\n        return features\n\n    def _prepare_batch_data(self, records, batch_size, phase=None):\n        \"\"\"generate batch records\"\"\"\n        batch_records, max_len = [], 0\n\n        if len(records) < batch_size:\n            raise Exception('mrc dataset contains too few samples. 
Expect more than '+str(batch_size))\n\n        # all 8 elements returned by _pad_batch_records are split along the\n        # batch ('s') dimension when distributed; define `ds` up front so the\n        # predict-phase tail below can use it even if no batch overflowed.\n        ds = ['s'] * 8\n        for index, record in enumerate(records):\n            if phase == \"train\":\n                self.current_example = index\n            max_len = max(max_len, len(record.token_ids))\n            if self.in_tokens:\n                to_append = (len(batch_records) + 1) * max_len <= batch_size\n            else:\n                to_append = len(batch_records) < batch_size\n            if to_append:\n                batch_records.append(record)\n            else:\n                for piece in palm.distribute.yield_pieces(\\\n                        self._pad_batch_records(batch_records, phase == 'train'),\n                        ds, batch_size):\n                    yield piece\n                batch_records, max_len = [record], len(record.token_ids)\n\n        if phase == 'predict' and batch_records:\n            for piece in palm.distribute.yield_pieces(\\\n                        self._pad_batch_records(batch_records, phase == 'train'),\n                        ds, batch_size):\n                yield piece\n\n    def _pad_batch_records(self, batch_records, is_training):\n        batch_token_ids = [record.token_ids for record in batch_records]\n        batch_text_type_ids = [record.text_type_ids for record in batch_records]\n        batch_position_ids = [record.position_ids for record in batch_records]\n        if is_training:\n            batch_start_position = [\n                record.start_position for record in batch_records\n            ]\n            batch_end_position = [\n                record.end_position for record in batch_records\n            ]\n            batch_start_position = np.array(batch_start_position).astype(\n                \"int64\").reshape([-1])\n            batch_end_position = np.array(batch_end_position).astype(\n                \"int64\").reshape([-1])\n\n        
else:\n            batch_size = len(batch_token_ids)\n            batch_start_position = np.zeros(\n                shape=[batch_size], dtype=\"int64\")\n            batch_end_position = np.zeros(shape=[batch_size], dtype=\"int64\")\n\n        batch_unique_ids = [record.unique_id for record in batch_records]\n        batch_unique_ids = np.array(batch_unique_ids).astype(\"int64\").reshape(\n            [-1])\n\n        # padding\n        padded_token_ids, input_mask = pad_batch_data(\n            batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)\n        padded_text_type_ids = pad_batch_data(\n            batch_text_type_ids, pad_idx=self.pad_id)\n        padded_position_ids = pad_batch_data(\n            batch_position_ids, pad_idx=self.pad_id)\n        padded_task_ids = np.ones_like(\n            padded_token_ids, dtype=\"int64\") * self.task_id\n\n        return_list = [\n            padded_token_ids, padded_text_type_ids, padded_position_ids,\n            padded_task_ids, input_mask, batch_start_position,\n            batch_end_position, batch_unique_ids\n        ]\n\n        return return_list\n\n    def get_num_examples(self, phase):\n        return len(self.features[phase])\n\n    def get_features(self, phase):\n        return self.features[phase]\n\n    def get_examples(self, phase):\n        return self.examples[phase]\n\n    def data_generator(self,\n                       input_file,\n                       batch_size,\n                       epoch,\n                       dev_count=1,\n                       shuffle=True,\n                       phase=None):\n\n        examples = self.examples.get(phase, None)\n        features = self.features.get(phase, None)\n        if not examples:\n            examples = self._read_json(input_file, phase == \"train\")\n            features = self._convert_example_to_feature(\n                examples, self.max_seq_len, self.tokenizer, phase == \"train\", remove_noanswer=self.remove_noanswer)\n            
self.examples[phase] = examples\n            self.features[phase] = features\n\n        def wrapper():\n            all_dev_batches = []\n            if epoch is None:\n                num_epochs = 99999999\n            else:\n                num_epochs = epoch\n            for epoch_index in range(num_epochs):\n                if phase == \"train\":\n                    self.current_example = 0\n                    self.current_epoch = epoch_index\n                if phase == \"train\" and shuffle:\n                    np.random.shuffle(features)\n  \n                for batch_data in self._prepare_batch_data(\n                        features, batch_size, phase=phase):\n\n                    yield batch_data\n\n        return wrapper\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "paddlepalm/tokenizer/__init__.py",
    "content": ""
  },
  {
    "path": "paddlepalm/tokenizer/bert_tokenizer.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport collections\nimport unicodedata\nimport six\n\n\ndef convert_to_unicode(text):\n    \"\"\"Converts `text` to Unicode (if it's not already), assuming utf-8 input.\"\"\"\n    if six.PY3:\n        if isinstance(text, str):\n            return text\n        elif isinstance(text, bytes):\n            return text.decode(\"utf-8\", \"ignore\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    elif six.PY2:\n        if isinstance(text, str):\n            return text.decode(\"utf-8\", \"ignore\")\n        elif isinstance(text, unicode):\n            return text\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    else:\n        raise ValueError(\"Not running on Python2 or Python 3?\")\n\n\ndef printable_text(text):\n    \"\"\"Returns text encoded in a way suitable for print or `tf.logging`.\"\"\"\n\n    # These functions want `str` for both Python2 and Python3, but in one case\n    # it's a Unicode string and in the other it's a byte string.\n    if six.PY3:\n        if isinstance(text, str):\n            return text\n        elif isinstance(text, bytes):\n            
return text.decode(\"utf-8\", \"ignore\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    elif six.PY2:\n        if isinstance(text, str):\n            return text\n        elif isinstance(text, unicode):\n            return text.encode(\"utf-8\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    else:\n        raise ValueError(\"Not running on Python2 or Python 3?\")\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file) as fin:\n        for num, line in enumerate(fin):\n            items = convert_to_unicode(line.strip()).split(\"\\t\")\n            if len(items) > 2:\n                break\n            token = items[0]\n            index = items[1] if len(items) == 2 else num\n            token = token.strip()\n            vocab[token] = int(index)\n    return vocab\n\n\ndef convert_by_vocab(vocab, items):\n    \"\"\"Converts a sequence of [tokens|ids] using the vocab.\"\"\"\n    output = []\n    for item in items:\n        output.append(vocab[item])\n    return output\n\n\ndef convert_tokens_to_ids(vocab, tokens):\n    return convert_by_vocab(vocab, tokens)\n\n\ndef convert_ids_to_tokens(inv_vocab, ids):\n    return convert_by_vocab(inv_vocab, ids)\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass FullTokenizer(object):\n    \"\"\"Runs end-to-end tokenization.\"\"\"\n\n    def __init__(self, vocab_file, do_lower_case=True):\n        self.vocab = load_vocab(vocab_file)\n        self.inv_vocab = {v: k for k, v in self.vocab.items()}\n        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)\n\n    def tokenize(self, text):\n        
split_tokens = []\n        for token in self.basic_tokenizer.tokenize(text):\n            for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                split_tokens.append(sub_token)\n\n        return split_tokens\n\n    def convert_tokens_to_ids(self, tokens):\n        return convert_by_vocab(self.vocab, tokens)\n\n    def convert_ids_to_tokens(self, ids):\n        return convert_by_vocab(self.inv_vocab, ids)\n\n\nclass CharTokenizer(object):\n    \"\"\"Runs end-to-end tokenization.\"\"\"\n\n    def __init__(self, vocab_file, do_lower_case=True):\n        self.vocab = load_vocab(vocab_file)\n        self.inv_vocab = {v: k for k, v in self.vocab.items()}\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)\n\n    def tokenize(self, text):\n        split_tokens = []\n        for token in text.lower().split(\" \"):\n            for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                split_tokens.append(sub_token)\n\n        return split_tokens\n\n    def convert_tokens_to_ids(self, tokens):\n        return convert_by_vocab(self.vocab, tokens)\n\n    def convert_ids_to_tokens(self, ids):\n        return convert_by_vocab(self.inv_vocab, ids)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True):\n        \"\"\"Constructs a BasicTokenizer.\n\n        Args:\n            do_lower_case: Whether to lower case the input.\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        text = convert_to_unicode(text)\n        text = self._clean_text(text)\n\n        # This was added on November 1st, 2018 for the multilingual and Chinese\n        # models. 
This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        text = self._tokenize_chinese_chars(text)\n\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in self._never_lowercase:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            if token in self._never_lowercase:\n                split_tokens.extend([token])\n            else:\n                split_tokens.extend(self._run_split_on_punc(token))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return [\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, 
text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #\n            (cp >= 0x3400 and cp <= 0x4DBF) or  #\n            (cp >= 0x20000 and cp <= 0x2A6DF) or  #\n            (cp >= 0x2A700 and cp <= 0x2B73F) or  #\n            (cp >= 0x2B740 and cp <= 0x2B81F) or  #\n            (cp >= 0x2B820 and cp <= 0x2CEAF) or\n            (cp >= 0xF900 and cp <= 0xFAFF) or  #\n            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xfffd or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n\nclass 
WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token=\"[UNK]\", max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n            input = \"unaffable\"\n            output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n            text: A single token or whitespace separated tokens. This should have\n                already been passed through `BasicTokenizer`.\n\n        Returns:\n            A list of wordpiece tokens.\n        \"\"\"\n\n        text = convert_to_unicode(text)\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                output_tokens.append(self.unk_token)\n            else:\n                
output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically control characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or\n        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n"
  },
  {
    "path": "paddlepalm/tokenizer/ernie_tokenizer.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nfrom io import open\n\nimport collections\nimport unicodedata\nimport six\n\n\ndef convert_to_unicode(text):\n    \"\"\"Converts `text` to Unicode (if it's not already), assuming utf-8 input.\"\"\"\n    if six.PY3:\n        if isinstance(text, str):\n            return text\n        elif isinstance(text, bytes):\n            return text.decode(\"utf-8\", \"ignore\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    elif six.PY2:\n        if isinstance(text, str):\n            return text.decode(\"utf-8\", \"ignore\")\n        elif isinstance(text, unicode):\n            return text\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    else:\n        raise ValueError(\"Not running on Python2 or Python 3?\")\n\n\ndef printable_text(text):\n    \"\"\"Returns text encoded in a way suitable for print or `tf.logging`.\"\"\"\n\n    # These functions want `str` for both Python2 and Python3, but in one case\n    # it's a Unicode string and in the other it's a byte string.\n    if six.PY3:\n        
if isinstance(text, str):\n            return text\n        elif isinstance(text, bytes):\n            return text.decode(\"utf-8\", \"ignore\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    elif six.PY2:\n        if isinstance(text, str):\n            return text\n        elif isinstance(text, unicode):\n            return text.encode(\"utf-8\")\n        else:\n            raise ValueError(\"Unsupported string type: %s\" % (type(text)))\n    else:\n        raise ValueError(\"Not running on Python2 or Python 3?\")\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file, encoding='utf8') as fin:\n        for num, line in enumerate(fin):\n            items = convert_to_unicode(line.strip()).split(\"\\t\")\n            if len(items) > 2:\n                break\n            token = items[0]\n            index = items[1] if len(items) == 2 else num\n            token = token.strip()\n            vocab[token] = int(index)\n    return vocab\n\n\ndef convert_by_vocab(vocab, items):\n    \"\"\"Converts a sequence of [tokens|ids] using the vocab.\"\"\"\n    output = []\n    for item in items:\n        output.append(vocab[item])\n    return output\n\n\ndef convert_tokens_to_ids(vocab, tokens):\n    return convert_by_vocab(vocab, tokens)\n\n\ndef convert_ids_to_tokens(inv_vocab, ids):\n    return convert_by_vocab(inv_vocab, ids)\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass FullTokenizer(object):\n    \"\"\"Runs end-to-end tokenization.\"\"\"\n\n    def __init__(self, vocab_file, do_lower_case=True):\n        self.vocab = load_vocab(vocab_file)\n        self.inv_vocab = {v: k for k, v in self.vocab.items()}\n        self.basic_tokenizer = 
BasicTokenizer(do_lower_case=do_lower_case)\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)\n\n    def tokenize(self, text):\n        split_tokens = []\n        for token in self.basic_tokenizer.tokenize(text):\n            for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                split_tokens.append(sub_token)\n\n        return split_tokens\n\n    def convert_tokens_to_ids(self, tokens):\n        return convert_by_vocab(self.vocab, tokens)\n\n    def convert_ids_to_tokens(self, ids):\n        return convert_by_vocab(self.inv_vocab, ids)\n\n\nclass CharTokenizer(object):\n    \"\"\"Runs end-to-end tokenization.\"\"\"\n\n    def __init__(self, vocab_file, do_lower_case=True):\n        self.vocab = load_vocab(vocab_file)\n        self.inv_vocab = {v: k for k, v in self.vocab.items()}\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)\n\n    def tokenize(self, text):\n        split_tokens = []\n        for token in text.lower().split(\" \"):\n            for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                split_tokens.append(sub_token)\n\n        return split_tokens\n\n    def convert_tokens_to_ids(self, tokens):\n        return convert_by_vocab(self.vocab, tokens)\n\n    def convert_ids_to_tokens(self, ids):\n        return convert_by_vocab(self.inv_vocab, ids)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True):\n        \"\"\"Constructs a BasicTokenizer.\n\n        Args:\n            do_lower_case: Whether to lower case the input.\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        text = convert_to_unicode(text)\n        text = self._clean_text(text)\n\n        # This was added on November 
1st, 2018 for the multilingual and Chinese\n        # models. This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        text = self._tokenize_chinese_chars(text)\n\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in self._never_lowercase:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            if token in self._never_lowercase:\n                split_tokens.extend([token])\n            else:\n                split_tokens.extend(self._run_split_on_punc(token))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return 
[\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #\n            (cp >= 0x3400 and cp <= 0x4DBF) or  #\n            (cp >= 0x20000 and cp <= 0x2A6DF) or  #\n            (cp >= 0x2A700 and cp <= 0x2B73F) or  #\n            (cp >= 0x2B740 and cp <= 0x2B81F) or  #\n            (cp >= 0x2B820 and cp <= 0x2CEAF) or\n            (cp >= 0xF900 and cp <= 0xFAFF) or  #\n            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xfffd or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n           
     output.append(char)\n        return \"\".join(output)\n\n\nclass WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token=\"[UNK]\", max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n            input = \"unaffable\"\n            output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n            text: A single token or whitespace separated tokens. This should have\n                already been passed through `BasicTokenizer`.\n\n        Returns:\n            A list of wordpiece tokens.\n        \"\"\"\n\n        text = convert_to_unicode(text)\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                
output_tokens.append(self.unk_token)\n            else:\n                output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically control characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or\n        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n\n\ndef tokenize_chinese_chars(text):\n    \"\"\"Adds whitespace around any CJK character.\"\"\"\n\n    def _is_chinese_char(cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is 
NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #\n            (cp >= 0x3400 and cp <= 0x4DBF) or  #\n            (cp >= 0x20000 and cp <= 0x2A6DF) or  #\n            (cp >= 0x2A700 and cp <= 0x2B73F) or  #\n            (cp >= 0x2B740 and cp <= 0x2B81F) or  #\n            (cp >= 0x2B820 and cp <= 0x2CEAF) or\n            (cp >= 0xF900 and cp <= 0xFAFF) or  #\n            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #\n            return True\n\n        return False\n\n    def _is_whitespace(c):\n        if c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F:\n            return True\n        return False\n\n    output = []\n    buff = \"\"\n    for char in text:\n        cp = ord(char)\n        if _is_chinese_char(cp) or _is_whitespace(char):\n            if buff != \"\":\n                output.append(buff)\n                buff = \"\"\n            output.append(char)\n        else:\n            buff += char\n\n    if buff != \"\":\n        output.append(buff)\n\n    return output\n"
  },
  {
    "path": "paddlepalm/trainer.py",
    "content": "# -*- coding: utf-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom __future__ import print_function\nimport os\nimport json\nfrom paddle import fluid\nimport time\nimport sys\nimport numpy as np\nimport paddlepalm.utils.basic_helper as helper\nfrom paddlepalm.utils import reader_helper, saver\nfrom paddlepalm.distribute import gpu_dev_count, data_feeder, decode_fake\n# from paddlepalm.default_settings import *\n\nDEBUG=False\n\n\nclass Trainer(object):\n    \"\"\"\n    The core unit for running a training/predicting session on a single task. A trainer builds the computation graph, manages the training and evaluation process, and handles model/checkpoint saving as well as pretrain_model/checkpoint loading.\n    \"\"\"\n\n    def __init__(self, name, mix_ratio=1.0, reuse_head_with=None):\n        \"\"\"Create a new trainer.\n\n        Args:\n            name: string. The name of the trainer (training task).\n            mix_ratio: sampling weight of this trainer in multi-task learning mode. Default is 1.0.\n            reuse_head_with: reuse parameters of the task head with another trainer. 
Default is None, not reuse with others.\n\n        \"\"\"\n\n        self._name = name\n        self._pred_reader = None\n        self._task_head = None\n        self._pred_head = None\n      \n        self._train_reader = None\n        self._dist_train_init = False\n        self._predict_reader = None\n        self._train_iterator = None\n        self._predict_iterator = None\n\n        self._train_init = False\n        self._predict_init = False\n        self._train_init_prog = None\n        self._pred_init_prog = None\n\n        self._check_save = lambda: False\n\n        self._task_reuse_scope = name if reuse_head_with is None else reuse_head_with\n\n        self._feeded_var_names = None\n        self._target_vars = None\n        self._predict_vars = None\n\n        self._num_examples = 0\n\n        self._multi_task = False\n        self._as_auxilary = False\n        self._task_id = None\n\n        # training process management\n        self._mix_ratio = mix_ratio\n        self._expected_train_steps = None\n        self._expected_train_epochs = None\n        self._steps_pur_epoch = None\n        self._pred_steps_pur_epoch = None\n        self._cur_train_epoch = 0\n        self._cur_train_step = 0\n        self._train_finish = False\n\n        self._inputname_to_varname = {}\n        self._pred_input_name_list = []\n        self._pred_input_varname_list = []\n        self._pred_fetch_name_list = []\n        self._pred_fetch_var_list = []\n\n        # exe is built when random_init_params called.\n        self._exe = None\n\n        self._save_protocol = {\n            'input_names': 'self._pred_input_name_list',\n            'input_varnames': 'self._pred_input_varname_list',\n            'fetch_list': 'self._pred_fetch_name_list'}\n\n        self._lock = False\n        self._lock_prog = False\n        self._build_forward = False\n\n    def build_forward(self, backbone, task_head):\n        \"\"\"\n        Build forward computation graph for training, which 
is usually built from input layer to loss node.\n\n        Args:\n            backbone: a Backbone object with phase == 'train', which is used to extract multi-level text features, e.g., contextual word embedding and sentence embedding.\n            task_head: a Head object with phase == 'train', which is used to build task specific output layers.\n        \n        Return:\n            loss_var: a Variable object. The computational graph variable(node) of loss.\n        \"\"\"\n\n\n        self._task_head = task_head\n        self._backbone = backbone\n\n        self._build_forward = True\n        \n        # create reader, task\n        # then check i/o across reader, backbone and task_layer\n        task_attrs = []\n        pred_task_attrs = []\n\n        task_attr_from_reader = helper.encode_inputs(self._task_head.inputs_attrs['reader'], self.name)\n\n        # merge reader input attrs from backbone and task_instances\n        input_names, shape_and_dtypes, name_to_position = reader_helper.merge_input_attrs(backbone.inputs_attr, task_attr_from_reader, insert_taskid=False)\n        # shapes: [task_id, shapes_of_backbone, shapes_of_inst1, ..., shapes_of_instN]\n        self._shape_and_dtypes = shape_and_dtypes\n        self._name_to_position = name_to_position\n        self._input_names = input_names\n\n        if DEBUG:\n            print('----- for debug -----')\n            print('input names:')\n            print(input_names)\n            print('input shape and dtypes:')\n            print(shape_and_dtypes)\n\n        input_attrs = [[i, j, k] for i, (j,k) in zip(input_names, shape_and_dtypes)]\n\n        train_prog = fluid.Program()\n        train_init_prog = fluid.Program()\n\n        if not self._lock_prog:\n            self._train_prog = train_prog\n            self._train_init_prog = train_init_prog\n\n        if not self._lock_prog:\n            with fluid.program_guard(train_prog, train_init_prog):\n                net_inputs = 
reader_helper.create_net_inputs(input_attrs, is_async=False)\n                bb_output_vars = backbone.build(net_inputs)\n        else:\n            net_inputs = reader_helper.create_net_inputs(input_attrs, is_async=False)\n            bb_output_vars = backbone.build(net_inputs)\n        self._net_inputs = net_inputs\n        assert sorted(bb_output_vars.keys()) == sorted(backbone.outputs_attr.keys())\n\n        task_output_vars = {}\n        task_inputs = {'backbone': bb_output_vars}\n        task_inputs_from_reader = helper.decode_inputs(net_inputs, self.name)\n        task_inputs['reader'] = task_inputs_from_reader\n\n        scope = self.name+'.'\n        if not self._lock_prog:\n            with fluid.program_guard(train_prog, train_init_prog):\n                with fluid.unique_name.guard(scope):\n                    output_vars = self._build_head(task_inputs, phase='train', scope=scope)\n        else:\n            with fluid.unique_name.guard(scope):\n                output_vars = self._build_head(task_inputs, phase='train', scope=scope)\n\n        output_vars = {self.name+'.'+key: val for key, val in output_vars.items()}\n        old = len(task_output_vars) # for debug\n        task_output_vars.update(output_vars)\n        assert len(task_output_vars) - old == len(output_vars) # for debug\n\n        bb_fetches = {k: v.name for k,v in bb_output_vars.items()}\n        task_fetches = {k: v.name for k,v in task_output_vars.items()}\n        self._fetches = task_fetches\n        self._fetch_names, self._fetch_list = zip(*self._fetches.items())\n        if not self._lock_prog:\n            with fluid.program_guard(train_prog, train_init_prog):\n                loss_var = fluid.layers.reduce_sum(task_output_vars[self.name+'.loss'])\n        else:\n            loss_var = fluid.layers.reduce_sum(task_output_vars[self.name+'.loss'])\n\n        self._loss_var = loss_var\n\n        if not self._multi_task:\n            self._init_exe_prog(for_train=True)\n\n        
return loss_var\n\n    def build_predict_forward(self, pred_backbone, pred_head):\n        \"\"\"\n        Build computation graph for evaluation and prediction.\n\n        Arguments:\n            - pred_backbone: a Backbone object with phase == 'predict'. For evaluating the model during training, the predict backbone should be the same as the train backbone.\n            - pred_head: a Head object with phase == 'predict'. For evaluating the model during training, the predict head should be the same as the train head.\n        \n        Return:\n            - output_vars: dict type. Each value is a computational graph variable(node) as specified by pred_head's outputs_attr.\n        \"\"\"\n        self._pred_head = pred_head\n        self._pred_backbone = pred_backbone\n        pred_task_attr_from_reader = helper.encode_inputs(self._pred_head.inputs_attrs['reader'], self.name)\n\n        pred_input_names, pred_shape_and_dtypes, pred_name_to_position = reader_helper.merge_input_attrs(pred_backbone.inputs_attr, pred_task_attr_from_reader, insert_taskid=False)\n        pred_input_attrs = [[i, j, k] for i, (j,k) in zip(pred_input_names, pred_shape_and_dtypes)]\n        self._pred_shape_and_dtypes = pred_shape_and_dtypes\n        self._pred_name_to_position = pred_name_to_position\n        self._pred_input_names = pred_input_names\n\n        if not self._lock_prog:\n            pred_prog = fluid.Program()\n            self._pred_prog = pred_prog\n            pred_init_prog = fluid.Program()\n            self._pred_init_prog = pred_init_prog\n\n            with fluid.program_guard(pred_prog, pred_init_prog):\n                pred_net_inputs = reader_helper.create_net_inputs(pred_input_attrs)\n                pred_bb_output_vars = pred_backbone.build(pred_net_inputs)\n                self._pred_net_inputs = pred_net_inputs\n        else:\n            pred_net_inputs = reader_helper.create_net_inputs(pred_input_attrs)\n            pred_bb_output_vars = 
pred_backbone.build(pred_net_inputs)\n            self._pred_net_inputs = pred_net_inputs\n\n        # prepare predict vars for saving inference model\n        if not self._lock_prog:\n            with fluid.program_guard(pred_prog, pred_init_prog):\n                cur_inputs = helper.decode_inputs(pred_net_inputs, self.name)\n                self._pred_input_name_list, self._pred_input_varname_list = \\\n                    zip(*[[k, v.name] for k,v in cur_inputs.items()])\n\n                pred_task_inputs = {'backbone': pred_bb_output_vars, 'reader': cur_inputs}\n                scope = self.name + '.'\n                with fluid.unique_name.guard(scope):\n                    output_vars = self._build_head(pred_task_inputs, phase='predict', scope=scope)\n        else:\n            cur_inputs = helper.decode_inputs(pred_net_inputs, self.name)\n            self._pred_input_name_list, self._pred_input_varname_list = \\\n                zip(*[[k, v.name] for k,v in cur_inputs.items()])\n\n            pred_task_inputs = {'backbone': pred_bb_output_vars, 'reader': cur_inputs}\n            scope = self.name + '.'\n            with fluid.unique_name.guard(scope):\n                output_vars = self._build_head(pred_task_inputs, phase='predict', scope=scope)\n\n        if output_vars is not None:\n            self._pred_fetch_name_list, self._pred_fetch_list = zip(*output_vars.items())\n        else:\n            self._pred_fetch_name_list = []\n            self._pred_fetch_var_list = []\n\n        # if not self._multi_task:\n        self._init_exe_prog(for_train=False)\n        self._exe.run(self._pred_init_prog)\n\n        self._predict_vars = output_vars\n            \n        return output_vars\n\n    def build_backward(self, optimizer, weight_decay=None, use_ema=False, ema_decay=None):\n        \"\"\"\n        Build backward computation graph and training strategy.\n\n        Arguments:\n            - optimizer: an Optimizer object, used to compute and apply gradient updates.\n            - weight_decay: optional, default is 
None (disable weight decay).\n            - use_ema: optional, default is False. The flag to control whether to apply the Exponential Moving Average strategy on parameter updates.\n            - ema_decay: optional, default is None. Only works with use_ema == True. Controls the decay rate of the EMA strategy.\n\n        \"\"\"\n        # build optimizer\n        assert self._loss_var is not None and self._train_init_prog is not None, \"train graph not found! You should build_forward first.\"\n        optimizer._set_prog(self._train_prog, self._train_init_prog)\n        with fluid.program_guard(self._train_prog, self._train_init_prog):\n            param_grads = optimizer._build()\n\n            if weight_decay is not None:\n\n                param_list = dict()\n\n                for param in self._train_prog.global_block().all_parameters():\n                    param_list[param.name] = param * 1.0\n                    param_list[param.name].stop_gradient = True\n\n                def exclude_from_weight_decay(name):\n                    if name.find(\"layer_norm\") > -1:\n                        return True\n                    bias_suffix = [\"_bias\", \"_b\", \".b_0\"]\n                    for suffix in bias_suffix:\n                        if name.endswith(suffix):\n                            return True\n                    return False\n\n                for param, grad in param_grads:\n                    if exclude_from_weight_decay(param.name):\n                        continue\n                    with param.block.program._optimized_guard(\n                        [param, grad]), fluid.framework.name_scope(\"weight_decay\"):\n                        updated_param = param - param_list[\n                            param.name] * weight_decay * optimizer.get_cur_learning_rate()\n                        fluid.layers.assign(output=param, input=updated_param)\n\n            if use_ema:\n                ema = fluid.optimizer.ExponentialMovingAverage(ema_decay)\n             
   ema.update()\n\n        self._exe.run(self._train_init_prog)\n\n    def set_as_aux(self):\n        \"\"\"Set the task in this trainer as an auxiliary task. \\nCAUTION: This API only works in multi-task learning mode. Each task is set as a target task by default. \"\"\"\n        self._as_auxilary = True\n\n    def fit_reader(self, reader, phase='train'):\n        \"\"\"\n        Bind a reader with loaded train/predict data to the trainer. \n        \n        Args:\n            reader: a Reader object. The running phase of the reader should be consistent with the `phase` argument of this method.\n            phase: running phase. Currently supported: train, predict.\n\n        \"\"\"\n\n        self._check_phase(phase)\n        if phase=='train':\n            assert self._shape_and_dtypes is not None, \"You need to build_forward or build_predict_forward first to prepare input features.\"\n        else:\n            assert self._pred_shape_and_dtypes is not None, \"You need to build_forward or build_predict_forward first to prepare input features.\"\n\n        batch_size = reader._batch_size\n\n        self._num_epochs = reader.num_epochs\n        if phase == 'train':\n            self._train_reader = reader\n            self._steps_pur_epoch = reader.num_examples // batch_size\n            shape_and_dtypes = self._shape_and_dtypes\n            name_to_position = self._name_to_position\n            if self._task_id is not None:\n                self._net_inputs['__task_id'] = self._task_id\n            net_inputs = self._net_inputs\n            self._train_batch_size = batch_size\n            self._num_examples = reader.num_examples\n            reader_helper.check_io(self._backbone.inputs_attr, reader.outputs_attr, in_name='backbone', out_name='reader(train)')\n            reader_helper.check_io(self._task_head.inputs_attrs['reader'], reader.outputs_attr, in_name='task_head(reader)', out_name='reader(train)')\n            
reader_helper.check_io(self._task_head.inputs_attrs['backbone'], self._backbone.outputs_attr, in_name='task_head(backbone, train)', out_name='backbone')\n        elif phase == 'predict':\n            self._predict_reader = reader\n            self._pred_steps_pur_epoch = reader.num_examples // batch_size \n            shape_and_dtypes = self._pred_shape_and_dtypes\n            name_to_position = self._pred_name_to_position\n            net_inputs = self._pred_net_inputs\n            self._predict_batch_size = batch_size\n            self._pred_num_examples = reader.num_examples\n            reader_helper.check_io(self._pred_backbone.inputs_attr, reader.outputs_attr, in_name='backbone', out_name='reader(predict)')\n            reader_helper.check_io(self._pred_head.inputs_attrs['reader'], reader.outputs_attr, in_name='task_head(reader)', out_name='reader(predict)')\n            reader_helper.check_io(self._pred_head.inputs_attrs['backbone'], self._pred_backbone.outputs_attr, in_name='task_head(backbone, predict)', out_name='backbone')\n        else:\n            raise NotImplementedError()\n\n        # merge dataset iterators and create net input vars\n        iterator = reader._iterator()\n        prefix = self.name\n\n        # runtime check and adaptation of the data yielded by the iterator\n        iterator_fn = reader_helper.create_iterator_fn(iterator, prefix, shape_and_dtypes, name_to_position, return_type='dict')\n        self._raw_iterator_fn = iterator_fn\n        feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs)\n        if gpu_dev_count > 1:\n            distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase)\n        else:\n            distribute_feeder_fn = iterator_fn()\n\n        if phase == 'train':\n            self._train_iterator = distribute_feeder_fn\n            
self._feed_batch_process_fn = feed_batch_process_fn\n        elif phase == 'predict':\n            self._predict_iterator = distribute_feeder_fn\n            self._pred_feed_batch_process_fn = feed_batch_process_fn\n        return distribute_feeder_fn\n\n    def load_ckpt(self, model_path):\n        \"\"\"\n        load training checkpoint for further training or predicting.\n\n        Args:\n            model_path: the path of saved checkpoint/parameters.\n        \"\"\"\n        assert self._train_init_prog is not None or self._pred_init_prog is not None, \"model graph not built. You should at least build_forward or build_predict_forward to load its checkpoint.\"\n\n        if self._train_init_prog is not None:\n            print('loading checkpoint into train program')\n            saver.init_checkpoint(\n                self._exe,\n                model_path,\n                main_program=self._train_init_prog)\n        elif self._pred_init_prog is not None:\n            saver.init_checkpoint(\n                self._exe,\n                model_path,\n                main_program=self._pred_init_prog)\n        else:\n            raise Exception(\"model not found. 
You should at least build_forward or build_predict_forward to load its checkpoint.\")\n\n    def load_predict_model(self, model_path, convert=False):\n        \"\"\"\n        load saved model parameters for prediction.\n\n        Args:\n            model_path: the path of the saved parameters.\n        \"\"\"\n\n        assert self._pred_prog is not None, \"predict graph not found. You should at least build_predict_forward to load its parameters.\"\n\n        saver.init_pretraining_params(\n            self._exe,\n            model_path,\n            convert=convert,\n            main_program=self._pred_prog)\n\n    def load_pretrain(self, model_path, convert=False):\n        \"\"\"\n        load pretrained models(backbone) for training.\n\n        Args:\n            model_path: the path of saved pretrained parameters.\n        \"\"\"\n        assert self._train_init_prog is not None, \"training graph not found. You should at least build_forward to load its pretrained parameters.\"\n\n        saver.init_pretraining_params(\n            self._exe,\n            model_path,\n            convert=convert,\n            main_program=self._train_init_prog)\n\n    def set_saver(self, save_path, save_steps, save_type='ckpt'):\n        \"\"\"\n        create a built-in saver for the trainer. The saver will automatically save a checkpoint or predict model every `save_steps` training steps.\n\n        Args:\n            save_path: a string. the path to save checkpoints or predict models.\n            save_steps: an integer. the frequency to save models.\n            save_type: a string. The type of saved model. Currently supports checkpoint(ckpt) and predict model(predict), default is ckpt. If both types are needed, you can set it as \"ckpt,predict\".\n\n        \"\"\"\n\n        save_type = save_type.split(',')\n        if 'predict' in save_type:\n            assert self._pred_head is not None, \"Predict head not found! 
You should build_predict_forward first if you want to save a predict model.\"\n            assert save_path is not None and save_steps is not None, 'save_path and save_steps are required to save the model.'\n            self._save_predict = True\n            if not os.path.exists(save_path):\n                os.makedirs(save_path)\n        else:\n            self._save_predict = False\n\n        if 'ckpt' in save_type:\n            if save_path is not None and save_steps is not None:\n                self._save_ckpt = True\n                if not os.path.exists(save_path):\n                    os.makedirs(save_path)\n            else:\n                print(\"WARNING: save_path or save_steps is not set, model will not be saved during training.\")\n                self._save_ckpt = False\n        else:\n            self._save_ckpt = False\n\n        def temp_func():\n            if (self._save_predict or self._save_ckpt) and self._cur_train_step % save_steps == 0:\n\n                if self._save_predict:\n                    self._save(save_path, suffix='pred.step'+str(self._cur_train_step))\n                    print('predict model has been saved at '+os.path.join(save_path, 'pred.step'+str(self._cur_train_step)))\n                    sys.stdout.flush()\n                if self._save_ckpt:\n                    fluid.io.save_persistables(self._exe, os.path.join(save_path, 'ckpt.step'+str(self._cur_train_step)), self._train_prog)\n                    print('checkpoint has been saved at '+os.path.join(save_path, 'ckpt.step'+str(self._cur_train_step)))\n                    sys.stdout.flush()\n                return True\n            else:\n                return False\n\n        self._check_save = temp_func\n            \n    def train(self, print_steps=5):\n        \"\"\"\n        start training.\n\n        Args:\n            print_steps: int. 
Logging frequency of training message, e.g., current step, loss and speed.\n        \"\"\"\n        \n        iterator = self._train_iterator\n        self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name)\n\n        time_begin = time.time()\n        for feed in iterator:\n            rt_outputs = self.train_one_step(feed)\n\n            task_rt_outputs = {k[len(self.name+'.'):]: v for k,v in rt_outputs.items() if k.startswith(self.name+'.')}\n            self._task_head.batch_postprocess(task_rt_outputs)\n\n\n            if print_steps > 0 and self._cur_train_step % print_steps == 0:\n                loss = rt_outputs[self.name+'.loss']\n                loss = np.mean(np.squeeze(loss)).tolist()\n\n                time_end = time.time()\n                time_cost = time_end - time_begin\n\n                print(\"step {}/{} (epoch {}), loss: {:.3f}, speed: {:.2f} steps/s\".format(\n                       (self._cur_train_step-1) % self._steps_pur_epoch + 1 , self._steps_pur_epoch, self._cur_train_epoch,\n                       loss, print_steps / time_cost))\n                sys.stdout.flush()\n                time_begin = time.time() \n\n            if self._num_epochs is None and not self._multi_task and self._cur_train_step == self._steps_pur_epoch:\n                break\n        \n    def predict(self, output_dir=None, print_steps=1000):\n        \"\"\"\n        start predicting.\n\n        Args:\n            output_dir: str. The path to save prediction results, default is None. If set as None, the results would output to screen directly. \n            print_steps: int. 
Logging frequency of predicting message, e.g., current progress and speed.\n        \"\"\"\n        iterator = self._predict_iterator\n        self._distribute_pred_prog = fluid.CompiledProgram(self._pred_prog).with_data_parallel()\n\n\n        if output_dir is not None and not os.path.exists(output_dir):\n            os.makedirs(output_dir)\n\n        time_begin = time.time()\n        \n        cur_predict_step = 0\n        for feed in iterator:\n            rt_outputs = self.predict_one_batch(feed)\n            self._pred_head.batch_postprocess(rt_outputs)\n\n            cur_predict_step += 1\n\n            if print_steps > 0 and cur_predict_step % print_steps == 0:\n                time_end = time.time()\n                time_cost = time_end - time_begin\n\n                print(\"batch {}/{}, speed: {:.2f} steps/s\".format(\n                       cur_predict_step, self._pred_steps_pur_epoch,\n                       print_steps / time_cost))\n                sys.stdout.flush()\n                time_begin = time.time()\n\n        if self._pred_head.epoch_inputs_attrs:\n            reader_outputs = self._predict_reader.get_epoch_outputs()\n        else:\n            reader_outputs = None\n\n        results = self._pred_head.epoch_postprocess({'reader':reader_outputs}, output_dir=output_dir)\n        return results\n\n    def reset_buffer(self):\n        self._pred_head.reset()\n\n    def _check_phase(self, phase):\n        assert phase in ['train', 'predict'], \"Supported phase: train, predict,\"\n\n    def _set_multitask(self):\n        self._multi_task = True\n\n    def _set_nomultitask(self):\n        self._multi_task = False\n\n    def _set_task_id(self, task_id):\n        self._task_id = task_id\n\n    def _init_exe_prog(self, for_train=True):\n        if not self._train_init and not self._predict_init:\n            on_gpu = gpu_dev_count > 0\n            self._exe = helper.build_executor(on_gpu)\n\n        if for_train:\n            assert self._train_prog 
is not None, \"train graph not found! You should build_forward first before you random init parameters.\"\n            self._train_init = True\n        else:\n            assert self._pred_prog is not None, \"predict graph not found! You should build_predict_forward first before you random init parameters.\"\n            self._predict_init = True\n\n    def get_one_batch(self, phase='train'):\n        self._check_phase(phase)\n        if phase == 'train':\n            return next(self._train_reader)\n        elif phase == 'predict':\n            return next(self._predict_reader)\n        else:\n            raise NotImplementedError()\n\n    def _set_exe(self, exe):\n        self._exe = exe\n\n    def _set_dist_train(self, prog):\n        self._distribute_train_prog = prog\n\n    def _set_dist_pred(self, prog):\n        self._distribute_pred_prog = prog\n\n    def _set_fetch_list(self, fetch_list):\n        self._fetch_list = fetch_list\n\n    def train_one_step(self, batch):\n\n        if not self._dist_train_init:\n            self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name)\n            self._dist_train_init = True\n\n        exe = self._exe\n        distribute_train_prog = self._distribute_train_prog\n        fetch_list = self._fetch_list\n\n        if gpu_dev_count > 1:\n            feed, mask = batch\n            rt_outputs = exe.run(distribute_train_prog, feed=feed, fetch_list=fetch_list)\n            num_fakes = decode_fake(len(rt_outputs[0]), mask, self._train_batch_size)\n            if num_fakes:\n                rt_outputs = [i[:-num_fakes] for i in rt_outputs]\n        \n        
else:\n            feed = self._feed_batch_process_fn(batch)\n            rt_outputs = exe.run(distribute_train_prog, feed=feed, fetch_list=fetch_list)\n\n        rt_outputs = {k:v for k,v in zip(self._fetch_names, rt_outputs)}\n        self._cur_train_step += 1\n        self._check_save()\n        self._cur_train_epoch = (self._cur_train_step-1) // self._steps_pur_epoch\n        return rt_outputs\n\n    def predict_one_batch(self, batch):\n        if gpu_dev_count > 1:\n            feed, mask = batch\n            rt_outputs = self._exe.run(self._distribute_pred_prog, feed=feed, fetch_list=self._pred_fetch_list, use_prune=True)\n            num_fakes = decode_fake(len(rt_outputs[0]), mask, self._predict_batch_size)\n            if num_fakes:\n                rt_outputs = [i[:-num_fakes] for i in rt_outputs]\n        else:\n            feed = self._pred_feed_batch_process_fn(batch)\n            rt_outputs = self._exe.run(self._distribute_pred_prog, feed=feed, fetch_list=self._pred_fetch_list, use_prune=True)\n\n        rt_outputs = {k:v for k,v in zip(self._pred_fetch_name_list, rt_outputs)}\n        return rt_outputs\n\n    @property\n    def name(self):\n        return self._name\n    \n    @property\n    def num_examples(self):\n        return self._num_examples\n\n    @property\n    def mix_ratio(self):\n        return self._mix_ratio\n\n    @mix_ratio.setter\n    def mix_ratio(self, value):\n        self._mix_ratio = value\n\n    @property\n    def num_epochs(self):\n        return self._num_epochs\n\n    @property\n    def cur_train_step(self):\n        return self._cur_train_step\n\n    @property\n    def cur_train_epoch(self):\n        return self._cur_train_epoch\n\n    @property\n    def steps_pur_epoch(self):\n        return self._steps_pur_epoch\n\n    def _build_head(self, net_inputs, phase, scope=\"\"):\n        self._check_phase(phase)\n        if phase == 'train':\n            output_vars = self._task_head.build(net_inputs, scope_name=scope)\n        
if phase == 'predict':\n            output_vars = self._pred_head.build(net_inputs, scope_name=scope)\n        return output_vars\n    \n    def _save(self, save_path, suffix=None):\n        # dirpath = save_path.rstrip('/').rstrip('\\\\') + suffix\n        if suffix is not None:\n            dirpath = os.path.join(save_path, suffix)\n        else:\n            dirpath = save_path\n        self._pred_input_varname_list = [str(i) for i in self._pred_input_varname_list]\n\n        prog = self._pred_prog.clone()\n        fluid.io.save_inference_model(dirpath, self._pred_input_varname_list, self._pred_fetch_var_list, self._exe, prog)\n\n        conf = {}\n        for k, strv in self._save_protocol.items(): \n            d = None\n            v = locals()\n            exec('d={}'.format(strv), globals(), v)\n            conf[k] = v['d']\n        with open(os.path.join(dirpath, '__conf__'), 'w') as writer:\n            writer.write(json.dumps(conf, indent=1))\n        print(self._name + ': predict model saved at ' + dirpath)\n        sys.stdout.flush()\n\n    \n    def _load(self, infer_model_path=None):\n        if infer_model_path is None:\n            infer_model_path = self._save_infermodel_path\n        for k,v in json.load(open(os.path.join(infer_model_path, '__conf__'))).items(): \n            strv = self._save_protocol[k]\n            exec('{}=v'.format(strv))\n        pred_prog, self._pred_input_varname_list, self._pred_fetch_var_list = \\\n            fluid.io.load_inference_model(infer_model_path, self._exe)\n        print(self._name+': inference model loaded from ' + infer_model_path)\n        sys.stdout.flush()\n        return pred_prog\n\n"
  },
  {
    "path": "paddlepalm/utils/__init__.py",
    "content": "\nfrom . import basic_helper\nfrom . import config_helper\n\n"
  },
  {
    "path": "paddlepalm/utils/basic_helper.py",
    "content": "# coding=utf-8\nimport os\nimport json\nimport yaml\nfrom .config_helper import PDConfig\nimport logging\nfrom paddle import fluid\n\ndef get_basename(f):\n    return os.path.splitext(f)[0]\n\n\ndef get_suffix(f):\n    return os.path.splitext(f)[-1]\n\n\ndef parse_yaml(f, asdict=True, support_cmd_line=False):\n    assert os.path.exists(f), \"file {} not found.\".format(f)\n    if support_cmd_line:\n        args = PDConfig(yaml_file=f, fuse_args=True)\n        args.build()\n        return args.asdict() if asdict else args\n    else:\n        if asdict:\n            with open(f, \"r\") as fin: \n                yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)\n            return yaml_config\n        else:\n            raise NotImplementedError()\n\n\ndef parse_json(f, asdict=True, support_cmd_line=False):\n    assert os.path.exists(f), \"file {} not found.\".format(f)\n    if support_cmd_line:\n        args = PDConfig(json_file=f, fuse_args=support_cmd_line)\n        args.build()\n        return args.asdict() if asdict else args\n    else:\n        if asdict:\n            with open(f, \"r\") as fin: \n                config = json.load(fin)\n            return config\n        else:\n            raise NotImplementedError()\n            \n\ndef parse_list(string, astype=str):\n    assert isinstance(string, str), \"{} is not a string.\".format(string)\n    if ',' not in string:\n        return [astype(string)]\n    string = string.replace(',', ' ')\n    return [astype(i) for i in string.split()]\n\n\ndef try_float(s):\n    try:\n        return float(s)\n    except (TypeError, ValueError):\n        return s\n\n\n# TODO: add a None mechanism that allows hidden size, batch size and seqlen to be set to None\ndef check_io(in_attr, out_attr, strict=False, in_name=\"left\", out_name=\"right\"):\n    for name, attr in in_attr.items():\n        assert name in out_attr, in_name+': '+name+' not found in '+out_name\n        if attr != out_attr[name]:\n            if strict:\n                raise 
ValueError(name+': shape or dtype not consistent!')\n            else:\n                logging.warning('{}: shape or dtype not consistent!\\n{}:\\n{}\\n{}:\\n{}'.format(name, in_name, attr, out_name, out_attr[name]))\n\n\ndef encode_inputs(inputs, scope_name, sep='.', cand_set=None):\n    outputs = {}\n    for k, v in inputs.items():\n        if cand_set is not None:\n            if k in cand_set:\n                outputs[k] = v\n            if scope_name+sep+k in cand_set:\n                outputs[scope_name+sep+k] = v\n        else:\n            outputs[scope_name+sep+k] = v\n    return outputs\n\n\ndef decode_inputs(inputs, scope_name, sep='.', keep_unk_keys=True):\n    outputs = {}\n    for name, value in inputs.items():\n        # vars from the backbone are also available to tasks\n        if keep_unk_keys and sep not in name:\n            outputs[name] = value\n        # vars that belong to this instance\n        if name.startswith(scope_name+'.'):\n            outputs[name[len(scope_name+'.'):]] = value\n    return outputs\n\n\ndef build_executor(on_gpu):\n    if on_gpu:\n        place = fluid.CUDAPlace(0)\n        # dev_count = fluid.core.get_cuda_device_count()\n    else:\n        place = fluid.CPUPlace()\n        # dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))\n    # return fluid.Executor(place), dev_count\n    return fluid.Executor(place)\n\n\ndef fit_attr(conf, fit_attr, strict=False):\n    for i, attr in fit_attr.items():\n        if i not in conf:\n            if strict:\n                raise Exception('Argument {} is required to create a controller.'.format(i))\n            else:\n                continue\n        conf[i] = attr(conf[i])\n    return conf\n"
  },
  {
    "path": "paddlepalm/utils/config_helper.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nimport sys\nimport argparse\nimport json\nimport yaml\nimport six\nimport logging\n\nlogging_only_message = \"%(message)s\"\nlogging_details = \"%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s\"\n\n\nclass JsonConfig(object):\n    \"\"\"\n    A high-level api for handling json configure file.\n    \"\"\"\n\n    def __init__(self, config_path):\n        self._config_dict = self._parse(config_path)\n\n    def _parse(self, config_path):\n        try:\n            with open(config_path) as json_file:\n                config_dict = json.load(json_file)\n                assert isinstance(config_dict, dict), \"Object in {} is NOT a dict.\".format(config_path)\n        except:\n            raise IOError(\"Error in parsing bert model config file '%s'\" %\n                          config_path)\n        else:\n            return config_dict\n\n    def __getitem__(self, key):\n        return self._config_dict[key]\n\n    def asdict(self):\n        return self._config_dict\n\n    def print_config(self):\n        for arg, value in sorted(six.iteritems(self._config_dict)):\n            print('%s: %s' % (arg, value))\n        
print('------------------------------------------------')\n\n\nclass ArgumentGroup(object):\n    def __init__(self, parser, title, des):\n        self._group = parser.add_argument_group(title=title, description=des)\n\n    def add_arg(self, name, type, default, help, **kwargs):\n        type = str2bool if type == bool else type\n        self._group.add_argument(\n            \"--\" + name,\n            default=default,\n            type=type,\n            help=help + ' Default: %(default)s.',\n            **kwargs)\n\n\nclass ArgConfig(object):\n    \"\"\"\n    A high-level api for handling argument configs.\n    \"\"\"\n\n    def __init__(self):\n        parser = argparse.ArgumentParser()\n\n        train_g = ArgumentGroup(parser, \"training\", \"training options.\")\n        train_g.add_arg(\"epoch\", int, 3, \"Number of epochs for fine-tuning.\")\n        train_g.add_arg(\"learning_rate\", float, 5e-5,\n                        \"Learning rate used to train with warmup.\")\n        train_g.add_arg(\n            \"lr_scheduler\",\n            str,\n            \"linear_warmup_decay\",\n            \"Scheduler of the learning rate.\",\n            choices=['linear_warmup_decay', 'noam_decay'])\n        train_g.add_arg(\"weight_decay\", float, 0.01,\n                        \"Weight decay rate for L2 regularizer.\")\n        train_g.add_arg(\n            \"warmup_proportion\", float, 0.1,\n            \"Proportion of training steps to perform linear learning rate warmup for.\"\n        )\n        train_g.add_arg(\"save_steps\", int, 1000,\n                        \"The steps interval to save checkpoints.\")\n        train_g.add_arg(\n            \"loss_scaling\", float, 1.0,\n            \"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.\"\n        )\n        train_g.add_arg(\"pred_dir\", str, None,\n                        \"Path to save the prediction results.\")\n\n        log_g = ArgumentGroup(parser, \"logging\", \"logging 
related.\")\n        log_g.add_arg(\"skip_steps\", int, 10,\n                      \"The steps interval to print loss.\")\n        log_g.add_arg(\"verbose\", bool, False, \"Whether to output verbose log.\")\n\n        run_type_g = ArgumentGroup(parser, \"run_type\", \"running type options.\")\n        run_type_g.add_arg(\"use_cuda\", bool, True,\n                           \"If set, use GPU for training.\")\n        run_type_g.add_arg(\n            \"use_fast_executor\", bool, False,\n            \"If set, use fast parallel executor (in experiment).\")\n        run_type_g.add_arg(\n            \"num_iteration_per_drop_scope\", int, 1,\n            \"Ihe iteration intervals to clean up temporary variables.\")\n        run_type_g.add_arg(\"do_train\", bool, True,\n                           \"Whether to perform training.\")\n        run_type_g.add_arg(\"do_predict\", bool, True,\n                           \"Whether to perform prediction.\")\n\n        custom_g = ArgumentGroup(parser, \"customize\", \"customized options.\")\n\n        self.custom_g = custom_g\n\n        self.parser = parser\n\n    def add_arg(self, name, dtype, default, descrip):\n        self.custom_g.add_arg(name, dtype, default, descrip)\n\n    def build_conf(self):\n        return self.parser.parse_args()\n\n\ndef str2bool(v):\n    # because argparse does not support to parse \"true, False\" as python\n    # boolean directly\n    return v.lower() in (\"true\", \"t\", \"1\")\n\n\ndef print_arguments(args, log=None):\n    if not log:\n        print('-----------  Configuration Arguments -----------')\n        for arg, value in sorted(six.iteritems(vars(args))):\n            print('%s: %s' % (arg, value))\n        print('------------------------------------------------')\n    else:\n        log.info('-----------  Configuration Arguments -----------')\n        for arg, value in sorted(six.iteritems(vars(args))):\n            log.info('%s: %s' % (arg, value))\n        
log.info('------------------------------------------------')\n\n\nclass PDConfig(object):\n    \"\"\"\n    A high-level API for managing configuration files in PaddlePaddle.\n    Can jointly work with command-line arguments, json files and yaml files.\n    \"\"\"\n\n    def __init__(self, json_file=None, yaml_file=None, fuse_args=True):\n        \"\"\"\n            Init function for PDConfig.\n            json_file: the path to the json configure file.\n            yaml_file: the path to the yaml configure file.\n            fuse_args: whether to fuse the json/yaml configs with argparse.\n        \"\"\"\n\n        if json_file is not None and yaml_file is not None:\n            raise ValueError(\n                \"json_file and yaml_file can not co-exist for now. Please only use one configure file type.\"\n            )\n\n        self.args = None\n        self.arg_config = {}\n        self.json_config = {}\n        self.yaml_config = {}\n\n        parser = argparse.ArgumentParser()\n\n        self.yaml_g = ArgumentGroup(parser, \"yaml\", \"options from yaml.\")\n        self.json_g = ArgumentGroup(parser, \"json\", \"options from json.\")\n        self.com_g = ArgumentGroup(parser, \"custom\", \"customized options.\")\n\n        self.parser = parser\n\n        if json_file is not None:\n            assert isinstance(json_file, str)\n            self.load_json(json_file, fuse_args=fuse_args)\n\n        if yaml_file is not None:\n            assert isinstance(yaml_file, str) or isinstance(yaml_file, list)\n            self.load_yaml(yaml_file, fuse_args=fuse_args)\n\n    def load_json(self, file_path, fuse_args=True):\n\n        if not os.path.exists(file_path):\n            raise IOError(\"the json file %s does not exist.\" % file_path)\n\n        with open(file_path, \"r\") as fin:\n            self.json_config = json.load(fin)\n\n        if fuse_args:\n            for name in self.json_config:\n         
       if not isinstance(self.json_config[name], int) \\\n                    and not isinstance(self.json_config[name], float) \\\n                    and not isinstance(self.json_config[name], str) \\\n                    and not isinstance(self.json_config[name], bool):\n\n                    continue\n\n                self.json_g.add_arg(name,\n                                    type(self.json_config[name]),\n                                    self.json_config[name],\n                                    \"This is from %s\" % file_path)\n\n    def load_yaml(self, file_path_list, fuse_args=True):\n\n        if isinstance(file_path_list, str):\n            file_path_list = [file_path_list]\n        for file_path in file_path_list: \n            if not os.path.exists(file_path):\n                raise IOError(\"the yaml file %s does not exist.\" % file_path)\n\n            with open(file_path, \"r\") as fin: \n                yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)\n            # accumulate configs from multiple yaml files instead of\n            # overwriting the previously loaded ones\n            self.yaml_config.update(yaml_config)\n            if fuse_args:\n                for name in yaml_config:\n                    if not isinstance(yaml_config[name], int) \\\n                        and not isinstance(yaml_config[name], float) \\\n                        and not isinstance(yaml_config[name], str) \\\n                        and not isinstance(yaml_config[name], bool):\n\n                        continue\n\n                    self.yaml_g.add_arg(name,\n                                        type(yaml_config[name]),\n                                        yaml_config[name],\n                                        \"This is from %s\" % file_path)\n\n    def build(self):\n        self.args = self.parser.parse_args()\n        self.arg_config = vars(self.args)\n\n    def asdict(self):\n        return self.arg_config\n\n    def __add__(self, new_arg):\n        assert isinstance(new_arg, list) or isinstance(new_arg, tuple)\n        assert len(new_arg) 
>= 3\n        assert self.args is None\n\n        name = new_arg[0]\n        dtype = new_arg[1]\n        dvalue = new_arg[2]\n        desc = new_arg[3] if len(\n            new_arg) == 4 else \"Description is not provided.\"\n\n        self.com_g.add_arg(name, dtype, dvalue, desc)\n\n        return self\n\n    def __getattr__(self, name):\n        if name in self.arg_config:\n            return self.arg_config[name]\n\n        if name in self.json_config:\n            return self.json_config[name]\n\n        if name in self.yaml_config:\n            return self.yaml_config[name]\n\n        # __getattr__ must raise AttributeError so that hasattr() and\n        # getattr() with a default value behave as expected\n        raise AttributeError(\"The argument %s is not defined.\" % name)\n\n    def Print(self):\n\n        print(\"-\" * 70)\n        for name in self.arg_config:\n            print(\"{: <25}\\t{}\".format(str(name), str(self.arg_config[name])))\n\n        for name in self.json_config:\n            if name not in self.arg_config:\n                print(\"{: <25}\\t{}\".format(\n                    str(name), str(self.json_config[name])))\n\n        for name in self.yaml_config:\n            if name not in self.arg_config:\n                print(\"{: <25}\\t{}\".format(\n                    str(name), str(self.yaml_config[name])))\n\n        print(\"-\" * 70)\n\n\nif __name__ == \"__main__\":\n    pd_config = PDConfig(yaml_file=\"./test/bert_config.yaml\")\n    pd_config += (\"my_age\", int, 18, \"I am forever 18.\")\n    pd_config.build()\n\n    print(pd_config.do_train)\n    print(pd_config.hidden_size)\n    print(pd_config.my_age)\n"
  },
  {
    "path": "paddlepalm/utils/plot_helper.py",
    "content": ""
  },
  {
    "path": "paddlepalm/utils/print_helper.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nMAXLEN = 70\ndef print_dict(dic, title=\"\"):\n\n    if title:\n        title = ' ' + title + ' '\n        left_len = (MAXLEN - len(title)) // 2\n        title = '-' * left_len + title\n        right_len = MAXLEN - len(title)\n        title = title + '-' * right_len\n    else:\n        title = '-' * MAXLEN\n    print(title)\n    for name in dic:\n        print(\"{: <25}\\t{}\".format(str(name), str(dic[name])))\n    print(\"\")\n    # print(\"-\" * MAXLEN + '\\n')\n"
  },
  {
    "path": "paddlepalm/utils/reader_helper.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport os\nimport sys\nimport random\nimport logging\nimport numpy as np\nimport paddle\nfrom paddle import fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.distribute import gpu_dev_count, cpu_dev_count\nimport six\ndev_count = 1 if gpu_dev_count <= 1 else gpu_dev_count\n\n\ndef create_feed_batch_process_fn(net_inputs):\n    \n    def feed_batch_process_fn(data, id=-1, phase='train', is_multi=False):\n        temp = {}\n        if dev_count > 1 and phase=='train' and is_multi:\n            inputs = net_inputs[id]\n        else:\n            inputs= net_inputs\n\n        for q, var in inputs.items():\n            \n            if isinstance(var, str) or (six.PY3 and isinstance(var, bytes)) or (six.PY2 and isinstance(var, unicode)):\n                temp[var] = data[q]\n            else:\n                temp[var.name] = data[q]\n        return temp\n\n    return feed_batch_process_fn\n\n\n# def create_multihead_feed_batch_process_fn(net_inputs):\n# \n#     def feed_batch_process_fn(data, id=-1):\n#         # temps = {}\n#         # for i in range(len(net_inputs)):\n#         temp = {}\n#         inputs = net_inputs[id] if id != -1 else net_inputs\n#         \n#         for q, var in inputs.items():\n#             if isinstance(var, str) or isinstance(var, unicode):\n#                 temp[var] = 
data[q]\n#             else:\n#                 temp[var.name] = data[q]\n#             # temps[i] = temp\n#             \n#         return temp\n# \n#     return feed_batch_process_fn\n\n\ndef check_io(in_attr, out_attr, strict=False, in_name=\"left\", out_name=\"right\"):\n    for name, attr in in_attr.items():\n        assert name in out_attr, in_name+': '+name+' not found in '+out_name\n        if attr != out_attr[name]:\n            if strict:\n                raise ValueError(name+': shape or dtype not consistent!')\n            else:\n                logging.warning('{}: shape or dtype not consistent!\\n{}:\\n{}\\n{}:\\n{}'.format(name, in_name, attr, out_name, out_attr[name]))\n\n\ndef _check_and_adapt_shape_dtype(rt_val, attr, message=\"\"):\n    if not isinstance(rt_val, np.ndarray):\n        if rt_val is None:\n            raise Exception(message+\": got a None value.\")\n        rt_val = np.array(rt_val)\n        assert rt_val.dtype != np.dtype('O'), message+\"yielded data is not a valid tensor (number of elements on some dimension may not be consistent): {}\".format(rt_val)\n        if rt_val.dtype == np.dtype('float64'):\n            rt_val = rt_val.astype('float32')\n    \n    shape, dtype = attr\n    assert rt_val.dtype == np.dtype(dtype), message+\"yielded data type not consistent with attr settings. Expect: {}, receive: {}.\".format(np.dtype(dtype), rt_val.dtype)\n    assert len(shape) == rt_val.ndim, message+\"yielded data rank(ndim) not consistent with attr settings. 
Expect: {}, receive: {}.\".format(len(shape), rt_val.ndim)\n    for rt, exp in zip(rt_val.shape, shape):\n        if exp is None or exp < 0:\n            continue\n        assert rt == exp, \"yielded data shape is not consistent with attr settings.Expected:{}Actual:{}\".format(exp, rt)\n    return rt_val\n    \n\ndef _zero_batch(attrs):\n    pos_attrs = []\n    for shape, dtype in attrs:\n        pos_shape = [size if size and size > 0 else 1 for size in shape]\n        pos_attrs.append([pos_shape, dtype])\n\n    return [np.zeros(shape=shape, dtype=dtype) for shape, dtype in pos_attrs]\n\n\ndef _zero_batch_x(attrs, batch_size):\n    pos_attrs = []\n    for shape, dtype in attrs:\n        pos_shape = [size for size in shape]\n        if pos_shape[0] == -1:\n            pos_shape[0] = batch_size\n        if pos_shape[1] == -1:\n            pos_shape[1] = 512 # max seq len\n        pos_attrs.append([pos_shape, dtype])\n\n    return [np.zeros(shape=shape, dtype=dtype) for shape, dtype in pos_attrs]\n\n\ndef create_net_inputs(input_attrs, is_async=False, iterator_fn=None, dev_count=1, n_prefetch=1):\n    inputs = []\n    ret = {}\n    for name, shape, dtype in input_attrs:\n        p = layers.data(name, shape=shape, dtype=dtype)\n        ret[name] = p\n        inputs.append(p)\n\n    if is_async:\n        assert iterator_fn is not None, \"iterator_fn is needed for building async input layer.\"\n        reader = fluid.io.PyReader(inputs, capacity=dev_count, iterable=False)\n        reader.decorate_batch_generator(iterator_fn)\n        reader.start()\n\n    return ret\n\n\ndef create_iterator_fn(iterator, iterator_prefix, shape_and_dtypes, outname_to_pos, verbose=0, return_type='list'):\n\n    pos_to_outname = {j:i for i,j in outname_to_pos.items()}\n    \n    def iterator_fn():\n        v = verbose\n        for outputs in iterator:\n            results = [None] * len(outname_to_pos)\n            prefix = iterator_prefix\n            for outname, val in outputs.items():\n  
              task_outname = prefix + '.' + outname\n\n                if outname in outname_to_pos:\n                    idx = outname_to_pos[outname]\n                    val = _check_and_adapt_shape_dtype(val, shape_and_dtypes[idx])\n                    results[idx] = val\n\n                if task_outname in outname_to_pos:\n                    idx = outname_to_pos[task_outname]\n                    val = _check_and_adapt_shape_dtype(val, shape_and_dtypes[idx])\n                    results[idx] = val\n            if return_type == 'list':\n                yield results\n            elif return_type == 'dict':\n                temp = {}\n                for pos, i in enumerate(results):\n                    temp[pos_to_outname[pos]] = i\n\n                yield temp\n\n    return iterator_fn\n\ndef create_multihead_inference_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, names, outname_to_pos, task_name2id, dev_count=1):\n    \n    def iterator(task_name):\n        while True:\n            id = task_name2id[task_name]\n            # id = np.random.choice(task_ids, p=weights)\n            task_id_tensor = np.array([id]).astype(\"int64\")\n            \n            for i in range(dev_count):\n                \n                outputs = next(iterators[id]) # dict type\n\n                prefix = iterator_prefixes[id]\n                results = {}\n                results['__task_id'] = task_id_tensor\n                for outname, val in outputs.items():\n                    task_outname = prefix + '.' 
+ outname\n\n                    if outname in names[id]:\n                        idx = outname_to_pos[id][outname]\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=outname+': ')\n                        results[outname] = val\n\n                    if task_outname in names[id]:\n                        idx = outname_to_pos[id][task_outname]\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=task_outname+': ')\n                        results[task_outname] = val\n\n                yield results\n\n    return iterator\n\n\ndef create_multihead_iterator_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, mrs, names, outname_to_pos, dev_count=1, keep_one_task=True):\n    task_ids = range(len(iterators))\n    weights = [mr / float(sum(mrs)) for mr in mrs]\n    if not keep_one_task:\n        dev_count = 1\n\n    def iterator():\n        while True:\n            id = np.random.choice(task_ids, p=weights)\n            task_id_tensor = np.array([id]).astype(\"int64\")\n            \n            for i in range(dev_count):\n                \n                outputs = next(iterators[id]) # dict type\n\n                prefix = iterator_prefixes[id]\n                results = {}\n                results['__task_id'] = task_id_tensor\n                for outname, val in outputs.items():\n                    task_outname = prefix + '.' 
+ outname\n\n                    if outname in names[id]:\n                        idx = outname_to_pos[id][outname]\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=outname+': ')\n                        results[outname] = val\n\n                    if task_outname in names[id]:\n                        idx = outname_to_pos[id][task_outname]\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=task_outname+': ')\n                        results[task_outname] = val\n\n                yield results\n\n    return iterator\n\n\ndef create_joint_iterator_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, mrs, outname_to_pos, dev_count=1, keep_one_task=True, verbose=0):\n    \"\"\"\n        joint_shape_and_dtypes: essentially determined by the attrs of the backbone and the task paradigm, with the -1 (variable) dimensions automatically filled in from the reader's attrs; validating it against the iterator therefore checks batch correctness at runtime\n    \"\"\"\n\n    task_ids = range(len(iterators))\n    weights = [mr / float(sum(mrs)) for mr in mrs]\n    if not keep_one_task:\n        dev_count = 1\n\n    results = _zero_batch(joint_shape_and_dtypes)\n    outbuf = {}\n    for id in task_ids:\n        outputs = next(iterators[id]) # dict type\n        outbuf[id] = outputs\n        prefix = iterator_prefixes[id]\n        for outname, val in outputs.items():\n            task_outname = prefix + '.' 
+ outname\n\n            if outname in outname_to_pos:\n                idx = outname_to_pos[outname]\n                val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=outname+': ')\n                results[idx] = val\n\n            if task_outname in outname_to_pos:\n                idx = outname_to_pos[task_outname]\n                val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=task_outname+': ')\n                results[idx] = val\n\n    fake_batch = results\n    dev_count_bak = dev_count\n\n    def iterator():\n        v = verbose\n        has_show_warn = False\n        while True:\n            id = np.random.choice(task_ids, p=weights)\n            results = fake_batch\n            if v > 0:\n                print('----- debug joint iterator -----')\n                print('sampled task id: '+str(id))\n            task_id_tensor = np.array([[id]]).astype(\"int64\")\n            \n            for i in range(dev_count):\n                \n                results[outname_to_pos['__task_id']] = task_id_tensor\n                assert outname_to_pos['__task_id'] == 0\n\n                if id in outbuf:\n                    outputs = outbuf[id]\n                    del outbuf[id]\n                else:\n                    outputs = next(iterators[id]) # dict type\n\n                if 'token_ids' in outputs:\n                    val1 = len(outputs['token_ids'])\n                    val = _check_and_adapt_shape_dtype([val1], [[1], 'int64'])\n                    results[outname_to_pos['batch_size']] = val\n\n                    val2 = len(outputs['token_ids'][0])\n                    val = _check_and_adapt_shape_dtype([val2], [[1], 'int64'])\n                    results[outname_to_pos['seqlen']] = val\n\n                    val = _check_and_adapt_shape_dtype([val1*val2], [[1], 'int64'])\n                    results[outname_to_pos['batchsize_x_seqlen']] = val\n                else:\n                    if not 
has_show_warn:\n                        print('WARNING: token_ids not found in current batch, failed to yield batch_size, seqlen and batchsize_x_seqlen. (This message is shown only once.)')\n                        has_show_warn = True\n\n                prefix = iterator_prefixes[id]\n                for outname, val in outputs.items():\n                    if v > 0:\n                        print('reader generated: '+outname)\n                    task_outname = prefix + '.' + outname\n\n                    if outname in outname_to_pos:\n                        idx = outname_to_pos[outname]\n                        if v > 0:\n                            print(outname + ' is inserted at idx ' + str(idx))\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=outname+': ')\n                        results[idx] = val\n\n                    if task_outname in outname_to_pos:\n                        idx = outname_to_pos[task_outname]\n                        if v > 0:\n                            print(task_outname + ' is inserted at idx ' + str(idx))\n                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=task_outname+': ')\n                        results[idx] = val\n\n                if v > 0:\n                    print('yielded batch len and shapes:')\n                    print(len(results))\n                    for i in results:\n                        print(np.shape(i))\n                    print('')\n                    v -= 1\n                yield results\n\n    return iterator\n\n\ndef merge_input_attrs(backbone_attr, task_attrs, insert_taskid=True, insert_batchsize=False, insert_seqlen=False, insert_batchsize_x_seqlen=False):\n    \"\"\"\n    Args:\n        task_attrs(list[dict]|dict): task input attributes, key=attr_name, val=[shape, dtype], supports single task and nested tasks\n    \"\"\"\n    if isinstance(task_attrs, dict):\n        task_attrs = 
[task_attrs]\n\n    ret = []\n    names = []\n    start = 0\n    if insert_taskid:\n        ret.append(([1, 1], 'int64'))\n        names.append('__task_id')\n        start += 1\n    \n    if insert_batchsize:\n        ret.append(([1], 'int64'))\n        names.append('batch_size')\n        start += 1\n\n    if insert_seqlen:\n        ret.append(([1], 'int64'))\n        names.append('seqlen')\n        start += 1\n\n    if insert_batchsize_x_seqlen:\n        ret.append(([1], 'int64'))\n        names.append('batchsize_x_seqlen')\n        start += 1\n        \n    names += sorted(backbone_attr.keys())\n    ret.extend([backbone_attr[k] for k in names[start:]])\n    name_to_position = {}\n    # when insert_taskid is True, '__task_id' occupies pos=0\n    for pos, k in enumerate(names):\n        name_to_position[k] = pos\n    for task_attr in task_attrs:\n        task_names = sorted(task_attr.keys())\n        names.extend(task_names)\n        ret.extend([task_attr[k] for k in task_names])\n        for pos, k in enumerate(task_names, start=len(name_to_position)):\n            name_to_position[k] = pos\n    return names, ret, name_to_position\n"
  },
  {
    "path": "paddlepalm/utils/saver.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom __future__ import print_function\n\nimport os\nimport six\nimport ast\nimport copy\nimport tarfile\nimport shutil\n\nimport numpy as np\nimport paddle.fluid as fluid\n\ndef init_checkpoint(exe, init_checkpoint_path, main_program, skip_list = []):\n    assert os.path.exists(\n        init_checkpoint_path), \"[%s] cann't be found.\" % init_checkpoint_path\n\n    def existed_persitables(var):\n        if not fluid.io.is_persistable(var):\n            return False\n        if var.name in skip_list:\n            return False\n        return os.path.exists(os.path.join(init_checkpoint_path, var.name))\n\n    fluid.io.load_vars(\n        exe,\n        init_checkpoint_path,\n        main_program=main_program,\n        predicate=existed_persitables)\n    print(\"Load model from {}\".format(init_checkpoint_path))\n\n\ndef init_pretraining_params(exe,\n                            pretraining_params_path,\n                            convert,\n                            main_program,\n                            strict=False):\n                            \n    assert os.path.exists(pretraining_params_path\n                          ), \"[%s] cann't be found.\" % pretraining_params_path\n\n    if convert:\n        assert os.path.exists(os.path.join(pretraining_params_path, '__palmmodel__')), \"__palmmodel__ not 
found.\"\n\n        with tarfile.open(os.path.join(pretraining_params_path, '__palmmodel__'), 'r') as f:\n            f.extractall(os.path.join(pretraining_params_path, '.temp'))\n        \n        log_path = os.path.join(pretraining_params_path, '__palmmodel__')\n        pretraining_params_path = os.path.join(pretraining_params_path, '.temp')\n\n    else:\n        log_path = pretraining_params_path\n    \n    print(\"Loading pretraining parameters from {}...\".format(pretraining_params_path))\n\n    def existed_params(var):\n        if not isinstance(var, fluid.framework.Parameter):\n            return False\n        if not os.path.exists(os.path.join(pretraining_params_path, var.name)):\n            if strict:\n                raise Exception('Error: {} not found in {}.'.format(var.name, log_path))\n            else:\n                print('Warning: {} not found in {}.'.format(var.name, log_path))\n        return os.path.exists(os.path.join(pretraining_params_path, var.name))\n\n    fluid.io.load_vars(\n        exe,\n        pretraining_params_path,\n        main_program=main_program,\n        predicate=existed_params)\n    if convert:\n        shutil.rmtree(pretraining_params_path)\n    print('')\n\n\n"
  },
  {
    "path": "paddlepalm/utils/textprocess_helper.py",
    "content": "# -*- coding: UTF-8 -*-\n#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\ndef is_whitespace(c):\n    # True for ASCII whitespace and the narrow no-break space (U+202F)\n    return c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F\n"
  },
  {
    "path": "setup.cfg",
    "content": "[metadata]\n\nname = paddlepalm\n\nauthor = zhangyiming\nauthor_email = zhangyiming04@baidu.com\n\nversion = 2.1.0\n\ndescription = PaddlePALM\nlong_description = file: README.md\nlong_description_content_type = text/markdown\n\nhome_page = https://github.com/PaddlePaddle/PALM\nlicense = Apache 2.0\n\nclassifier =\n    Private :: Do Not Upload\n    Programming Language :: Python\n    Programming Language :: Python :: 2\n    Programming Language :: Python :: 2.7\n    Programming Language :: Python :: 3\n    Programming Language :: Python :: 3.5\n    Programming Language :: Python :: 3.6\n    Programming Language :: Python :: 3.7\n\nkeywords =\n    paddlepaddle\n    paddle\n    nlp\n    pretrain\n    multi-task-learning\n\n[options]\n\npackages = find:\n\ninclude_package_data = True\nzip_safe = False\n\n[sdist]\ndist_dir = output/dist\n\n[bdist_wheel]\ndist_dir = output/dist\n\n[easy_install]\nindex_url = http://pip.baidu.com/root/baidu/+simple/\n\n\n\n"
  },
  {
    "path": "setup.py",
    "content": "# -*- coding: UTF-8 -*-\n################################################################################\n#\n#   Copyright (c) 2019  Baidu.com, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n################################################################################\n\"\"\"\nSetup script.\nAuthors: zhouxiangyang(zhouxiangyang@baidu.com)\nDate:    2020/2/4 00:00:01\n\"\"\"\nimport setuptools\nwith open(\"README.md\", \"r\") as fh:\n    long_description = fh.read()\nsetuptools.setup(\n    name=\"paddlepalm\",\n    version=\"2.1.0\",\n    author=\"PaddlePaddle\",\n    author_email=\"zhangyiming04@baidu.com\",\n    description=\"a flexible, general and easy-to-use NLP large-scale pretraining and multi-task learning framework.\",\n    # long_description=long_description,\n    # long_description_content_type=\"text/markdown\",\n    url=\"https://github.com/PaddlePaddle/PALM\",\n    # packages=setuptools.find_packages(),\n    packages=['paddlepalm',\n        'paddlepalm.backbone',\n        'paddlepalm.backbone.utils',\n        'paddlepalm.optimizer',\n        'paddlepalm.reader',\n        'paddlepalm.reader.utils',\n        'paddlepalm.head',\n        'paddlepalm.distribute',\n        'paddlepalm.lr_sched',\n        'paddlepalm.tokenizer',\n        'paddlepalm.utils'],\n    package_dir={'paddlepalm': './paddlepalm',\n                 'paddlepalm.backbone': './paddlepalm/backbone',\n                 'paddlepalm.backbone.utils': './paddlepalm/backbone/utils',\n                 'paddlepalm.optimizer': './paddlepalm/optimizer',\n                 'paddlepalm.lr_sched': './paddlepalm/lr_sched',\n                 'paddlepalm.distribute': './paddlepalm/distribute',\n                 'paddlepalm.reader': './paddlepalm/reader',\n                 'paddlepalm.reader.utils': './paddlepalm/reader/utils',\n                 'paddlepalm.head': './paddlepalm/head',\n                 'paddlepalm.tokenizer': './paddlepalm/tokenizer',\n                 'paddlepalm.utils': './paddlepalm/utils'},\n    platforms=\"any\",\n    license='Apache 2.0',\n    classifiers=[\n        'License :: OSI Approved :: Apache Software License',\n        'Programming Language :: Python',\n        'Programming Language :: Python :: 2',\n        'Programming Language :: Python :: 2.7',\n        'Programming Language :: Python :: 3',\n        'Programming Language :: Python :: 3.5',\n        'Programming Language :: Python :: 3.6',\n        'Programming Language :: Python :: 3.7',\n    ],\n    install_requires=[\n        'paddlepaddle-gpu>=1.8.0'\n    ]\n)\n"
  },
  {
    "path": "test/test2/config.yaml",
    "content": "task_instance: \"mrqa, mlm4mrqa, match4mrqa\"\ntarget_tag: 1, 0, 0\nmix_ratio: 1.0, 0.5, 0.5\n\nsave_path: \"output_model/secondrun\"\n\nbackbone: \"ernie\"\nbackbone_config_path: \"../../pretrain_model/ernie/ernie_config.json\"\n\nvocab_path: \"../../pretrain_model/ernie/vocab.txt\"\ndo_lower_case: True\nmax_seq_len: 512\n\nbatch_size: 4\nnum_epochs: 2\noptimizer: \"adam\"\nlearning_rate: 3e-5\nwarmup_proportion: 0.1\nweight_decay: 0.1\n\nprint_every_n_steps: 1\n"
  },
  {
    "path": "test/test2/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n    max_seqlen = 512\n    batch_size = 4\n    num_epochs = 2\n    lr = 1e-3\n    vocab_path = './pretrain/ernie/vocab.txt'\n\n    train_file = './data/cls4mrqa/train.tsv'\n    predict_file = './data/cls4mrqa/dev.tsv'\n\n    config = json.load(open('./pretrain/ernie/ernie_config.json'))\n    # ernie = palm.backbone.ERNIE(...)\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # cls_reader2 = palm.reader.cls(train_file_topic, vocab_path, batch_size, max_seqlen)\n    # cls_reader3 = palm.reader.cls(train_file_subj, vocab_path, batch_size, max_seqlen)\n    # topic_trainer = palm.Trainer('topic_cls', cls_reader2, cls)\n    # subj_trainer = palm.Trainer('subj_cls', cls_reader3, cls)\n\n    # Create the readers for the classification tasks; their arguments control the dataset format, file layout, preprocessing rules, etc.\n    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen)\n    cls_reader2 = palm.reader.ClassifyReader(vocab_path, max_seqlen)\n    print(cls_reader.outputs_attr)\n    # Different backbones require different input features from the task reader. For classification the basic features are token_ids and label_ids, but BERT additionally needs position, segment, input_mask, etc., so after register_with() the reader automatically supplements the fields required by the backbone\n    cls_reader.register_with(ernie)\n    cls_reader2.register_with(ernie)\n    print(cls_reader.outputs_attr)\n\n    print(\"preparing data...\")\n    print(cls_reader.num_examples)\n    cls_reader.load_data(train_file, batch_size)\n    cls_reader2.load_data(train_file, batch_size)\n    print(cls_reader.num_examples)\n    print('done!')\n\n    # Create the task heads (e.g. classification, matching, machine reading comprehension). Each head has required/optional task-specific arguments. Note that heads are decoupled from readers: a head is valid as long as the dataset-side fields it depends on can be provided by the reader\n    cls_head = palm.head.Classify(4, 1024, 0.1)\n    cls_head2 = palm.head.Classify(4, 1024, 0.1)\n\n    # Create trainers from the readers and task heads. A trainer represents one training task: it maintains the training state and key task information, performs validity checks, and controls model saving/loading for the task\n    trainer = palm.Trainer('cls')\n    trainer2 = palm.Trainer('senti_cls')\n    mh_trainer = palm.MultiHeadTrainer([trainer, trainer2])\n\n    # match4mrqa.reuse_head_with(mrc4mrqa)\n\n    # data_vars = cls_reader.build()\n    # output_vars = ernie.build(data_vars)\n    # cls_head.build({'backbone': output_vars, 'reader': data_vars})\n\n    loss_var = mh_trainer.build_forward(ernie, [cls_head, cls_head2])\n\n    n_steps = cls_reader.num_examples * num_epochs // batch_size\n    warmup_steps = int(0.1 * n_steps)\n    print(warmup_steps)\n    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n\n    mh_trainer.build_backward(optimizer=adam, weight_decay=0.001)\n\n    # mh_trainer.random_init_params()\n    mh_trainer.load_pretrain('pretrain/ernie/params')\n\n    # trainer.train(iterator_fn, print_steps=1, save_steps=5, save_path='outputs', save_type='ckpt,predict')\n    mh_trainer.fit_readers_with_mixratio([cls_reader, cls_reader2], 'cls', 2)\n    mh_trainer.train(print_steps=1)\n    # trainer.save()\n"
  },
  {
    "path": "test/test2/run.sh",
    "content": "export CUDA_VISIBLE_DEVICES=3\npython run.py \n\n"
  },
  {
    "path": "test/test3/config.yaml",
    "content": "task_instance: \"cls1, cls2, cls3, cls4, cls5, cls6\"\n\ntask_reuse_tag: 0,0,1,1,0,2\n\nsave_path: \"output_model/thirdrun\"\n\nbackbone: \"ernie\"\nbackbone_config_path: \"../../pretrain_model/ernie/ernie_config.json\"\n\nvocab_path: \"../../pretrain_model/ernie/vocab.txt\"\ndo_lower_case: True\nmax_seq_len: 512\n\nbatch_size: 4\nnum_epochs: 2\noptimizer: \"adam\"\nlearning_rate: 3e-5\nwarmup_proportion: 0.1\nweight_decay: 0.1\n\nprint_every_n_steps: 1\n"
  },
  {
    "path": "test/test3/run.py",
    "content": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n    max_seqlen = 512\n    batch_size = 4\n    num_epochs = 2\n    lr = 1e-3\n    vocab_path = './pretrain/ernie/vocab.txt'\n\n    train_file = './data/cls4mrqa/train.tsv'\n    predict_file = './data/cls4mrqa/dev.tsv'\n\n    config = json.load(open('./pretrain/ernie/ernie_config.json'))\n    # ernie = palm.backbone.ERNIE(...)\n    ernie = palm.backbone.ERNIE.from_config(config)\n\n    # cls_reader2 = palm.reader.cls(train_file_topic, vocab_path, batch_size, max_seqlen)\n    # cls_reader3 = palm.reader.cls(train_file_subj, vocab_path, batch_size, max_seqlen)\n    # topic_trainer = palm.Trainer('topic_cls', cls_reader2, cls)\n    # subj_trainer = palm.Trainer('subj_cls', cls_reader3, cls)\n\n    # Create the readers for this classification task; their arguments control the dataset format, file layout, preprocessing rules, etc.\n    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen)\n    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, phase='predict')\n    print(cls_reader.outputs_attr)\n    print(predict_cls_reader.outputs_attr)\n    # Different backbones require different input features from the task reader. For classification the basic features are token_ids and label_ids, but BERT additionally needs position, segment, input_mask, etc., so after register_with() the reader automatically supplements the fields required by the backbone\n    cls_reader.register_with(ernie)\n    print(cls_reader.outputs_attr)\n    print(predict_cls_reader.outputs_attr)\n\n    print(\"preparing data...\")\n    print(cls_reader.num_examples)\n    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)\n    print(cls_reader.num_examples)\n    print('done!')\n\n    # Create the task head (e.g. classification, matching, machine reading comprehension). Each head has required/optional task-specific arguments. Note that heads are decoupled from readers: a head is valid as long as the dataset-side fields it depends on can be provided by the reader\n    cls_head = palm.head.Classify(4, 1024, 0.1)\n\n    # Create a trainer from the reader and the task head. A trainer represents one training task: it maintains the training state and key task information, performs validity checks, and controls model saving/loading for the task\n    trainer = palm.Trainer('senti_cls')\n\n    # match4mrqa.reuse_head_with(mrc4mrqa)\n\n    # data_vars = cls_reader.build()\n    # output_vars = ernie.build(data_vars)\n    # cls_head.build({'backbone': output_vars, 'reader': data_vars})\n\n    loss_var = trainer.build_forward(ernie, cls_head)\n\n    # controller.build_forward()\n    # Error! a head/backbone can only be built once! Try NOT to call build_forward twice on any Trainer!\n\n    # n_steps = cls_reader.num_examples * num_epochs // batch_size\n    # warmup_steps = int(0.1 * n_steps)\n    # print(warmup_steps)\n    # sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)\n    sched = None\n\n    adam = palm.optimizer.Adam(loss_var, lr, sched)\n\n    trainer.build_backward(optimizer=adam, weight_decay=0.001)\n\n    # trainer.random_init_params()\n    trainer.load_pretrain('pretrain/ernie/params')\n\n    # trainer.train(iterator_fn, print_steps=1, save_steps=5, save_path='outputs', save_type='ckpt,predict')\n    trainer.fit_reader(cls_reader)\n    trainer.train(print_steps=1)\n    # trainer.save()\n\n    print('prepare to predict...')\n    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='pred')\n    cls_pred_head = palm.head.Classify(4, 1024, phase='pred')\n    trainer.build_predict_forward(pred_ernie, cls_pred_head)\n\n    predict_cls_reader.load_data(predict_file, 8)\n    print(predict_cls_reader.num_examples)\n    predict_cls_reader.register_with(pred_ernie)\n    trainer.fit_reader(predict_cls_reader, phase='predict')\n    print('predicting..')\n    trainer.predict(print_steps=20)\n\n    # controller = palm.Controller([mrqa, match4mrqa, mlm4mrqa])\n\n    # loss = controller.build_forward(bb, mask_task=[])\n\n    # n_steps = controller.estimate_train_steps(basetask=mrqa, num_epochs=2, batch_size=8, dev_count=4)\n    # adam = palm.optimizer.Adam(loss)\n    # sched = palm.schedualer.LinearWarmup(learning_rate, max_train_steps=n_steps, warmup_steps=0.1*n_steps)\n    #\n    # controller.build_backward(optimizer=adam, schedualer=sched, weight_decay=0.001, use_ema=True, ema_decay=0.999)\n\n    # controller.random_init_params()\n    # controller.load_pretrain('../../pretrain_model/ernie/params')\n    # controller.train()\n\n    # controller = palm.Controller(config='config.yaml', task_dir='tasks', for_train=False)\n    # controller.pred('mrqa', inference_model_dir='output_model/secondrun/mrqa/infer_model')\n"
  },
  {
    "path": "test/test3/run.sh",
    "content": "export CUDA_VISIBLE_DEVICES=3\n\npython run.py\n\n"
  }
]