Repository: PaddlePaddle/PALM
Branch: master
Commit: 2555c0e2a5fa
Files: 98
Total size: 443.8 KB

Directory structure:
gitextract_o6rx2q6_/
├── .gitignore
├── README.md
├── README_zh.md
├── customization_cn.md
├── examples/
│   ├── classification/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate.py
│   │   └── run.py
│   ├── matching/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate.py
│   │   ├── process.py
│   │   └── run.py
│   ├── mrc/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate.py
│   │   └── run.py
│   ├── multi-task/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate_intent.py
│   │   ├── evaluate_slot.py
│   │   ├── joint_predict.py
│   │   ├── predict_intent.py
│   │   ├── predict_slot.py
│   │   ├── process.py
│   │   └── run.py
│   ├── predict/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate.py
│   │   └── run.py
│   ├── tagging/
│   │   ├── README.md
│   │   ├── download.py
│   │   ├── evaluate.py
│   │   └── run.py
│   └── train_with_eval/
│       ├── README.md
│       ├── download.py
│       ├── evaluate.py
│       └── run.py
├── paddlepalm/
│   ├── __init__.py
│   ├── _downloader.py
│   ├── backbone/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── base_backbone.py
│   │   ├── bert.py
│   │   ├── ernie.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── transformer.py
│   ├── distribute/
│   │   ├── __init__.py
│   │   └── reader.py
│   ├── downloader.py
│   ├── head/
│   │   ├── __init__.py
│   │   ├── base_head.py
│   │   ├── cls.py
│   │   ├── match.py
│   │   ├── mlm.py
│   │   ├── mrc.py
│   │   └── ner.py
│   ├── lr_sched/
│   │   ├── __init__.py
│   │   ├── base_schedualer.py
│   │   ├── slanted_triangular_schedualer.py
│   │   └── warmup_schedualer.py
│   ├── multihead_trainer.py
│   ├── optimizer/
│   │   ├── __init__.py
│   │   ├── adam.py
│   │   └── base_optimizer.py
│   ├── reader/
│   │   ├── __init__.py
│   │   ├── base_reader.py
│   │   ├── cls.py
│   │   ├── match.py
│   │   ├── mlm.py
│   │   ├── mrc.py
│   │   ├── seq_label.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       ├── batching4bert.py
│   │       ├── batching4ernie.py
│   │       ├── mlm_batching.py
│   │       ├── mrqa_helper.py
│   │       └── reader4ernie.py
│   ├── tokenizer/
│   │   ├── __init__.py
│   │   ├── bert_tokenizer.py
│   │   └── ernie_tokenizer.py
│   ├── trainer.py
│   └── utils/
│       ├── __init__.py
│       ├── basic_helper.py
│       ├── config_helper.py
│       ├── plot_helper.py
│       ├── print_helper.py
│       ├── reader_helper.py
│       ├── saver.py
│       └── textprocess_helper.py
├── setup.cfg
├── setup.py
└── test/
    ├── test2/
    │   ├── config.yaml
    │   ├── run.py
    │   └── run.sh
    └── test3/
        ├── config.yaml
        ├── run.py
        └── run.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.pyc
paddlepalm.egg-info
data
__pycache__
*egg-info
pretrain_model
pretrain
output*
output_model
build
dist
paddle_palm.egg-info
mrqa_output
*.log

================================================
FILE: README.md
================================================

# PaddlePALM

English | [简体中文](./README_zh.md)

PaddlePALM (PArallel Learning from Multi-tasks) is a fast, flexible, extensible and easy-to-use NLP large-scale pretraining and multi-task learning framework. PaddlePALM is a high-level framework **aimed at rapidly developing high-performance NLP models**.

With PaddlePALM, it is easy to achieve efficient exploration of robust NLP model learning with multiple auxiliary tasks. For example, the robust MRC model [D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET), produced with PaddlePALM, achieved **1st place** in the [EMNLP2019 MRQA](https://mrqa.github.io) track.

*(Figure: MRQA2019 Leaderboard)*

Beyond the research scope, PaddlePALM has been applied in **Baidu Search Engine** to achieve more accurate user query understanding and answer mining, which attests to the high reliability and performance of PaddlePALM.

#### Features:

- **Easy-to-use:** with PALM, *8 steps* are enough to complete a typical NLP task. Moreover, all basic components (e.g., the model backbone, dataset reader, task output head, optimizer...) are decoupled, so any component can be replaced by another candidate with only minor changes to your code.
- **Built-in Popular NLP Backbones and Pre-trained models:** multiple state-of-the-art general-purpose model architectures and pretrained models (e.g., BERT, ERNIE, RoBERTa, ...) are built in.
- **Easy to play Multi-task Learning:** only one API is needed to jointly train several tasks with parameter reuse.
- **Support train/eval with Multi-GPUs:** automatically recognizes and adapts to multi-GPU mode to accelerate training and inference.
- **Pre-training friendly:** self-supervised tasks (e.g., masked language model) are built in to facilitate pre-training. Easy to train from scratch.
- **Easy to Customize:** supports customized development of any component (e.g., backbone, task head, reader and optimizer) with reuse of pre-defined ones, which gives developers high flexibility and efficiency to adapt to diverse NLP scenarios.

You can easily reproduce the following competitive results with little code, covering most NLP tasks such as classification, matching, sequence labeling, reading comprehension and dialogue understanding. More details can be found in `examples`.

| Dataset (split) | Metric | ERNIE Base |
| - | - | - |
| chnsenticorp (test) | accuracy | 95.8 |
| chnsenticorp (test) | f1-score | 95.8 |
| Quora Question Pairs matching (test) | accuracy | 86.2 |
| Quora Question Pairs matching (test) | f1-score | 82.2 |
| MSRA-NER (SIGHAN2006) (test) | f1-score | 99.2 |
| CMRC2018 (dev) | em | 64.3 |
| CMRC2018 (dev) | f1-score | 85.2 |

## Overview

*(Figure: Architecture Diagram)*

PaddlePALM is a well-designed high-level NLP framework. You can efficiently achieve **supervised learning, unsupervised/self-supervised learning, multi-task learning and transfer learning** with little code on top of PaddlePALM. The architecture has three layers, from bottom to top: the component layer, the trainer layer and the high-level trainer layer.

In the component layer, PaddlePALM supplies 6 **decoupled** components for building an NLP task. Each component contains rich pre-defined classes and a `Base` class. The pre-defined classes target typical NLP tasks, while the base class helps users develop a new class (derived from pre-defined ones or from the base itself).

The trainer layer establishes a computation graph with the selected components and performs training and prediction. The training strategy, model saving and loading, and the evaluation and prediction procedures are described in this layer. Note that a trainer can only process one task.

The high-level trainer layer is for complicated learning and inference strategies, e.g., multi-task learning. You can add auxiliary tasks to train robust NLP models (improving test-set and out-of-domain performance), or jointly train multiple related tasks to improve the performance of each task.

| module | illustration |
| - | - |
| **paddlepalm** | an open source NLP pretraining and multitask learning framework, built on paddlepaddle. |
| **paddlepalm.reader** | a collection of elastic task-specific dataset readers. |
| **paddlepalm.backbone** | a collection of classic NLP representation models, e.g., BERT, ERNIE, RoBERTa. |
| **paddlepalm.head** | a collection of task-specific output layers. |
| **paddlepalm.lr_sched** | a collection of learning rate schedulers. |
| **paddlepalm.optimizer** | a collection of optimizers. |
| **paddlepalm.downloader** | a download module for pretrained models with config and vocab files. |
| **paddlepalm.Trainer** | the core unit for running a single-task training/prediction session. A trainer builds the computation graph, manages training and evaluation, and handles model/checkpoint saving and pretrain-model/checkpoint loading. |
| **paddlepalm.MultiHeadTrainer** | the core unit for running a multi-task training/prediction session. A MultiHeadTrainer is built on top of several Trainers. Beyond inheriting Trainer, it additionally achieves model backbone reuse across tasks, trainer sampling for multi-task learning, and multi-head inference for efficient evaluation and prediction. |

## Installation

PaddlePALM supports both python2 and python3, linux and windows, CPU and GPU. The preferred way to install PaddlePALM is via `pip`. Just run the following command in your shell:

```bash
pip install paddlepalm
```

### Installing via source

```shell
git clone https://github.com/PaddlePaddle/PALM.git
cd PALM && python setup.py install
```

### Library Dependencies

- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.7.0 (Please refer to [this](http://www.paddlepaddle.org/#quick-start) to install)

### Downloading pretrained models

We incorporate many pretrained models to initialize model backbone parameters. Training a big NLP model, e.g., a 12-layer transformer, from pretrained parameters is in practice much more effective than training from randomly initialized ones.
To see all the available pretrained models and download one, run the following code in a python interpreter (enter `python` in your shell):

```python
>>> from paddlepalm import downloader
>>> downloader.ls('pretrain')
Available pretrain items:
=> RoBERTa-zh-base
=> RoBERTa-zh-large
=> ERNIE-v2-en-base
=> ERNIE-v2-en-large
=> XLNet-cased-base
=> XLNet-cased-large
=> ERNIE-v1-zh-base
=> ERNIE-v1-zh-base-max-len-512
=> BERT-en-uncased-large-whole-word-masking
=> BERT-en-cased-large-whole-word-masking
=> BERT-en-uncased-base
=> BERT-en-uncased-large
=> BERT-en-cased-base
=> BERT-en-cased-large
=> BERT-multilingual-uncased-base
=> BERT-multilingual-cased-base
=> BERT-zh-base

>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')
...
```

## Usage

#### Quick Start

8 steps to start a typical NLP training task (a condensed code sketch follows the multi-task steps below).

1. use `paddlepalm.reader` to create a *reader* for dataset loading and input feature generation, then call `reader.load_data` to load your training data.
2. use `paddlepalm.backbone` to create a model *backbone* to extract text features (e.g., contextual word embeddings, sentence embeddings).
3. register your *reader* with your *backbone* through the `reader.register_with` method. After this step, your reader is able to yield the input features used by the backbone.
4. use `paddlepalm.head` to create a task output *head*. This head provides the task loss for training and the prediction results for inference.
5. create a task *trainer* with `paddlepalm.Trainer`, then build the forward graph with the backbone and task head (created in steps 2 and 4) through `trainer.build_forward`.
6. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward pass through `trainer.build_backward`.
7. fit the prepared reader and data (from step 1) to the trainer with the `trainer.fit_reader` method.
8. load a pretrained model with `trainer.load_pretrain`, or load a checkpoint with `trainer.load_ckpt`, or do nothing to train from scratch, then start training with `trainer.train`.

For more implementation details, see the following demos:

- [Sentiment Classification](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)
- [Question Pairs matching](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)
- [Named Entity Recognition](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)
- [SQuAD-like Machine Reading Comprehension](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc)

#### Multi-task Learning

To run in multi-task learning mode:

1. repeatedly create the components (i.e., reader, backbone and head) for each task, following steps 1~5 above.
2. create empty trainers (each trainer corresponds to one task) and pass them to create a `MultiHeadTrainer`.
3. build the multi-task forward graph with the `multi_head_trainer.build_forward` method.
4. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward pass through `multi_head_trainer.build_backward`.
5. fit all prepared readers and data to the multi_head_trainer with the `multi_head_trainer.fit_readers` method.
6. load a pretrained model with `multi_head_trainer.load_pretrain`, or load a checkpoint with `multi_head_trainer.load_ckpt`, or do nothing to train from scratch, then start training with `multi_head_trainer.train`.

The save/load and predict operations of a multi_head_trainer are the same as those of a trainer.
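The eight single-task steps above map directly onto code. The following condensed sketch is adapted from `examples/classification/run.py` in this repository; the file paths and hyperparameters are placeholders that you should adjust to your own setup.

```python
import json
import paddlepalm as palm

config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))

# steps 1-3: create a reader, load data, create a backbone, register the reader with it
cls_reader = palm.reader.ClassifyReader('./pretrain/ERNIE-v1-zh-base/vocab.txt', max_seqlen=256)
cls_reader.load_data('./data/train.tsv', batch_size=8, num_epochs=10)
ernie = palm.backbone.ERNIE.from_config(config)
cls_reader.register_with(ernie)

# steps 4-5: create a task head and a trainer, then build the forward graph
cls_head = palm.head.Classify(num_classes=2, input_dim=config['hidden_size'], dropout_prob=0.1)
trainer = palm.Trainer('my_task')
loss_var = trainer.build_forward(ernie, cls_head)

# step 6: create an optimizer (with an optional warmup schedule), build the backward pass
n_steps = cls_reader.num_examples * 10 // 8
sched = palm.lr_sched.TriangularSchedualer(int(0.1 * n_steps), n_steps)
adam = palm.optimizer.Adam(loss_var, 5e-5, sched)
trainer.build_backward(optimizer=adam, weight_decay=0.01)

# steps 7-8: bind the reader to the trainer, load pretrained parameters, and train
trainer.fit_reader(cls_reader)
trainer.load_pretrain('./pretrain/ERNIE-v1-zh-base/params')
trainer.train(print_steps=20)
```

Prediction then reuses the same trainer: create `phase='predict'` instances of the reader, backbone and head, then call `trainer.build_predict_forward`, `trainer.load_ckpt`, `trainer.fit_reader(..., phase='predict')` and `trainer.predict`, as the demos below show.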
For more implementation details with `multi_head_trainer`, see

- [ATIS: joint training of dialogue intent recognition and slot filling](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)

#### Save models

To save models/checkpoints and logs during training, just call the `trainer.set_saver` method. For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples).

#### Evaluation/Inference

To predict/evaluate after a training stage, just create three new instances (reader, backbone and head) with `phase='predict'` (repeating steps 1~4 above), then run the `predict` method of the trainer (no need to create another trainer). For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict).

If you want to evaluate during the training process, use `trainer.train_one_step()` instead of `trainer.train()`. `trainer.train_one_step(batch)` trains a single step, so you can insert evaluation code at any point of the training process. The `batch` argument can be fetched from `trainer.get_one_batch`.

PaddlePALM also supports multi-head inference; see `examples/multi-task/joint_predict.py`.

#### Play with Multiple GPUs

If there are multiple GPUs in your environment, you can control the number and indices of the GPUs used through the environment variable [CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/). For example, if there are 4 GPUs in your environment, indexed 0,1,2,3, you can run on GPU2 only with the following command:

```shell
CUDA_VISIBLE_DEVICES=2 python run.py
```

Multiple GPUs should be separated with `,`. For example, to run with GPU2 and GPU3:

```shell
CUDA_VISIBLE_DEVICES=2,3 python run.py
```

In multi-GPU mode, PaddlePALM automatically splits each batch onto the available cards. For example, if `batch_size` is set to 64 and there are 4 cards visible to PaddlePALM, then the batch_size on each card is actually 64/4=16. Therefore, when running with multiple cards, **you need to ensure that the configured batch_size is divisible by the number of cards.**

## License

This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).

================================================
FILE: README_zh.md
================================================

# PaddlePALM

[English](./README.md) | 简体中文

PaddlePALM (PArallel Learning from Multi-tasks) is a flexible, general and easy-to-use framework for large-scale NLP pretraining and multi-task learning. PALM is a high-level framework aimed at **rapidly developing high-performance NLP models**. With PaddlePALM, one can easily and flexibly explore "highly robust" reading comprehension models trained with multiple auxiliary tasks; [D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET), a model trained with PALM, won first place in the [EMNLP2019 MRQA evaluation](https://mrqa.github.io/).

*(Figure: MRQA2019 Leaderboard)*

Beyond reducing the cost of NLP research, PaddlePALM has been applied in **Baidu Search Engine**, effectively improving the accuracy of user query understanding and the quality of mined answers, with high reliability and high training/inference performance.

#### Features:

- **Easy to use:** with PALM, *8 steps* are enough to implement a typical NLP task. Moreover, the model backbone, dataset reader and task output head are decoupled, so any component can be replaced by another candidate with only minor code changes.
- **Multi-task learning support:** *6 steps* to implement a multi-task learning task.
- **Large-scale tasks and pre-training support:** automatically exploits multiple GPUs to accelerate training and inference. Distributed training on clusters requires little extra code.
- **Popular NLP backbones and pre-trained models:** multiple state-of-the-art general-purpose model architectures and pretrained models (e.g., BERT, ERNIE, RoBERTa) are built in.
- **Easy to customize:** supports customized development of any component (e.g., backbone, task head, reader and optimizer) with reuse of pre-defined ones, which gives developers high flexibility and efficiency to adapt to different NLP scenarios.

You can easily reproduce competitive results with little code, covering most NLP tasks such as classification, matching, sequence labeling, reading comprehension and dialogue understanding. More details can be found in `examples`.

| Dataset (split) | Metric | ERNIE Base |
| - | - | - |
| chnsenticorp (test) | accuracy | 95.8 |
| chnsenticorp (test) | f1-score | 95.8 |
| Quora Question Pairs matching (test) | accuracy | 86.2 |
| Quora Question Pairs matching (test) | f1-score | 82.2 |
| MSRA-NER (SIGHAN2006) (test) | f1-score | 99.2 |
| CMRC2018 (dev) | em | 64.3 |
| CMRC2018 (dev) | f1-score | 85.2 |

## Package Overview

*(Figure: PALM Architecture Diagram)*

PaddlePALM is a well-designed high-level NLP framework. With little code on top of PaddlePALM, you can efficiently achieve **supervised learning, unsupervised/self-supervised learning, multi-task learning and transfer learning**. The PaddlePALM architecture has three layers, from bottom to top: the component layer, the trainer layer and the high-level trainer layer.

In the component layer, PaddlePALM provides 6 **decoupled** components for implementing NLP tasks. Each component contains rich pre-defined classes and a base class. The pre-defined classes target typical NLP tasks, while the base class helps users develop new classes (based on the pre-defined ones or the base class itself).

The trainer layer builds a computation graph with the selected components and performs training and prediction. This layer describes the training strategy, model saving and loading, and the evaluation and prediction procedures. A trainer can only process one task.

The high-level trainer layer is for complicated learning and inference strategies, such as multi-task learning. You can add auxiliary tasks to train robust NLP models (improving test-set and out-of-domain performance), or jointly train multiple related tasks to achieve higher performance for each task.

| module | description |
| - | - |
| **paddlepalm** | a high-level NLP pretraining and multi-task learning framework built on PaddlePaddle. |
| **paddlepalm.reader** | built-in dataset reading and preprocessing tools. |
| **paddlepalm.backbone** | built-in backbone networks, such as BERT, ERNIE and RoBERTa. |
| **paddlepalm.head** | built-in task output layers. |
| **paddlepalm.lr_sched** | built-in learning rate scheduling strategies. |
| **paddlepalm.optimizer** | built-in optimizers. |
| **paddlepalm.downloader** | pretrained model management and download module. |
| **paddlepalm.Trainer** | the single-task training/prediction unit. A trainer builds the computation graph, manages training and evaluation, and handles model/checkpoint saving and pretrain-model/checkpoint loading. |
| **paddlepalm.MultiHeadTrainer** | the module for multi-task training/prediction. A MultiHeadTrainer is built on top of several Trainers. It implements backbone reuse across tasks, multi-task learning and multi-task inference. |

## Installation

PaddlePALM supports python2 and python3, linux and windows, CPU and GPU. The preferred way to install PaddlePALM is via `pip`. Just run the following command:

```bash
pip install paddlepalm
```

### Installing via source

```shell
git clone https://github.com/PaddlePaddle/PALM.git
cd PALM && python setup.py install
```

### Library Dependencies

- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.7.0 (please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start))

### Downloading pretrained models

We provide many pretrained models to initialize the model backbone parameters. Training a big NLP model, such as a 12-layer Transformer, from pretrained parameters is in practice much more effective than training from randomly initialized ones. To see all the available pretrained models and download one, run the following code in a python interpreter (enter `python` in your shell):

```python
>>> from paddlepalm import downloader
>>> downloader.ls('pretrain')
Available pretrain items:
=> RoBERTa-zh-base
=> RoBERTa-zh-large
=> ERNIE-v2-en-base
=> ERNIE-v2-en-large
=> XLNet-cased-base
=> XLNet-cased-large
=> ERNIE-v1-zh-base
=> ERNIE-v1-zh-base-max-len-512
=> BERT-en-uncased-large-whole-word-masking
=> BERT-en-cased-large-whole-word-masking
=> BERT-en-uncased-base
=> BERT-en-uncased-large
=> BERT-en-cased-base
=> BERT-en-cased-large
=> BERT-multilingual-uncased-base
=> BERT-multilingual-cased-base
=> BERT-zh-base

>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')
...
```

## Usage

#### Quick Start

8 steps to start a typical NLP training task.

1. use `paddlepalm.reader` to create a *reader* for dataset loading and input feature generation, then call `reader.load_data` to load the training data.
2. use `paddlepalm.backbone` to create a model *backbone* to extract text features (e.g., contextual word embeddings, sentence embeddings).
3. register the *reader* with the backbone through `reader.register_with`. After this step, the reader is able to yield the input features used by the backbone.
4. use `paddlepalm.head` to create a task *head*, which provides the task loss for training and the prediction results for inference.
5. use `paddlepalm.Trainer` to create a task *trainer*, then build the forward graph containing the backbone and the task head (created in steps 2 and 4) through `trainer.build_forward`.
6. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward pass through `trainer.build_backward`.
7. use `trainer.fit_reader` to feed the prepared reader and data (from step 1) to the trainer.
8. load a pretrained model with `trainer.load_pretrain`, or load a checkpoint with `trainer.load_ckpt`, or load nothing to train from scratch, then train with `trainer.train`.

For more implementation details, see the examples:

- [Sentiment Analysis](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)
- [Quora Question Pairs matching](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)
- [Named Entity Recognition](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)
- [SQuAD-like Machine Reading Comprehension](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc)

#### Multi-task Learning

To run in multi-task learning mode:

1. repeatedly create the components (following steps 1~5 above for each task).
2. create empty `Trainer`s (each `Trainer` corresponds to one task) and use them to create a `MultiHeadTrainer`.
3. build the multi-task forward graph with `multi_head_trainer.build_forward`.
4. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward pass through `multi_head_trainer.build_backward`.
5. use `multi_head_trainer.fit_readers` to feed all the prepared readers and data to the `multi_head_trainer`.
6. load a pretrained model with `multi_head_trainer.load_pretrain`, or load a checkpoint with `multi_head_trainer.load_ckpt`, or load nothing to train from scratch, then train with `multi_head_trainer.train`.

The save/load and predict operations of a multi_head_trainer are the same as those of a trainer.

For more implementation details of `multi_head_trainer`, see

- [ATIS: joint training of dialogue intent recognition and slot filling](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)

#### Setting a saver

To save models/checkpoints and logs during training, call the `trainer.set_saver` method. For more implementation details, see [here](https://github.com/PaddlePaddle/PALM/tree/master/examples).

#### Evaluation/Prediction

To predict and evaluate after training, just create additional reader, backbone and head instances (repeating steps 1~4 above), setting `phase='predict'` at creation. Then use the trainer's `predict` method for prediction (no extra trainer is needed). For more implementation details, see [here](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict).

#### Using multiple GPUs

If there are multiple GPUs in your environment, you can control the number and indices of the GPUs used through the environment variable [CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/). For example, if there are 4 GPUs in your environment, indexed 0, 1, 2, 3, you can run the following command to use GPU2 only:

```shell
CUDA_VISIBLE_DEVICES=2 python run.py
```

Multiple GPUs should be separated with `,`. For example, to use GPU2 and GPU3:

```shell
CUDA_VISIBLE_DEVICES=2,3 python run.py
```

In multi-GPU mode, PaddlePALM automatically distributes each batch across the available GPUs. For example, if `batch_size` is set to 64 and 4 GPUs are available to PaddlePALM, then the batch_size on each GPU is actually 64/4=16. Therefore, **when using multiple GPUs, you need to ensure that the batch_size is divisible by the number of GPUs exposed to PALM**.

## License

This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).

================================================
FILE: customization_cn.md
================================================

# PALM Component Customization Tutorial

PALM supports customization of the following components:

- **head** Define a new task output head, which takes input from the backbone and the reader and outputs the loss in the training phase and predictions in the prediction phase. Examples: a classification head, a sequence labeling head, a machine reading comprehension head.
- **backbone** Define a new backbone network, which takes text-related sequence features from the reader (e.g., token ids) and outputs vector representations of the text (e.g., word embeddings, contextualized word representations, sentence embeddings). Examples: a BERT encoder, a CNN encoder.
- **reader** Define a new dataset loading and preprocessing module, which takes raw dataset files as input (plain text, raw labels, etc.) and outputs text-related sequence features (e.g., token ids, position ids). Examples: a text classification dataset module; a text matching dataset module.
- **optimizer** Define a new optimizer.
- **lr_sched** Define a new learning rate scheduling strategy.

Every component in PALM is described by a class, so internal state (member variables) is allowed. To add a new component of some type, you only need to implement the methods described in the interface class located in that component type's directory. If the new component is similar to a built-in one, you can inherit from the built-in component and only override the methods that need to change.

### Customizing a head

The interface class of head is located at `paddlepalm/head/base_head.py`. It is defined as follows:

```python
# -*- coding: UTF-8 -*-
#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import json
import copy


class Head(object):

    def __init__(self, phase='train'):
        """Construct a task head. The constructor takes at least a phase argument.
        Note: an implementation of this constructor must call the base class
        constructor to create the necessary framework built-in member variables.

        Args:
            phase: str. The running phase of the task head; currently the training
                phase 'train' and the prediction phase 'predict' are supported.
        """
        self._stop_gradient = {}
        self._phase = phase
        self._prog = None
        self._results_buffer = []

    @property
    def inputs_attrs(self):
        """Declaration of the step-level inputs of the task head.

        Describes the outputs of the reader, the backbone and other task heads that
        this head depends on (fetched once per step), as a dict whose keys are the
        components producing the outputs (e.g., 'reader', 'backbone') and whose
        values are the sets of outputs this head needs from those components. Each
        output set is a dict mapping an output name (which must exist in the output
        set of the corresponding component) to its shape and dtype. When a dimension
        of an output has variable length, set that dimension of the shape to -1.

        Return:
            dict. The step-level inputs this head depends on, i.e., the outputs of
            the individual components.
        """
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Declaration of the step-level outputs of the task head.

        Describes the outputs of this head (produced once per step), including the
        name, shape and dtype of each. The outputs are added to the fetch_list, so
        their runtime values are available at every training/inference step and can
        be passed to the batch_postprocess method for per-step postprocessing. When
        an output is of a scalar type (e.g., str, int, float), set its shape to an
        empty list []; when a dimension of an output has variable length, set that
        dimension of the shape to -1.

        Return:
            dict. The outputs produced by this head. Note that in the training phase
            an output named loss must be included.
        """
        raise NotImplementedError()

    @property
    def epoch_inputs_attrs(self):
        """Declaration of the epoch-level inputs of the task head.

        Describes the outputs of the reader, the backbone and other task heads that
        this task depends on (produced once at the end of each epoch), e.g., the
        complete set of examples or the number of valid examples. The dict
        conventions are the same as for inputs_attrs; variable-length dimensions are
        set to -1.

        Return:
            dict. The epoch-level inputs this head depends on.
        """
        return {}

    def build(self, inputs, scope_name=""):
        """Build the computation graph of the task head.

        Maps the static-graph Variables coming from the individual components
        (conforming to inputs_attrs) to static-graph Variable outputs conforming to
        outputs_attr.

        Args:
            inputs: dict. Maps the object names in inputs_attrs to computation-graph
                Variables; inputs contains at least the objects defined in
                inputs_attrs.

        Return:
            The graph variables to output. They are added to the fetch_list, so
            their runtime values are available at every training/inference step and
            are passed to the postprocess methods for the user to handle.
        """
        raise NotImplementedError()

    def batch_postprocess(self, rt_outputs):
        """Batch/step-level postprocessing.

        Called after each training or inference step with the runtime values of this
        head's outputs for the current batch. By default, the results are stored in
        the buffer self._results_buffer."""
        if isinstance(rt_outputs, dict):
            keys = rt_outputs.keys()
            vals = [rt_outputs[k] for k in keys]
            lens = [len(v) for v in vals]
            if len(set(lens)) == 1:
                results = [dict(zip(*[keys, i])) for i in zip(*vals)]
                self._results_buffer.extend(results)
                return results
            else:
                print('WARNING: irregular output results. visualize failed.')
                self._results_buffer.append(rt_outputs)
        return None

    def reset(self):
        """Clear this head's buffer of results accumulated during training or inference."""
        self._results_buffer = []

    def get_results(self):
        """Return the results accumulated by this head so far."""
        return copy.deepcopy(self._results_buffer)

    def epoch_postprocess(self, post_inputs=None, output_dir=None):
        """Epoch-level postprocessing.

        Called at the end of each training or inference epoch to postprocess the
        accumulated per-example results. By default, when output_dir is None the
        results are printed to the screen; when output_dir is given, the results are
        stored in that folder, with the phase of the head as the file name.

        Args:
            post_inputs: when the declared epoch_inputs_attrs is not empty, this
                argument carries the contents of the corresponding inputs.
            output_dir: the path where the accumulated results are saved.
        """
        if output_dir is None:
            for i in self._results_buffer:
                print(i)
        else:
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            with open(os.path.join(output_dir, self._phase), 'w') as writer:
                for i in self._results_buffer:
                    writer.write(json.dumps(i) + '\n')
```

On top of the base class, a brand-new Head must implement at least the following methods:

- \_\_init\_\_
- inputs_attrs
- outputs_attr
- build

The following methods can be overridden:

- epoch_inputs_attrs
- batch_postprocess
- epoch_postprocess
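To make the contract concrete, here is a minimal sketch of a custom head built on the interface above. The whole class is illustrative rather than a built-in PALM component: the head name, the regression loss, and the assumption that the backbone exposes a `sentence_emb` output (the name used in the backbone interface example in the next section) are all hypothetical; the `fluid` layers come from PaddlePaddle 1.x, on which PALM is built.

```python
import paddle.fluid as fluid

from paddlepalm.head.base_head import Head


class ScalarRegressionHead(Head):
    """A hypothetical head that regresses a single score per example."""

    def __init__(self, input_dim, phase='train'):
        # call the base constructor to create the built-in member variables
        super(ScalarRegressionHead, self).__init__(phase)
        self._input_dim = input_dim

    @property
    def inputs_attrs(self):
        # consume a float label from the reader (train only) and the
        # sentence embedding produced by the backbone
        reader = {'label_ids': ([-1], 'float32')} if self._phase == 'train' else {}
        return {'reader': reader,
                'backbone': {'sentence_emb': ([-1, self._input_dim], 'float32')}}

    @property
    def outputs_attr(self):
        if self._phase == 'train':
            return {'loss': ([1], 'float32')}
        return {'score': ([-1], 'float32')}

    def build(self, inputs, scope_name=''):
        sent_emb = inputs['backbone']['sentence_emb']
        score = fluid.layers.fc(input=sent_emb, size=1)
        score = fluid.layers.reshape(score, shape=[-1])
        if self._phase == 'train':
            label = inputs['reader']['label_ids']
            # mean squared error as the training loss
            loss = fluid.layers.reduce_mean(fluid.layers.square(score - label))
            return {'loss': loss}
        return {'score': score}
```

Note that a head only declares what it needs in `inputs_attrs`; the framework wires the declared reader and backbone outputs into `build`, so this head works with any backbone that produces a compatible `sentence_emb`.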
### Customizing a backbone

The interface class of backbone is located at `paddlepalm/backbone/base_backbone.py`. It is defined as follows:

```python
# -*- coding: UTF-8 -*-
#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


class Backbone(object):
    """interface of backbone model."""

    def __init__(self, phase):
        """Construct a backbone network. The constructor takes at least a phase
        argument.
        Note: an implementation of this constructor must call the base class
        constructor to create the necessary framework built-in member variables.

        Args:
            phase: str. The running phase of the backbone; currently the training
                phase 'train' and the prediction phase 'predict' are supported.
        """

    @property
    def inputs_attr(self):
        """Describes the inputs the backbone needs from the reader, including the
        name, shape and dtype of each. When an object is of a scalar type (e.g.,
        str, int, float), set its shape to an empty list []; when a dimension has
        variable length, set that dimension of the shape to -1.

        Return:
            dict. The attributes of each input. For example, for text classification
            and matching tasks, the reader objects a BERT backbone depends on mainly
            include:
                {"token_ids": ([-1, max_len], 'int64'),
                 "input_ids": ([-1, max_len], 'int64'),
                 "segment_ids": ([-1, max_len], 'int64'),
                 "input_mask": ([-1, max_len], 'float32')}"""
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Describes the outputs of the backbone, including the name, shape and
        dtype of each. When an object is of a scalar type (e.g., str, int, float),
        set its shape to an empty list []; when a dimension has variable length, set
        that dimension of the shape to -1.

        Return:
            dict. The attributes of each output. For example, for text
            classification and matching tasks, the outputs of a BERT backbone may
            include:
                {"word_emb": ([-1, max_seqlen, word_emb_size], 'float32'),
                 "sentence_emb": ([-1, hidden_size], 'float32'),
                 "sim_vec": ([-1, hidden_size], 'float32')}"""
        raise NotImplementedError()

    def build(self, inputs):
        """Build the computation graph of the backbone. Maps the static-graph
        Variable inputs conforming to inputs_attr to static-graph Variable outputs
        conforming to outputs_attr.

        Args:
            inputs: dict. Maps the object names in inputs_attr to computation-graph
                Variables; inputs contains at least the objects defined in
                inputs_attr.

        Return:
            The graph variables to output. They are added to the fetch_list, so
            their runtime values are available at every training/inference step and
            are passed to the postprocess methods for the user to handle.
        """
        raise NotImplementedError()
```

On top of the base class, a brand-new Backbone must implement at least the following methods:

- \_\_init\_\_
- inputs_attr
- outputs_attr
- build
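Symmetrically, here is a minimal sketch of a custom backbone under the same contract: a hypothetical bag-of-words encoder. The class and its parameters (`vocab_size`, `emb_size`, `max_seqlen`) are illustrative assumptions, not part of PALM.

```python
import paddle.fluid as fluid

from paddlepalm.backbone.base_backbone import Backbone


class BOWEncoder(Backbone):
    """A hypothetical bag-of-words backbone: embed tokens, then sum them."""

    def __init__(self, vocab_size, emb_size, max_seqlen, phase='train'):
        super(BOWEncoder, self).__init__(phase)
        self._vocab_size = vocab_size
        self._emb_size = emb_size
        self._max_seqlen = max_seqlen

    @property
    def inputs_attr(self):
        return {'token_ids': ([-1, self._max_seqlen], 'int64')}

    @property
    def outputs_attr(self):
        return {'word_emb': ([-1, self._max_seqlen, self._emb_size], 'float32'),
                'sentence_emb': ([-1, self._emb_size], 'float32')}

    def build(self, inputs):
        emb = fluid.embedding(input=inputs['token_ids'],
                              size=[self._vocab_size, self._emb_size])
        # represent a sentence as the sum of its word vectors
        sent_emb = fluid.layers.reduce_sum(emb, dim=1)
        return {'word_emb': emb, 'sentence_emb': sent_emb}
```

A reader registered with this backbone through `reader.register_with` would then be required to yield the declared `token_ids` input.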
### Customizing a reader

The interface class of reader is located at `paddlepalm/reader/base_reader.py`. It is defined as follows:

```python
# -*- coding: UTF-8 -*-
#   Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from copy import copy


class Reader(object):
    """interface of data reader."""

    def __init__(self, phase='train'):
        """Construct a Reader. The constructor takes at least a phase argument.
        Note: an implementation of this constructor must call the base class
        constructor to create the necessary framework built-in member variables.

        Args:
            phase: str. The running phase of the reader; currently the training
                phase 'train' and the prediction phase 'predict' are supported.
        """
        self._phase = phase
        self._batch_size = None
        self._num_epochs = 1
        self._register = set()
        self._registered_backbone = None

    @classmethod
    def create_register(self):
        return set()

    def clone(self, phase='train'):
        """Copy and return a new reader object."""
        if phase == self._phase:
            return copy(self)
        else:
            ret = copy(self)
            ret._phase = phase
            return ret

    def require_attr(self, attr_name):
        """Add an object to be produced to the register.

        Args:
            attr_name: the name of the object to produce, e.g., 'segment_ids'.
        """
        self._register.add(attr_name)

    def register_with(self, backbone):
        """Register each input object the given backbone depends on.

        Args:
            backbone: the backbone to connect with.
        """
        for attr in backbone.inputs_attr:
            self.require_attr(attr)
        self._registered_backbone = backbone

    def get_registered_backbone(self):
        """Return the backbone registered with this reader."""
        return self._registered_backbone

    def _get_registed_attrs(self, attrs):
        ret = {}
        for i in self._register:
            if i not in attrs:
                raise NotImplementedError('output attr {} is not found in this reader.'.format(i))
            ret[i] = attrs[i]
        return ret

    def load_data(self, input_file, batch_size, num_epochs=None, \
                  file_format='tsv', shuffle_train=True):
        """Load the data on disk into the reader.
        Note: implementations of this method need to set self._batch_size and
        self._num_epochs accordingly.

        Args:
            input_file: path of the dataset file, in the format required by the
                `file_format` argument.
            batch_size: the number of examples yielded per iteration. Note: when
                there are multiple GPUs in the environment, batch_size needs to be
                divisible by the number of GPUs.
            num_epochs: number of passes over the dataset. Defaults to None, which
                means one pass in single-task mode; in multi-task mode this argument
                is set automatically by the upper Trainer. Only effective in the
                training phase.
            file_format: format of the input file. Currently supported: tsv.
                Defaults to tsv.
            shuffle_train: whether to shuffle the training examples. Defaults to
                True. Only effective in the training phase.
        """
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Describes the outputs (the yielded objects) of the reader, including the
        name, shape and dtype of each. When an object is of a scalar type (e.g.,
        str, int, float), set its shape to an empty list []; when a dimension has
        variable length, set that dimension of the shape to -1.
        Note: when using mini-batch gradient descent, regular input objects should
        have a batch_size dimension (usually -1).

        Return:
            dict. The attributes of each output. For example, for text
            classification and matching tasks, the yielded outputs may include the
            following objects (the downstream backbone and task can access them as
            needed):
                {"token_ids": ([-1, max_len], 'int64'),
                 "input_ids": ([-1, max_len], 'int64'),
                 "segment_ids": ([-1, max_len], 'int64'),
                 "input_mask": ([-1, max_len], 'float32'),
                 "label": ([-1], 'int')}
        """
        raise NotImplementedError()

    def _iterator(self):
        """Dataset iteration interface. Note that when iteration reaches the end of
        the dataset, this interface should automatically reset the pointer, i.e.,
        start a new pass from the beginning of the dataset.

        Yield:
            dict. The outputs of the current step, conforming to outputs_attr.
        """
        raise NotImplementedError()

    def get_epoch_outputs(self):
        """Return the outputs produced after each pass over the dataset."""
        raise NotImplementedError()

    @property
    def num_examples(self):
        """The number of examples in the dataset, i.e., the number of examples
        generated by the iterator per epoch. Note that with strategies that may
        change the number of examples, such as sliding windows, this interface
        should return the actual number of examples at runtime."""
        raise NotImplementedError()

    @property
    def num_epochs(self):
        """Number of passes over the dataset."""
        return self._num_epochs
```

On top of the base class, a brand-new Reader must implement at least the following methods:

- \_\_init\_\_
- outputs_attr
- load_data
- _iterator
- num_examples

The following methods can be overridden:

- get_epoch_outputs

================================================
FILE: examples/classification/README.md
================================================

## Example 1: Classification

This task is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task.

### Step 1: Prepare Pre-trained Model & Dataset

#### Pre-trained Model

The pre-trained model for this task is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api). Make sure you have downloaded the required pre-trained model into the current folder.

#### Dataset

This example demonstrates with [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.

Download the dataset:

```shell
python download.py
```

If everything goes well, a folder named `data/` will be created with all the data files in it.

The dataset file (for training) should have 2 fields, `text_a` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:

```
label	text_a
0	当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
0	XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件!
1	<荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```

### Step 2: Train & Predict

The code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:

```shell
python run.py
```

If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:

```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```

Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16.
If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size is divisible by the number of cards.**

Some logs are shown below:

```
step 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s
step 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s
step 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s
```

After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:

```
{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1}
{"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0}
{"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0}
```

### Step 3: Evaluate

Once you have the predictions, you can run the evaluation script to evaluate the model:

```shell
python evaluate.py
```

The evaluation results are as follows:

```
data num: 1200
accuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578
```

================================================
FILE: examples/classification/download.py
================================================

# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib

URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request


def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")

    URLLIB.urlretrieve(url, src, reporthook=_reporthook)


abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)

tar = tarfile.open(download_path)
tar.extractall(target_dir)
os.remove(download_path)

abs_path = os.path.abspath(__file__)
dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)

for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):
    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)

shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")

================================================
FILE: examples/classification/evaluate.py
================================================

# -*- coding: utf-8 -*-
import json

import numpy as np


def accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()


def pre_recall_f1(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    # recall = TP / (TP + FN)
    tp = np.sum((labels == '1') & (preds == '1'))
    fp = np.sum((labels == '0') & (preds == '1'))
    fn = np.sum((labels == '1') & (preds == '0'))
    r = tp * 1.0 / (tp + fn)
    # precision = TP / (TP + FP)
    p = tp * 1.0 / (tp + fp)
    epsilon = 1e-31
    f1 = 2 * p * r / (p + r + epsilon)
    return p, r, f1


def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
    if eval_phase == 'test':
        data_dir = "./data/test.tsv"
    elif eval_phase == 'dev':
        data_dir = "./data/dev.tsv"
    else:
        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'

    labels = []
    with open(data_dir, "r") as file:
        for line in file:
            line = line.split("\t")
            label = line[0]
            if label == 'label':  # skip the header line
                continue
            labels.append(str(label))

    preds = []
    with open(res_dir, "r") as file:
        for line in file.readlines():
            line = json.loads(line)
            pred = line['label']
            preds.append(str(pred))

    assert len(labels) == len(preds), "prediction result doesn't match to labels"
    print('data num: {}'.format(len(labels)))
    p, r, f1 = pre_recall_f1(preds, labels)
    print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(
        accuracy(preds, labels), p, r, f1))


res_evaluate()

================================================
FILE: examples/classification/run.py
================================================

# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':

    # configs
    max_seqlen = 256
    batch_size = 8
    num_epochs = 10
    lr = 5e-5
    weight_decay = 0.01
    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'

    train_file = './data/train.tsv'
    predict_file = './data/test.tsv'
    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
    input_dim = config['hidden_size']
    num_classes = 2
    dropout_prob = 0.1
    random_seed = 1
    task_name = 'chnsenticorp'

    save_path = './outputs/'
    pred_output = './outputs/predict/'
    save_type = 'ckpt'
    print_steps = 20
    pre_params = './pretrain/ERNIE-v1-zh-base/params'

    # ----------------------- for training -----------------------

    # step 1-1: create readers for training
    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)
    # step 1-2: load the training data
    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)

    # step 2: create a backbone of the model to extract text features
    ernie = palm.backbone.ERNIE.from_config(config)

    # step 3: register the backbone in reader
    cls_reader.register_with(ernie)

    # step 4: create the task output head
    cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob)

    # step 5-1: create a task trainer
    trainer = palm.Trainer(task_name)
    # step 5-2: build forward graph with backbone and task head
    loss_var = trainer.build_forward(ernie, cls_head)

    # step 6-1*: use warmup
    n_steps = cls_reader.num_examples * num_epochs // batch_size
    warmup_steps = int(0.1 * n_steps)
    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: create an optimizer
    adam = palm.optimizer.Adam(loss_var, lr, sched)
    # step 6-3: build backward
    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)

    # step 7: fit prepared reader and data
    trainer.fit_reader(cls_reader)

    # step 8-1*: load pretrained parameters
    trainer.load_pretrain(pre_params)
    # step 8-2*: set saver to save model
    # save_steps = n_steps
    save_steps = 2396
    trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type)
    # step 8-3: start training
    trainer.train(print_steps=print_steps)

    # ----------------------- for prediction -----------------------

    # step 1-1: create readers for prediction
    print('prepare to predict...')
    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
    predict_cls_reader.load_data(predict_file, batch_size)

    # step 2: create a backbone of the model to extract text features
    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')

    # step 3: register the backbone in reader
    predict_cls_reader.register_with(pred_ernie)

    # step 4: create the task output head
    cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')

    # step 5: build forward graph with backbone and task head
    trainer.build_predict_forward(pred_ernie, cls_pred_head)

    # step 6: load checkpoint
    # model_path = './outputs/ckpt.step'+str(save_steps)
    model_path = './outputs/ckpt.step' + str(11980)
    trainer.load_ckpt(model_path)

    # step 7: fit prepared reader and data
    trainer.fit_reader(predict_cls_reader, phase='predict')

    # step 8: predict
    print('predicting..')
    trainer.predict(print_steps=print_steps, output_dir=pred_output)

================================================
FILE: examples/matching/README.md
================================================

## Example 2: Matching

This task is a sentence pair matching task. The following sections detail model preparation, dataset preparation, and how to run the task with PaddlePALM.

### Step 1: Prepare Pre-trained Models & Datasets

#### Download Pre-trained Model

The pre-trained model for this task is [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api). Make sure you have downloaded the required pre-trained model into the current folder.

#### Dataset

This example takes the [Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset as the testbed for matching.

Download the dataset:

```shell
python download.py
```

After the dataset is downloaded, you should convert the data format for training:

```shell
python process.py data/quora_duplicate_questions.tsv data/train.tsv data/test.tsv
```

If everything goes well, a folder named `data/` will be created with all the converted data in it.

The dataset file (for training) should have 3 fields, `text_a`, `text_b` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:

```
text_a	text_b	label
How can the arrangement of corynebacterium xerosis be described?	How would you describe waves?	0
How do you fix a Google Play Store account that isn't working?	What can cause the Google Play store to not open? How are such probelms fixed?	1
Which is the best earphone under 1000?	What are the best earphones under 1k?	1
What are the differences between the Dell Inspiron 3000, 5000, and 7000 series laptops?	"Should I buy an Apple MacBook Pro 15"" or a Dell Inspiron 17 5000 series?"	0
```

### Step 2: Train & Predict

The code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:

```shell
python run.py
```

If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:

```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```

Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size is divisible by the number of cards.**

Some logs are shown below:

```
step 20/49087 (epoch 0), loss: 1.079, speed: 3.48 steps/s
step 40/49087 (epoch 0), loss: 1.251, speed: 5.18 steps/s
step 60/49087 (epoch 0), loss: 1.193, speed: 5.04 steps/s
```

After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder.
Here are some examples of predictions:

```
{"index": 0, "logits": [-0.32688724994659424, -0.8568955063819885], "probs": [0.629485011100769, 0.3705149292945862], "label": 0}
{"index": 1, "logits": [-0.2735646963119507, -0.7983021140098572], "probs": [0.6282548904418945, 0.37174513936042786], "label": 0}
{"index": 2, "logits": [-0.3381381630897522, -0.8614270091056824], "probs": [0.6279165148735046, 0.37208351492881775], "label": 0}
```

### Step 3: Evaluate

Once you have the predictions, you can run the evaluation script to evaluate the model:

```shell
python evaluate.py
```

The evaluation results are as follows:

```
data num: 4300
accuracy: 0.8619, precision: 0.8061, recall: 0.8377, f1: 0.8216
```

================================================
FILE: examples/matching/download.py
================================================

# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import sys
import urllib

URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request


def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")

    URLLIB.urlretrieve(url, src, reporthook=_reporthook)


abs_path = os.path.abspath(__file__)
data_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
    os.makedirs(data_dir)

download_url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
download_path = os.path.join(data_dir, "quora_duplicate_questions.tsv")
download(download_path, download_url)
print(" done!")

================================================
FILE: examples/matching/evaluate.py
================================================

# -*- coding: utf-8 -*-
import json

import numpy as np


def accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()


def pre_recall_f1(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    # recall = TP / (TP + FN)
    tp = np.sum((labels == '1') & (preds == '1'))
    fp = np.sum((labels == '0') & (preds == '1'))
    fn = np.sum((labels == '1') & (preds == '0'))
    r = tp * 1.0 / (tp + fn)
    # precision = TP / (TP + FP)
    p = tp * 1.0 / (tp + fp)
    epsilon = 1e-31
    f1 = 2 * p * r / (p + r + epsilon)
    return p, r, f1


def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
    if eval_phase == 'test':
        data_dir = "./data/test.tsv"
    elif eval_phase == 'dev':
        data_dir = "./data/dev.tsv"
    else:
        assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'

    labels = []
    with open(data_dir, "r") as file:
        for line in file:
            line = line.split("\t")
            label = line[2][:-1]  # third field, trailing newline stripped
            if label == 'label':  # skip the header line
                continue
            labels.append(str(label))

    preds = []
    with open(res_dir, "r") as file:
        for line in file.readlines():
            line = json.loads(line)
            pred = line['label']
            preds.append(str(pred))

    assert len(labels) == len(preds), \
        "prediction result({}) doesn't match to labels({})".format(len(preds), len(labels))
    print('data num: {}'.format(len(labels)))
    p, r, f1 = pre_recall_f1(preds, labels)
    print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(
        accuracy(preds, labels), p, r, f1))


res_evaluate()

================================================
FILE: examples/matching/process.py
================================================

# -*- coding: utf-8 -*-
import sys
import os

if len(sys.argv) != 4:
    exit(0)

data_dir = sys.argv[1]
if not os.path.exists(data_dir):
    print("%s not exists" % data_dir)
    exit(0)

train_dir = sys.argv[2]
train_file = open(train_dir, "w")
train_file.write("text_a\ttext_b\tlabel\n")

test_dir = sys.argv[3]
test_file = open(test_dir, "w")
test_file.write("text_a\ttext_b\tlabel\n")

with open(data_dir, "r") as file:
    cnt = 0
    flag = 0  # set while a record spans multiple lines
    for line in file:
        line = line.strip("\n")
        line_t = line.split("\t")
        if len(line_t) < 6:
            # continuation of a multi-line record
            if flag:
                flag = 0
                out_line = "{}{}\n".format(out_line, line)
            else:
                flag = 1
                out_line = "{}".format(line)
            continue
        else:
            out_line = "{}\t{}\t{}\n".format(line_t[3], line_t[4], line_t[5])
            cnt += 1
        if 2 <= cnt <= 4301:
            test_file.write(out_line)
        elif cnt <= 104301:
            train_file.write(out_line)

train_file.close()
test_file.close()

================================================
FILE: examples/matching/run.py
================================================

# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':

    # configs
    max_seqlen = 128
    batch_size = 16
    num_epochs = 3
    lr = 3e-5
    weight_decay = 0.0
    num_classes = 2
    random_seed = 1
    dropout_prob = 0.1
    save_path = './outputs/'
    save_type = 'ckpt'
    pred_model_path = './outputs/ckpt.step' + str(18732)
    print_steps = 50
    pred_output = './outputs/predict/'
    pre_params = './pretrain/ERNIE-v2-en-base/params'

    task_name = 'Quora Question Pairs matching'
    vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
    train_file = './data/train.tsv'
    predict_file = './data/test.tsv'
    config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
    input_dim = config['hidden_size']

    # ----------------------- for training -----------------------

    # step 1-1: create readers for training
    match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed)
    # step 1-2: load the training data
    match_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)

    # step 2: create a backbone of the model to extract text features
    ernie = palm.backbone.ERNIE.from_config(config)

    # step 3: register the backbone in reader
    match_reader.register_with(ernie)

    # step 4: create the task output head
    match_head = palm.head.Match(num_classes, input_dim, dropout_prob)

    # step 5-1: create a task trainer
    trainer = palm.Trainer(task_name)
    # step 5-2: build forward graph with backbone and task head
    loss_var = trainer.build_forward(ernie, match_head)

    # step 6-1*: use warmup
    n_steps = match_reader.num_examples * num_epochs // batch_size
    warmup_steps = int(0.1 * n_steps)
    print('total_steps: {}'.format(n_steps))
    print('warmup_steps: {}'.format(warmup_steps))
    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: create an optimizer
    adam = palm.optimizer.Adam(loss_var, lr, sched)
    # step 6-3: build backward
    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)

    # step 7: fit prepared reader and data
    trainer.fit_reader(match_reader)

    # step 8-1*: load pretrained parameters
    trainer.load_pretrain(pre_params, False)
    # step 8-2*: set saver to save model
    # save_steps = n_steps - 16
    save_steps = 6244
    trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
    # step 8-3: start training
    trainer.train(print_steps=print_steps)

    # ----------------------- for prediction -----------------------

    # step 1-1: create readers for prediction
    print('prepare to predict...')
    predict_match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
    predict_match_reader.load_data(predict_file, batch_size)

    # step 2: create a backbone of the model to extract text features
    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')

    # step 3: register the backbone in reader
    predict_match_reader.register_with(pred_ernie)

    # step 4: create the task output head
    match_pred_head = palm.head.Match(num_classes, input_dim, phase='predict')

    # step 5: build forward graph with backbone and task head
    trainer.build_predict_forward(pred_ernie, match_pred_head)

    # step 6: load checkpoint
    trainer.load_ckpt(pred_model_path)

    # step 7: fit prepared reader and data
    trainer.fit_reader(predict_match_reader, phase='predict')

    # step 8: predict
    print('predicting..')
    trainer.predict(print_steps=print_steps, output_dir=pred_output)

================================================
FILE: examples/mrc/README.md
================================================

## Example 4: Machine Reading Comprehension

This task is a machine reading comprehension task. The following sections detail model preparation, dataset preparation, and how to run the task.

### Step 1: Prepare Pre-trained Models & Datasets

#### Pre-trained Model

The pre-trained model for this task is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api). Make sure you have downloaded the required pre-trained model into the current folder.

#### Dataset

This task uses the `CMRC2018` dataset. `CMRC2018` is an evaluation held by the Chinese Information Processing Society of China; the task is span-extraction reading comprehension.

Download the dataset:

```shell
python download.py
```

If everything goes well, a folder named `data/` will be created with all the data in it. Here is an example from the data:

```json
"paragraphs": [
    {
        "id": "TRAIN_36",
        "context": "NGC 6231是一个位于天蝎座的疏散星团,天球座标为赤经16时54分,赤纬-41度48分,视觉观测大小约45角分,亮度约2.6视星等,距地球5900光年。NGC 6231年龄约为三百二十万年,是一个非常年轻的星团,星团内的最亮星是5等的天蝎座 ζ1星。用双筒望远镜或小型望远镜就能看到个别的行星。NGC 6231在1654年被意大利天文学家乔瓦尼·巴蒂斯特·霍迪尔纳(Giovanni Battista Hodierna)以Luminosae的名字首次纪录在星表中,但是未见记载于夏尔·梅西耶的天体列表和威廉·赫歇尔的深空天体目录。这个天体在1678年被爱德蒙·哈雷(I.7)、1745年被夏西亚科斯(Jean-Phillippe Loys de Cheseaux)(9)、1751年被尼可拉·路易·拉卡伊(II.13)分别再次独立发现。",
        "qas": [
            {
                "question": "NGC 6231的经纬度是多少?",
                "id": "TRAIN_36_QUERY_0",
                "answers": [
                    {
                        "text": "赤经16时54分,赤纬-41度48分",
                        "answer_start": 27
                    }
                ]
            }
        ]
    }
]
```

### Step 2: Train & Predict

The code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:

```shell
python run.py
```

If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:

```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```

Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size is divisible by the number of cards.**

Some logs are shown below:

```
step 1/1515 (epoch 0), loss: 6.251, speed: 0.31 steps/s
step 2/1515 (epoch 0), loss: 6.206, speed: 0.80 steps/s
step 3/1515 (epoch 0), loss: 6.172, speed: 0.86 steps/s
```

After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder.
Here are some examples of predictions:

```json
{
    "DEV_0_QUERY_0": "光 荣 和 ω-force 开 发",
    "DEV_0_QUERY_1": "任 天 堂 游 戏 谜 之 村 雨 城",
    "DEV_0_QUERY_2": "战 史 演 武 」&「 争 霸 演 武 」。",
    "DEV_1_QUERY_0": "大 陆 传 统 器 乐 及 戏 曲 里 面 常 用 的 打 击 乐 记 谱 方 法 , 以 中 文 字 的 声 音 模 拟 敲 击 乐 的 声 音 , 纪 录 打 击 乐 的 各 种 不 同 的 演 奏 方 法 。",
    "DEV_1_QUERY_1": "「 锣 鼓 点",
    "DEV_1_QUERY_2": "锣 鼓 的 运 用 有 约 定 俗 成 的 程 式 , 依 照 角 色 行 当 的 身 份 、 性 格 、 情 绪 以 及 环 境 , 配 合 相 应 的 锣 鼓 点",
    "DEV_1_QUERY_3": "鼓 、 锣 、 钹 和 板 四 类 型",
    "DEV_2_QUERY_0": "364.6 公 里"
}
```

### Step 3: Evaluate

#### Library Dependencies

Before the evaluation, you need to install `nltk` and download the `punkt` tokenizer for nltk:

```shell
pip install nltk
python -m nltk.downloader punkt
```

#### Evaluate

You can run the evaluation script to evaluate the model:

```shell
python evaluate.py
```

The evaluation results are as follows:

```
data_num: 3219
em_score: 0.6434, f1: 0.8518
```

================================================
FILE: examples/mrc/download.py
================================================

# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib

URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request


def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")

    URLLIB.urlretrieve(url, src, reporthook=_reporthook)


abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)

tar = tarfile.open(download_path)
tar.extractall(target_dir)
os.remove(download_path)

abs_path = os.path.abspath(__file__)
dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)

for file in os.listdir(os.path.join(target_dir, 'task_data', 'cmrc2018')):
    shutil.move(os.path.join(target_dir, 'task_data', 'cmrc2018', file), dst_dir)

shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")

================================================
FILE: examples/mrc/evaluate.py
================================================

# -*- coding: utf-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
''' Evaluation script for CMRC 2018 version: v5 Note: v5 formatted output, add usage description v4 fixed segmentation issues ''' from __future__ import absolute_import from __future__ import division from __future__ import print_function from __future__ import unicode_literals from __future__ import absolute_import from collections import Counter, OrderedDict import string import re import argparse import json import sys import nltk import pdb # split Chinese with English def mixed_segmentation(in_str, rm_punc=False): in_str = in_str.lower().strip() segs_out = [] temp_str = "" sp_char = [ '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(', ')', '-', '~', '『', '』',' ' ] for char in in_str: if rm_punc and char in sp_char: continue if re.search(r'[\u4e00-\u9fa5]', char) or char in sp_char: if temp_str != "": ss = nltk.word_tokenize(temp_str) segs_out.extend(ss) temp_str = "" segs_out.append(char) else: temp_str += char #handling last part if temp_str != "": ss = nltk.word_tokenize(temp_str) segs_out.extend(ss) return segs_out # remove punctuation def remove_punctuation(in_str): in_str = in_str.lower().strip() sp_char = [ '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':', '?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(', ')', '-', '~', '『', '』', ' ' ] out_segs = [] for char in in_str: if char in sp_char: continue else: out_segs.append(char) return ''.join(out_segs) # find longest common string def find_lcs(s1, s2): m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] mmax = 0 p = 0 for i in range(len(s1)): for j in range(len(s2)): if s1[i] == s2[j]: m[i + 1][j + 1] = m[i][j] + 1 if m[i + 1][j + 1] > mmax: mmax = m[i + 1][j + 1] p = i + 1 return s1[p - mmax:p], mmax def evaluate(ground_truth_file, prediction_file): f1 = 0 em = 0 total_count = 0 skip_count = 0 for instances in ground_truth_file["data"]: for instance in instances["paragraphs"]: context_text = instance['context'].strip() for qas in instance['qas']: total_count += 1 query_id = qas['id'].strip() query_text = qas['question'].strip() answers = [ans["text"] for ans in qas["answers"]] if query_id not in prediction_file: print('Unanswered question: {}\n'.format( query_id)) skip_count += 1 continue prediction = prediction_file[query_id] f1 += calc_f1_score(answers, prediction) em += calc_em_score(answers, prediction) f1_score = f1 / total_count em_score = em / total_count return f1_score, em_score, total_count, skip_count def calc_f1_score(answers, prediction): f1_scores = [] for ans in answers: ans_segs = mixed_segmentation(ans, rm_punc=True) prediction_segs = mixed_segmentation(prediction, rm_punc=True) lcs, lcs_len = find_lcs(ans_segs, prediction_segs) if lcs_len == 0: f1_scores.append(0) continue precision = 1.0 * lcs_len / len(prediction_segs) recall = 1.0 * lcs_len / len(ans_segs) f1 = (2 * precision * recall) / (precision + recall) f1_scores.append(f1) return max(f1_scores) def calc_em_score(answers, prediction): em = 0 for ans in answers: ans_ = remove_punctuation(ans) prediction_ = remove_punctuation(prediction) if ans_ == prediction_: em = 1 break return em def eval_file(dataset_file, prediction_file): ground_truth_file = json.load(open(dataset_file, 'r')) prediction_file = json.load(open(prediction_file, 'r')) F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file) AVG = (EM + F1) * 0.5 return EM, F1, AVG, TOTAL if __name__ == '__main__': EM, F1, AVG, TOTAL = 
eval_file("data/dev.json", "outputs/predict/predictions.json") print('data_num: {}'.format(TOTAL)) print('em_sroce: {:.4f}, f1: {:.4f}'.format(EM,F1)) ================================================ FILE: examples/mrc/run.py ================================================ # coding=utf-8 import paddlepalm as palm import json if __name__ == '__main__': # configs max_seqlen = 512 batch_size = 8 num_epochs = 2 lr = 3e-5 doc_stride = 128 max_query_len = 64 max_ans_len = 128 weight_decay = 0.01 print_steps = 20 vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt' do_lower_case = True train_file = './data/train.json' predict_file = './data/dev.json' save_path = './outputs/' pred_output = './outputs/predict/' save_type = 'ckpt' task_name = 'cmrc2018' pre_params = './pretrain/ERNIE-v1-zh-base/params' config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json')) # ----------------------- for training ----------------------- # step 1-1: create readers for training mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case) # step 1-2: load the training data mrc_reader.load_data(train_file, file_format='json', num_epochs=num_epochs, batch_size=batch_size) # step 2: create a backbone of the model to extract text features ernie = palm.backbone.ERNIE.from_config(config) # step 3: register the backbone in reader mrc_reader.register_with(ernie) # step 4: create the task output head mrc_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len) # step 5-1: create a task trainer trainer = palm.Trainer(task_name) # step 5-2: build forward graph with backbone and task head loss_var = trainer.build_forward(ernie, mrc_head) # step 6-1*: use warmup n_steps = mrc_reader.num_examples * num_epochs // batch_size warmup_steps = int(0.1 * n_steps) sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps) # step 6-2: create a optimizer adam = palm.optimizer.Adam(loss_var, lr, sched) # step 6-3: build backward trainer.build_backward(optimizer=adam, weight_decay=weight_decay) # step 7: fit prepared reader and data trainer.fit_reader(mrc_reader) # step 8-1*: load pretrained parameters trainer.load_pretrain(pre_params) # step 8-2*: set saver to save model save_steps = 3040 trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type) # step 8-3: start training trainer.train(print_steps=print_steps) # ----------------------- for prediction ----------------------- # step 1-1: create readers for prediction predict_mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case, phase='predict') # step 1-2: load the training data predict_mrc_reader.load_data(predict_file, batch_size) # step 2: create a backbone of the model to extract text features pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register the backbone in reader predict_mrc_reader.register_with(pred_ernie) # step 4: create the task output head mrc_pred_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len, phase='predict') # step 5: build forward graph with backbone and task head trainer.build_predict_forward(pred_ernie, mrc_pred_head) # step 6: load checkpoint pred_model_path = './outputs/ckpt.step'+str(3040) trainer.load_ckpt(pred_model_path) # step 7: fit prepared reader and data trainer.fit_reader(predict_mrc_reader, phase='predict') # step 8: predict print('predicting..') 
trainer.predict(print_steps=print_steps, output_dir="outputs/predict")

================================================
FILE: examples/multi-task/README.md
================================================

## Example 6: Joint Training of Dialogue Intent Recognition and Slot Filling

This example demonstrates the joint training of dialogue intent recognition and slot filling. Intent recognition can be regarded as a text classification task, and slot filling as a sequence labeling task. Both classification and sequence labeling are built into PaddlePALM.

### Step 1: Prepare Pre-trained Models & Datasets

#### Pre-trained Model

We use [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as our pre-trained model for this example. Make sure you have downloaded `ERNIE` to the current folder.

#### Dataset

Here we use the `Airline Travel Information System` (ATIS) dataset as our testbed.

Download the dataset:

```shell
python download.py
```

After the dataset is downloaded, you should convert the data format for training:

```shell
python process.py
```

If everything goes well, a folder named `data/atis/` will be created with all the data in it. Here are some example entries:

`data/atis/atis_slot/train.tsv`:

```
text_a	label
i want to fly from boston at 838 am and arrive in denver at 1110 in the morning	O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day
what flights are available from pittsburgh to baltimore on thursday morning	O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day
what is the arrival time in san francisco for the 755 am flight leaving washington	O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name
cheapest airfare from tacoma to orlando	B-cost_relative O O B-fromloc.city_name O B-toloc.city_name
```

`data/atis/atis_intent/train.tsv`:

```
label	text_a
0	i want to fly from boston at 838 am and arrive in denver at 1110 in the morning
0	what flights are available from pittsburgh to baltimore on thursday morning
1	what is the arrival time in san francisco for the 755 am flight leaving washington
2	cheapest airfare from tacoma to orlando
```

### Step 2: Train & Predict

The code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:

```shell
python run.py
```

If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:

```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```

Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**

Some logs will be shown below:

```
global step: 5, slot: step 3/309 (epoch 0), loss: 68.965, speed: 0.58 steps/s
global step: 10, intent: step 3/311 (epoch 0), loss: 3.407, speed: 8.76 steps/s
global step: 15, slot: step 12/309 (epoch 0), loss: 54.611, speed: 1.21 steps/s
global step: 20, intent: step 7/311 (epoch 0), loss: 3.487, speed: 10.28 steps/s
```

After the run, you can view the saved models in the `outputs/` folder.
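For reference, the core of the joint setup in `run.py` looks roughly like this (a condensed sketch with configuration values elided; the full script appears later in this repository):

```python
import paddlepalm as palm

# one reader and one task head per task, sharing a single ERNIE backbone
seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map)
cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen)
ernie = palm.backbone.ERNIE.from_config(config)
seq_label_reader.register_with(ernie)
cls_reader.register_with(ernie)

# per-task trainers are wrapped by a MultiHeadTrainer that interleaves them
trainer_seq_label = palm.Trainer("slot", mix_ratio=1.0)
trainer_cls = palm.Trainer("intent", mix_ratio=1.0)
trainer = palm.MultiHeadTrainer([trainer_seq_label, trainer_cls])
trainer_cls.build_forward(ernie, palm.head.Classify(num_classes_intent, input_dim, dropout_prob))
trainer_seq_label.build_forward(ernie, palm.head.SequenceLabel(num_classes, input_dim, dropout_prob))
loss_var = trainer.build_forward()

# feed both readers with mix_ratio-based task sampling, then train as usual
trainer.fit_readers_with_mixratio([seq_label_reader, cls_reader], "slot", num_epochs)
```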
If you want to use the trained model to predict the `atis_slot & atis_intent` data, run: ```shell python predict-slot.py python predict-intent.py ``` If you want to specify a specific gpu or use multiple gpus for predict, please use **`CUDA_VISIBLE_DEVICES`**, for example: ```shell CUDA_VISIBLE_DEVICES=0,1 python predict-slot.py CUDA_VISIBLE_DEVICES=0,1 python predict-intent.py ``` Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.** After the run, you can view the predictions in the `outputs/predict-slot` folder and `outputs/predict-intent` folder. Here are some examples of predictions: `atis_slot`: ``` [129, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 1, 1, 1, 1, 21, 21, 68, 129] [129, 1, 39, 37, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 23, 3, 4, 129, 129, 129, 129, 129] [129, 1, 39, 37, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 129, 129, 129, 129, 129, 129, 129, 129] [129, 1, 1, 1, 1, 1, 1, 14, 15, 1, 2, 1, 5, 19, 1, 39, 37, 129, 24, 129, 129, 129] ``` `atis_intent`: ``` {"index": 0, "logits": [9.938603401184082, -0.3914794623851776, -0.050973162055015564, -1.0229418277740479, 0.04799401015043259, -0.9632213115692139, -0.6427211761474609, -1.337939739227295, -0.7969412803649902, -1.4441455602645874, -0.6339573264122009, -1.0393054485321045, -0.9242327213287354, -1.9637483358383179, 0.16733427345752716, -0.5280354619026184, -1.7195699214935303, -2.199411630630493, -1.2833174467086792, -1.3081035614013672, -1.6036226749420166, -1.8527079820632935, -2.289180040359497, -2.267214775085449, -2.2578916549682617, -2.2010505199432373], "probs": [0.999531626701355, 3.26210938510485e-05, 4.585415081237443e-05, 1.7348344044876285e-05, 5.06243304698728e-05, 1.8415948943584226e-05, 2.5373808966833167e-05, 1.266065828531282e-05, 2.174747896788176e-05, 1.1384962817828637e-05, 2.5597169951652177e-05, 1.7066764485207386e-05, 1.914815220516175e-05, 6.771284006390488e-06, 5.70411684748251e-05, 2.8457265216275118e-05, 8.644025911053177e-06, 5.349628736439627e-06, 1.3371440218179487e-05, 1.3044088518654462e-05, 9.706698619993404e-06, 7.5665011536329985e-06, 4.890325726591982e-06, 4.99892985317274e-06, 5.045753368904116e-06, 5.340866664482746e-06], "label": 0} {"index": 1, "logits": [0.8863624930381775, -2.232290506362915, 8.191509246826172, -0.03161466494202614, -0.9149583578109741, -2.172696352005005, -0.3937145471572876, -0.3954394459724426, 1.5333592891693115, 0.8630291223526001, -0.9684226512908936, -2.722721815109253, -0.0060247331857681274, -0.9865402579307556, 1.6328885555267334, 0.3972966969013214, 0.27919167280197144, -1.4911551475524902, -0.9552251696586609, -0.9169244170188904, -0.810670793056488, -1.5118697881698608, -2.0140435695648193, -1.6299077272415161, -1.8589974641799927, -2.07601261138916], "probs": [0.0006675600307062268, 2.9517297662096098e-05, 0.9932880997657776, 0.0002665741485543549, 0.0001102013120544143, 3.132982965325937e-05, 0.00018559220188762993, 0.00018527248175814748, 0.0012749042361974716, 0.0006521637551486492, 0.00010446414671605453, 1.8075270418194123e-05, 0.0002734838053584099, 0.00010258861584588885, 0.0014083238784223795, 0.00040934717981144786, 0.00036374686169438064, 6.193659646669403e-05, 0.00010585198469925672, 0.00010998480865964666, 
0.0001223145518451929, 6.0666847275570035e-05, 3.671637750812806e-05, 5.391232480178587e-05, 4.287416595616378e-05, 3.4510172554291785e-05], "label": 0} {"index": 2, "logits": [9.789957046508789, -0.1730862706899643, -0.7198237776756287, -1.0460278987884521, 0.23521068692207336, -0.5075851678848267, -0.44724929332733154, -1.2945927381515503, -0.6984466314315796, -1.8749892711639404, -0.4631594121456146, -0.6256799697875977, -1.0252169370651245, -1.951456069946289, -0.17572557926177979, -0.6771697402000427, -1.7992591857910156, -2.1457295417785645, -1.4203097820281982, -1.4963451623916626, -1.692310094833374, -1.9219486713409424, -2.2533645629882812, -2.430952310562134, -2.3094685077667236, -2.2399914264678955], "probs": [0.9994625449180603, 4.708383130491711e-05, 2.725377635215409e-05, 1.9667899323394522e-05, 7.082601223373786e-05, 3.3697724575176835e-05, 3.579350595828146e-05, 1.5339375750045292e-05, 2.784266871458385e-05, 8.58508519741008e-06, 3.522853512549773e-05, 2.9944207199150696e-05, 2.0081495677004568e-05, 7.953084605105687e-06, 4.695970710599795e-05, 2.8441407266655006e-05, 9.26048778637778e-06, 6.548832516273251e-06, 1.3527245755540207e-05, 1.2536826943687629e-05, 1.030578732752474e-05, 8.19125762063777e-06, 5.880556273041293e-06, 4.923717369820224e-06, 5.559719284065068e-06, 5.9597273320832755e-06], "label": 0} {"index": 3, "logits": [9.787659645080566, -0.6223222017288208, -0.03971472755074501, -1.038114070892334, 0.24018540978431702, -0.8904737830162048, -0.7114139795303345, -1.2315020561218262, -0.5120854377746582, -1.4273980855941772, -0.44618460536003113, -1.0241562128067017, -0.9727545380592346, -1.8587366342544556, 0.020689941942691803, -0.6228570342063904, -1.6020199060440063, -2.130260467529297, -1.370570421218872, -1.40530526638031, -1.6782578229904175, -1.94076669216156, -2.2038567066192627, -2.336832284927368, -2.268157720565796, -2.140028953552246], "probs": [0.9994485974311829, 3.0113611501292326e-05, 5.392447565100156e-05, 1.986949791898951e-05, 7.134198676794767e-05, 2.303065048181452e-05, 2.7546762794372626e-05, 1.6375688574044034e-05, 3.362310235388577e-05, 1.3462414244713727e-05, 3.591357381083071e-05, 2.0148761905147694e-05, 2.12115264730528e-05, 8.74570196174318e-06, 5.728216274292208e-05, 3.0097504350123927e-05, 1.1305383850412909e-05, 6.666126409982098e-06, 1.4249604646465741e-05, 1.3763145034317859e-05, 1.0475521776243113e-05, 8.056933438638225e-06, 6.193143690325087e-06, 5.422014055511681e-06, 5.807448815176031e-06, 6.601325367228128e-06], "label": 0} ``` ### Step 3: Evaluate Once you have the prediction, you can run the evaluation script to evaluate the model: ```shell python evaluate-slot.py python evaluate-intent.py ``` The evaluation results are as follows: `atis_slot`: ``` data num: 891 f1: 0.8934 ``` `atis_intent`: ``` data num: 893 accuracy: 0.7088, precision: 1.0000, recall: 1.0000, f1: 1.0000 ``` ================================================ FILE: examples/multi-task/download.py ================================================ # -*- coding: utf-8 -*- from __future__ import print_function import os import tarfile import shutil import sys import urllib URLLIB=urllib if sys.version_info >= (3, 0): import urllib.request URLLIB=urllib.request def download(src, url): def _reporthook(count, chunk_size, total_size): bytes_so_far = count * chunk_size percent = float(bytes_so_far) / float(total_size) if percent > 1: percent = 1 print('\r>> Downloading... 
{:.1%}'.format(percent), end="") URLLIB.urlretrieve(url, src, reporthook=_reporthook) abs_path = os.path.abspath(__file__) download_url = "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz" downlaod_path = os.path.join(os.path.dirname(abs_path), "dmtk_data_1.0.0.tar.gz") target_dir = os.path.dirname(abs_path) download(downlaod_path, download_url) tar = tarfile.open(downlaod_path) tar.extractall(target_dir) os.remove(downlaod_path) shutil.rmtree(os.path.join(target_dir, 'data/dstc2/')) shutil.rmtree(os.path.join(target_dir, 'data/mrda/')) shutil.rmtree(os.path.join(target_dir, 'data/multi-woz/')) shutil.rmtree(os.path.join(target_dir, 'data/swda/')) shutil.rmtree(os.path.join(target_dir, 'data/udc/')) print(" done!") ================================================ FILE: examples/multi-task/evaluate_intent.py ================================================ # -*- coding: utf-8 -*- import json import numpy as np def accuracy(preds, labels): preds = np.array(preds) labels = np.array(labels) return (preds == labels).mean() def pre_recall_f1(preds, labels): preds = np.array(preds) labels = np.array(labels) # recall=TP/(TP+FN) tp = np.sum((labels == '1') & (preds == '1')) fp = np.sum((labels == '0') & (preds == '1')) fn = np.sum((labels == '1') & (preds == '0')) r = tp * 1.0 / (tp + fn) # Precision=TP/(TP+FP) p = tp * 1.0 / (tp + fp) epsilon = 1e-31 f1 = 2 * p * r / (p+r+epsilon) return p, r, f1 def res_evaluate(res_dir="./outputs/predict-intent/predictions.json", eval_phase='test'): if eval_phase == 'test': data_dir="./data/atis/atis_intent/test.tsv" elif eval_phase == 'dev': data_dir="./data/dev.tsv" else: assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test' labels = [] with open(data_dir, "r") as file: first_flag = True for line in file: line = line.split("\t") label = line[0] if label=='label': continue labels.append(str(label)) file.close() preds = [] with open(res_dir, "r") as file: for line in file.readlines(): line = json.loads(line) pred = line['label'] preds.append(str(pred)) file.close() assert len(labels) == len(preds), "prediction result doesn't match to labels" print('data num: {}'.format(len(labels))) p, r, f1 = pre_recall_f1(preds, labels) print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1)) res_evaluate() ================================================ FILE: examples/multi-task/evaluate_slot.py ================================================ # -*- coding: utf-8 -*- import json def load_label_map(map_dir="./data/atis/atis_slot/label_map.json"): """ :param map_dir: dict indictuing chunk type :return: """ return json.load(open(map_dir, "r")) def cal_chunk(pred_label, refer_label): tp = dict() fn = dict() fp = dict() for i in range(len(refer_label)): if refer_label[i] == pred_label[i]: if refer_label[i] not in tp: tp[refer_label[i]] = 0 tp[refer_label[i]] += 1 else: if pred_label[i] not in fp: fp[pred_label[i]] = 0 fp[pred_label[i]] += 1 if refer_label[i] not in fn: fn[refer_label[i]] = 0 fn[refer_label[i]] += 1 tp_total = sum(tp.values()) fn_total = sum(fn.values()) fp_total = sum(fp.values()) p_total = float(tp_total) / (tp_total + fp_total) r_total = float(tp_total) / (tp_total + fn_total) f_micro = 2 * p_total * r_total / (p_total + r_total) return f_micro def res_evaluate(res_dir="./outputs/predict-slot/predictions.json", data_dir="./data/atis/atis_slot/test.tsv"): label_map = load_label_map() total_label = [] with open(data_dir, "r") as file: first_flag = True for line in file: if 
first_flag: first_flag = False continue line = line.strip("\n") if len(line) == 0: continue line = line.split("\t") if len(line) < 2: continue labels = line[1][:-1].split("\x02") total_label.append(labels) total_label = [[label_map[j] for j in i] for i in total_label] total_res = [] with open(res_dir, "r") as file: cnt = 0 for line in file: line = line.strip("\n") if len(line) == 0: continue try: res_arr = json.loads(line) if len(total_label[cnt]) < len(res_arr): total_res.append(res_arr[1: 1 + len(total_label[cnt])]) elif len(total_label[cnt]) == len(res_arr): total_res.append(res_arr) else: total_res.append(res_arr) total_label[cnt] = total_label[cnt][: len(res_arr)] except: print("json format error: {}".format(cnt)) print(line) cnt += 1 total_res_equal = [] total_label_equal = [] assert len(total_label) == len(total_res), "prediction result doesn't match to labels" for i in range(len(total_label)): num = len(total_label[i]) total_label_equal.extend(total_label[i]) total_res[i] = total_res[i][:num] total_res_equal.extend(total_res[i]) f1 = cal_chunk(total_res_equal, total_label_equal) print('data num: {}'.format(len(total_label))) print("f1: {:.4f}".format(f1)) res_evaluate() ================================================ FILE: examples/multi-task/joint_predict.py ================================================ # coding=utf-8 import paddlepalm as palm import json import numpy as np if __name__ == '__main__': # configs max_seqlen = 128 batch_size = 128 num_epochs = 20 print_steps = 5 lr = 2e-5 num_classes = 130 weight_decay = 0.01 num_classes_intent = 26 dropout_prob = 0.1 random_seed = 0 label_map = './data/atis/atis_slot/label_map.json' vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt' train_slot = './data/atis/atis_slot/train.tsv' train_intent = './data/atis/atis_intent/train.tsv' config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json')) input_dim = config['hidden_size'] # ----------------------- for training ----------------------- # step 1-1: create readers slot_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict') intent_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict') # step 1-2: load train data slot_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size) intent_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None) # step 2: create a backbone of the model to extract text features ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register readers with ernie backbone slot_reader.register_with(ernie) intent_reader.register_with(ernie) # step 4: create task output heads slot_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob, phase='predict') intent_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob, phase='predict') # step 5-1: create task trainers and multiHeadTrainer trainer_slot = palm.Trainer("slot", mix_ratio=1.0) trainer_intent = palm.Trainer("intent", mix_ratio=1.0) trainer = palm.MultiHeadTrainer([trainer_slot, trainer_intent]) # # step 5-2: build forward graph with backbone and task head vars = trainer_intent.build_predict_forward(ernie, intent_head) vars = trainer_slot.build_predict_forward(ernie, slot_head) loss_var = trainer.build_predict_forward() # load checkpoint trainer.load_ckpt('outputs/ckpt.step300') # merge inference readers joint_iterator = trainer.merge_inference_readers([slot_reader, intent_reader]) # for test # batch = 
next(joint_iterator('slot')) # results = trainer.predict_one_batch('slot', batch) # batch = next(joint_iterator('intent')) # results = trainer.predict_one_batch('intent', batch) # predict slot filling print('processing slot filling examples...') print('num examples: '+str(slot_reader.num_examples)) cnt = 0 for batch in joint_iterator('slot'): cnt += len(trainer.predict_one_batch('slot', batch)['logits']) if cnt % 1000 <= 128: print(str(cnt)+'th example processed.') print(str(cnt)+'th example processed.') # predict intent recognition print('processing intent recognition examples...') print('num examples: '+str(intent_reader.num_examples)) cnt = 0 for batch in joint_iterator('intent'): cnt += len(trainer.predict_one_batch('intent', batch)['logits']) if cnt % 1000 <= 128: print(str(cnt)+'th example processed.') print(str(cnt)+'th example processed.') ================================================ FILE: examples/multi-task/predict_intent.py ================================================ # coding=utf-8 import paddlepalm as palm import json from paddlepalm.distribute import gpu_dev_count if __name__ == '__main__': # configs max_seqlen = 256 batch_size = 16 num_epochs = 6 print_steps = 5 num_classes = 26 vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt' predict_file = './data/atis/atis_intent/test.tsv' save_path = './outputs/' pred_output = './outputs/predict-intent/' save_type = 'ckpt' random_seed = 0 config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json')) input_dim = config['hidden_size'] # ----------------------- for prediction ----------------------- # step 1-1: create readers for prediction print('prepare to predict...') predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict') # step 1-2: load the training data predict_cls_reader.load_data(predict_file, batch_size) # step 2: create a backbone of the model to extract text features pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register the backbone in reader predict_cls_reader.register_with(pred_ernie) # step 4: create the task output head cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict') # step 5-1: create a task trainer trainer = palm.Trainer("intent") # step 5-2: build forward graph with backbone and task head trainer.build_predict_forward(pred_ernie, cls_pred_head) # step 6: load checkpoint pred_model_path = './outputs/ckpt.step4641' trainer.load_ckpt(pred_model_path) # step 7: fit prepared reader and data trainer.fit_reader(predict_cls_reader, phase='predict') # step 8: predict print('predicting..') trainer.predict(print_steps=print_steps, output_dir=pred_output) ================================================ FILE: examples/multi-task/predict_slot.py ================================================ # coding=utf-8 import paddlepalm as palm import json from paddlepalm.distribute import gpu_dev_count if __name__ == '__main__': # configs max_seqlen = 256 batch_size = 16 num_epochs = 6 print_steps = 5 num_classes = 130 label_map = './data/atis/atis_slot/label_map.json' vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt' predict_file = './data/atis/atis_slot/test.tsv' save_path = './outputs/' pred_output = './outputs/predict-slot/' save_type = 'ckpt' random_seed = 0 config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json')) input_dim = config['hidden_size'] # ----------------------- for prediction ----------------------- # step 1-1: create readers for prediction print('prepare to predict...') 
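    # note: label_map is the tag-to-id mapping produced by process.py;
    # num_classes above is assumed to match the number of entries in it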
predict_seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict') # step 1-2: load the training data predict_seq_label_reader.load_data(predict_file, batch_size) # step 2: create a backbone of the model to extract text features pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register the backbone in reader predict_seq_label_reader.register_with(pred_ernie) # step 4: create the task output head seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict') # step 5-1: create a task trainer trainer_seq_label = palm.Trainer("slot") # step 5-2: build forward graph with backbone and task head trainer_seq_label.build_predict_forward(pred_ernie, seq_label_pred_head) # step 6: load checkpoint pred_model_path = './outputs/ckpt.step4641' trainer_seq_label.load_ckpt(pred_model_path) # step 7: fit prepared reader and data trainer_seq_label.fit_reader(predict_seq_label_reader, phase='predict') # step 8: predict print('predicting..') trainer_seq_label.predict(print_steps=print_steps, output_dir=pred_output) ================================================ FILE: examples/multi-task/process.py ================================================ import os import json label_new = "data/atis/atis_slot/label_map.json" label_old = "data/atis/atis_slot/map_tag_slot_id.txt" train_old = "data/atis/atis_slot/train.txt" train_new = "data/atis/atis_slot/train.tsv" dev_old = "data/atis/atis_slot/dev.txt" dev_new = "data/atis/atis_slot/dev.tsv" test_old = "data/atis/atis_slot/test.txt" test_new = "data/atis/atis_slot/test.tsv" intent_test = "data/atis/atis_intent/test.tsv" os.rename("data/atis/atis_intent/test.txt", intent_test) intent_train = "data/atis/atis_intent/train.tsv" os.rename("data/atis/atis_intent/train.txt", intent_train) intent_dev = "data/atis/atis_intent/dev.tsv" os.rename("data/atis/atis_intent/dev.txt", intent_dev) with open(intent_dev, 'r+') as f: content = f.read() f.seek(0, 0) f.write("label\ttext_a\n"+content) f.close() with open(intent_test, 'r+') as f: content = f.read() f.seek(0, 0) f.write("label\ttext_a\n"+content) f.close() with open(intent_train, 'r+') as f: content = f.read() f.seek(0, 0) f.write("label\ttext_a\n"+content) f.close() os.mknod(label_new) os.mknod(train_new) os.mknod(dev_new) os.mknod(test_new) tag = [] id = [] map = {} with open(label_old, "r") as f: with open(label_new, "w") as f2: for line in f.readlines(): line = line.split('\t') tag.append(line[0]) id.append(int(line[1][:-1])) map[line[1][:-1]] = line[0] re = {tag[i]:id[i] for i in range(len(tag))} re = json.dumps(re) f2.write(re) f2.close() f.close() with open(train_old, "r") as f: with open(train_new, "w") as f2: f2.write("text_a\tlabel\n") for line in f.readlines(): line = line.split('\t') text = line[0].split(' ') label = line[1].split(' ') for t in text: f2.write(t) f2.write('\2') f2.write('\t') for t in label: if t.endswith('\n'): t = t[:-1] f2.write(map[t]) f2.write('\2') f2.write('\n') f2.close() f.close() with open(test_old, "r") as f: with open(test_new, "w") as f2: f2.write("text_a\tlabel\n") for line in f.readlines(): line = line.split('\t') text = line[0].split(' ') label = line[1].split(' ') for t in text: f2.write(t) f2.write('\2') f2.write('\t') for t in label: if t.endswith('\n'): t = t[:-1] f2.write(map[t]) f2.write('\2') f2.write('\n') f2.close() f.close() with open(dev_old, "r") as f: with open(dev_new, "w") as f2: f2.write("text_a\tlabel\n") for line in f.readlines(): line = 
line.split('\t') text = line[0].split(' ') label = line[1].split(' ') for t in text: f2.write(t) f2.write('\2') f2.write('\t') for t in label: if t.endswith('\n'): t = t[:-1] f2.write(map[t]) f2.write('\2') f2.write('\n') f2.close() f.close() os.remove(label_old) os.remove(train_old) os.remove(test_old) os.remove(dev_old) ================================================ FILE: examples/multi-task/run.py ================================================ # coding=utf-8 import paddlepalm as palm import json if __name__ == '__main__': # configs max_seqlen = 128 batch_size = 16 num_epochs = 20 print_steps = 5 lr = 2e-5 num_classes = 130 weight_decay = 0.01 num_classes_intent = 26 dropout_prob = 0.1 random_seed = 0 label_map = './data/atis/atis_slot/label_map.json' vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt' train_slot = './data/atis/atis_slot/train.tsv' train_intent = './data/atis/atis_intent/train.tsv' config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json')) input_dim = config['hidden_size'] # ----------------------- for training ----------------------- # step 1-1: create readers seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed) cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed) # step 1-2: load train data seq_label_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size) cls_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None) # step 2: create a backbone of the model to extract text features ernie = palm.backbone.ERNIE.from_config(config) # step 3: register readers with ernie backbone seq_label_reader.register_with(ernie) cls_reader.register_with(ernie) # step 4: create task output heads seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob) cls_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob) # step 5-1: create task trainers and multiHeadTrainer trainer_seq_label = palm.Trainer("slot", mix_ratio=1.0) trainer_cls = palm.Trainer("intent", mix_ratio=1.0) trainer = palm.MultiHeadTrainer([trainer_seq_label, trainer_cls]) # # step 5-2: build forward graph with backbone and task head loss1 = trainer_cls.build_forward(ernie, cls_head) loss2 = trainer_seq_label.build_forward(ernie, seq_label_head) loss_var = trainer.build_forward() # step 6-1*: enable warmup for better fine-tuning n_steps = seq_label_reader.num_examples * 1.5 * num_epochs // batch_size warmup_steps = int(0.1 * n_steps) sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps) # step 6-2: build a optimizer adam = palm.optimizer.Adam(loss_var, lr, sched) # step 6-3: build backward graph trainer.build_backward(optimizer=adam, weight_decay=weight_decay) # step 7: fit readers to trainer trainer.fit_readers_with_mixratio([seq_label_reader, cls_reader], "slot", num_epochs) # step 8-1*: load pretrained model trainer.load_pretrain('./pretrain/ERNIE-v2-en-base') # step 8-2*: set saver to save models during training trainer.set_saver(save_path='./outputs/', save_steps=300) # step 8-3: start training trainer.train(print_steps=10) ================================================ FILE: examples/predict/README.md ================================================ ## Example 5: Prediction This example demonstrates how to directly do prediction with PaddlePALM. You can either initialize the model from a checkpoint, a pretrained model or just randomly initialization. Here we reuse the task and data in example 1. 
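The three initialization options map onto different loading calls on the trainer. Here is a minimal sketch based on the calls used across these examples (the `run.py` below uses option 2; the assumption for option 3 is simply that no loading call is made):

```python
# pick ONE of the following after building the predict graph
# with trainer.build_predict_forward(...):

# 1) resume from a fine-tuned checkpoint
trainer.load_ckpt('./outputs/ckpt.step4641')

# 2) start from pre-trained backbone parameters only
trainer.load_predict_model('./pretrain/ERNIE-v1-zh-base/params')

# 3) random initialization: skip any loading call entirely, so the
#    freshly built graph keeps its random initial values
```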
To prepare the pre-trained model and the dataset, repeat step 1 of example 1. After you have prepared them, run:

```shell
python run.py
```

If you want to specify a specific gpu or use multiple gpus for prediction, please use **`CUDA_VISIBLE_DEVICES`**, for example:

```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```

Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**

Some logs will be shown below:

```
step 1/154, speed: 0.51 steps/s
step 2/154, speed: 3.36 steps/s
step 3/154, speed: 3.48 steps/s
```

After the run, you can view the predictions in the `outputs/predict` folder. Here are some examples of predictions:

```
{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1}
{"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0}
{"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0}
```

### Step 3: Evaluate

Once you have the prediction, you can run the evaluation script to evaluate the model:

```shell
python evaluate.py
```

The evaluation results are as follows:

```
data num: 1200
accuracy: 0.4758, precision: 0.4730, recall: 0.3026, f1: 0.3691
```

================================================
FILE: examples/predict/download.py
================================================

# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib

URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request

def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... 
{:.1%}'.format(percent), end="") URLLIB.urlretrieve(url, src, reporthook=_reporthook) abs_path = os.path.abspath(__file__) download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz" downlaod_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz") target_dir = os.path.dirname(abs_path) download(downlaod_path, download_url) tar = tarfile.open(downlaod_path) tar.extractall(target_dir) os.remove(downlaod_path) abs_path = os.path.abspath(__file__) dst_dir = os.path.join(os.path.dirname(abs_path), "data") if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir): os.makedirs(dst_dir) for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')): shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir) shutil.rmtree(os.path.join(target_dir, 'task_data')) print(" done!") ================================================ FILE: examples/predict/evaluate.py ================================================ # -*- coding: utf-8 -*- import json import numpy as np def accuracy(preds, labels): preds = np.array(preds) labels = np.array(labels) return (preds == labels).mean() def pre_recall_f1(preds, labels): preds = np.array(preds) labels = np.array(labels) # recall=TP/(TP+FN) tp = np.sum((labels == '1') & (preds == '1')) fp = np.sum((labels == '0') & (preds == '1')) fn = np.sum((labels == '1') & (preds == '0')) r = tp * 1.0 / (tp + fn) # Precision=TP/(TP+FP) p = tp * 1.0 / (tp + fp) epsilon = 1e-31 f1 = 2 * p * r / (p+r+epsilon) return p, r, f1 def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'): if eval_phase == 'test': data_dir="./data/test.tsv" elif eval_phase == 'dev': data_dir="./data/dev.tsv" else: assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test' labels = [] with open(data_dir, "r") as file: first_flag = True for line in file: line = line.split("\t") label = line[0] if label=='label': continue labels.append(str(label)) file.close() preds = [] with open(res_dir, "r") as file: for line in file.readlines(): line = json.loads(line) pred = line['label'] preds.append(str(pred)) file.close() assert len(labels) == len(preds), "prediction result doesn't match to labels" print('data num: {}'.format(len(labels))) p, r, f1 = pre_recall_f1(preds, labels) print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1)) res_evaluate() ================================================ FILE: examples/predict/run.py ================================================ # coding=utf-8 import paddlepalm as palm import json if __name__ == '__main__': # configs max_seqlen = 256 batch_size = 8 vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt' predict_file = './data/test.tsv' random_seed = 1 config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json')) input_dim = config['hidden_size'] num_classes = 2 task_name = 'chnsenticorp' pred_output = './outputs/predict/' print_steps = 20 pre_params = './pretrain/ERNIE-v1-zh-base/params' # ----------------------- for prediction ----------------------- # step 1-1: create readers for prediction print('prepare to predict...') predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict') # step 1-2: load the training data predict_cls_reader.load_data(predict_file, batch_size) # step 2: create a backbone of the model to extract text features pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register the backbone in reader 
predict_cls_reader.register_with(pred_ernie) # step 4: create the task output head cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict') # step 5-1: create a task trainer trainer = palm.Trainer(task_name) # step 5-2: build forward graph with backbone and task head trainer.build_predict_forward(pred_ernie, cls_pred_head) # step 6: load checkpoint trainer.load_predict_model(pre_params) # step 7: fit prepared reader and data trainer.fit_reader(predict_cls_reader, phase='predict') # step 8: predict print('predicting..') trainer.predict(print_steps=print_steps, output_dir=pred_output) ================================================ FILE: examples/tagging/README.md ================================================ ## Example 3: Tagging This task is a named entity recognition task. The following sections detail model preparation, dataset preparation, and how to run the task. ### Step 1: Prepare Pre-trained Models & Datasets #### Pre-trianed Model The pre-training model of this mission is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api). Make sure you have downloaded the required pre-training model in the current folder. #### Dataset This task uses the `MSRA-NER(SIGHAN2006)` dataset. Download dataset: ```shell python download.py ``` If everything goes well, there will be a folder named `data/` created with all the datas in it. The data should have 2 fields, `text_a label`, with tsv format. Here is some example datas: ``` text_a label 在 这 里 恕 弟 不 恭 之 罪 , 敢 在 尊 前 一 诤 : 前 人 论 书 , 每 曰 “ 字 字 有 来 历 , 笔 笔 有 出 处 ” , 细 读 公 字 , 何 尝 跳 出 前 人 藩 篱 , 自 隶 变 而 后 , 直 至 明 季 , 兄 有 何 新 出 ? O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 相 比 之 下 , 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 , 但 乏 善 可 陈 。 O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O 理 由 多 多 , 最 无 奈 的 却 是 : 5 月 恰 逢 双 重 考 试 , 她 攻 读 的 博 士 学 位 论 文 要 通 考 ; 她 任 教 的 两 所 学 校 , 也 要 在 这 段 时 日 大 考 。 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O ``` ### Step 2: Train & Predict The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run: ```shell python run.py ``` If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example: ```shell CUDA_VISIBLE_DEVICES=0,1 python run.py ``` Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.** Some logs will be shown below: ``` step 1/652 (epoch 0), loss: 216.002, speed: 0.32 steps/s step 2/652 (epoch 0), loss: 202.567, speed: 1.28 steps/s step 3/652 (epoch 0), loss: 170.677, speed: 1.05 steps/s ``` After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. 
Here are some examples of predictions: ``` [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6] [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6] [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6] ``` ### Step 3: Evaluate Once you have the prediction, you can run the evaluation script to evaluate the model: ```python python evaluate.py ``` The evaluation results are as follows: ``` data num: 4636 f1: 0.9918 ``` ================================================ FILE: examples/tagging/download.py ================================================ # -*- coding: utf-8 -*- from __future__ import print_function import os import tarfile import shutil import sys import urllib URLLIB=urllib if sys.version_info >= (3, 0): import urllib.request URLLIB=urllib.request def download(src, url): def _reporthook(count, chunk_size, total_size): bytes_so_far = count * chunk_size percent = float(bytes_so_far) / float(total_size) if percent > 1: percent = 1 print('\r>> Downloading... {:.1%}'.format(percent), end="") URLLIB.urlretrieve(url, src, reporthook=_reporthook) abs_path = os.path.abspath(__file__) download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz" downlaod_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz") target_dir = os.path.dirname(abs_path) download(downlaod_path, download_url) tar = tarfile.open(downlaod_path) tar.extractall(target_dir) os.remove(downlaod_path) abs_path = os.path.abspath(__file__) dst_dir = os.path.join(os.path.dirname(abs_path), "data") if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir): os.makedirs(dst_dir) for file in os.listdir(os.path.join(target_dir, 'task_data', 'msra_ner')): shutil.move(os.path.join(target_dir, 'task_data', 'msra_ner', file), dst_dir) shutil.rmtree(os.path.join(target_dir, 'task_data')) print(" done!") ================================================ FILE: examples/tagging/evaluate.py ================================================ # -*- coding: utf-8 -*- import json def load_label_map(map_dir="./data/label_map.json"): """ :param map_dir: dict indictuing chunk type :return: """ return json.load(open(map_dir, "r")) def cal_chunk(pred_label, refer_label): tp = dict() fn = dict() fp = dict() for i in range(len(refer_label)): if refer_label[i] == pred_label[i]: if refer_label[i] not in tp: tp[refer_label[i]] = 0 tp[refer_label[i]] += 1 else: if pred_label[i] not in fp: fp[pred_label[i]] = 0 fp[pred_label[i]] += 1 if refer_label[i] not in fn: fn[refer_label[i]] = 0 fn[refer_label[i]] += 1 tp_total = sum(tp.values()) fn_total = sum(fn.values()) fp_total = sum(fp.values()) p_total = float(tp_total) / (tp_total + fp_total) r_total = float(tp_total) / (tp_total + fn_total) f_micro = 2 * p_total * r_total / (p_total + r_total) return f_micro def res_evaluate(res_dir="./outputs/predict/predictions.json", data_dir="./data/test.tsv"): label_map = load_label_map() total_label = [] with open(data_dir, "r") as file: first_flag = True for line in file: if first_flag: first_flag = False continue line = line.strip("\n") if len(line) == 0: continue line = line.split("\t") if len(line) < 2: continue labels = line[1].split("\x02") total_label.append(labels) total_label = [[label_map[j] for j in i] for i in total_label] total_res = [] with open(res_dir, "r") as file: cnt = 0 for line in file: line 
= line.strip("\n") if len(line) == 0: continue try: res_arr = json.loads(line) if len(total_label[cnt]) < len(res_arr): total_res.append(res_arr[1: 1 + len(total_label[cnt])]) elif len(total_label[cnt]) == len(res_arr): total_res.append(res_arr) else: total_res.append(res_arr) total_label[cnt] = total_label[cnt][: len(res_arr)] except: print("json format error: {}".format(cnt)) print(line) cnt += 1 total_res_equal = [] total_label_equal = [] assert len(total_label) == len(total_res), "prediction result doesn't match to labels" for i in range(len(total_label)): num = len(total_label[i]) total_label_equal.extend(total_label[i]) total_res[i] = total_res[i][:num] total_res_equal.extend(total_res[i]) f1 = cal_chunk(total_res_equal, total_label_equal) print('data num: {}'.format(len(total_label))) print("f1: {:.4f}".format(f1)) res_evaluate() ================================================ FILE: examples/tagging/run.py ================================================ # coding=utf-8 import paddlepalm as palm import json if __name__ == '__main__': # configs max_seqlen = 256 batch_size = 16 num_epochs = 6 lr = 5e-5 num_classes = 7 weight_decay = 0.01 dropout_prob = 0.1 vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt' label_map = './data/label_map.json' random_seed = 1 train_file = './data/train.tsv' predict_file = './data/test.tsv' save_path='./outputs/' save_type='ckpt' pre_params = './pretrain/ERNIE-v1-zh-base/params' config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json')) input_dim = config['hidden_size'] task_name = 'msra_ner' pred_output = './outputs/predict/' train_print_steps = 10 pred_print_steps = 20 # ----------------------- for training ----------------------- # step 1-1: create readers for training seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed) # step 1-2: load the training data seq_label_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size) # step 2: create a backbone of the model to extract text features ernie = palm.backbone.ERNIE.from_config(config) # step 3: register the backbone in reader seq_label_reader.register_with(ernie) # step 4: create the task output head seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob) # step 5-1: create a task trainer trainer = palm.Trainer(task_name) # step 5-2: build forward graph with backbone and task head loss_var = trainer.build_forward(ernie, seq_label_head) # step 6-1*: use warmup n_steps = seq_label_reader.num_examples * num_epochs // batch_size warmup_steps = int(0.1 * n_steps) print('total_steps: {}'.format(n_steps)) print('warmup_steps: {}'.format(warmup_steps)) sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps) # step 6-2: create a optimizer adam = palm.optimizer.Adam(loss_var, lr, sched) # step 6-3: build backward trainer.build_backward(optimizer=adam, weight_decay=weight_decay) # step 7: fit prepared reader and data trainer.fit_reader(seq_label_reader) # step 8-1*: load pretrained parameters trainer.load_pretrain(pre_params) # step 8-2*: set saver to save model save_steps = 1951 # print('save_steps: {}'.format(save_steps)) trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type) # # step 8-3: start training trainer.train(print_steps=train_print_steps) # ----------------------- for prediction ----------------------- # step 1-1: create readers for prediction print('prepare to predict...') predict_seq_label_reader = 
palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict') # step 1-2: load the training data predict_seq_label_reader.load_data(predict_file, batch_size) # step 2: create a backbone of the model to extract text features pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict') # step 3: register the backbone in reader predict_seq_label_reader.register_with(pred_ernie) # step 4: create the task output head seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict') # step 5: build forward graph with backbone and task head trainer.build_predict_forward(pred_ernie, seq_label_pred_head) # step 6: load checkpoint pred_model_path = './outputs/ckpt.step' + str(save_steps) trainer.load_ckpt(pred_model_path) # step 7: fit prepared reader and data trainer.fit_reader(predict_seq_label_reader, phase='predict') # step 8: predict print('predicting..') trainer.predict(print_steps=pred_print_steps, output_dir=pred_output) ================================================ FILE: examples/train_with_eval/README.md ================================================ ## Train with Evaluation version of Example 1: Classification This task is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task. Here to demonstrate how to do evaluation during training in PaddlePALM. ### Step 1: Prepare Pre-trained Model & Dataset #### Pre-trained Model The pre-training model of this mission is: [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api). Make sure you have downloaded the required pre-training model in the current folder. #### Dataset This example demonstrates with [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset. Download dataset: ```shell python download.py ``` If everything goes well, there will be a folder named `data/` created with all the data files in it. The dataset file (for training) should have 2 fields, `text_a` and `label`, stored with [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here shows an example: ``` label text_a 0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。 0 XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件! 1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦! ``` ### Step 2: Train & Predict The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run: ```shell python run.py ``` If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example: ```shell CUDA_VISIBLE_DEVICES=0,1 python run.py ``` Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.** Some logs will be shown below: ``` step 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s step 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s step 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s ``` After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. 
Here are some examples of predictions: ``` {"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1} {"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0} {"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0} ``` ### Step 3: Evaluate Once you have the prediction, you can run the evaluation script to evaluate the model: ```shell python evaluate.py ``` The evaluation results are as follows: ``` data num: 1200 accuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578 ``` ================================================ FILE: examples/train_with_eval/download.py ================================================ # -*- coding: utf-8 -*- from __future__ import print_function import os import tarfile import shutil import sys import urllib URLLIB=urllib if sys.version_info >= (3, 0): import urllib.request URLLIB=urllib.request def download(src, url): def _reporthook(count, chunk_size, total_size): bytes_so_far = count * chunk_size percent = float(bytes_so_far) / float(total_size) if percent > 1: percent = 1 print('\r>> Downloading... {:.1%}'.format(percent), end="") URLLIB.urlretrieve(url, src, reporthook=_reporthook) abs_path = os.path.abspath(__file__) download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz" downlaod_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz") target_dir = os.path.dirname(abs_path) download(downlaod_path, download_url) tar = tarfile.open(downlaod_path) tar.extractall(target_dir) os.remove(downlaod_path) abs_path = os.path.abspath(__file__) dst_dir = os.path.join(os.path.dirname(abs_path), "data") if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir): os.makedirs(dst_dir) for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')): shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir) shutil.rmtree(os.path.join(target_dir, 'task_data')) print(" done!") ================================================ FILE: examples/train_with_eval/evaluate.py ================================================ # -*- coding: utf-8 -*- import json import numpy as np def accuracy(preds, labels): preds = np.array(preds) labels = np.array(labels) return (preds == labels).mean() def pre_recall_f1(preds, labels): preds = np.array(preds) labels = np.array(labels) # recall=TP/(TP+FN) tp = np.sum((labels == '1') & (preds == '1')) fp = np.sum((labels == '0') & (preds == '1')) fn = np.sum((labels == '1') & (preds == '0')) r = tp * 1.0 / (tp + fn) # Precision=TP/(TP+FP) p = tp * 1.0 / (tp + fp) epsilon = 1e-31 f1 = 2 * p * r / (p+r+epsilon) return p, r, f1 def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'): if eval_phase == 'test': data_dir="./data/test.tsv" elif eval_phase == 'dev': data_dir="./data/dev.tsv" else: assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test' labels = [] with open(data_dir, "r") as file: first_flag = True for line in file: line = line.split("\t") label = line[0] if label=='label': continue labels.append(str(label)) file.close() preds = [] with open(res_dir, "r") as file: for line in file.readlines(): line = json.loads(line) pred = line['label'] preds.append(str(pred)) file.close() assert len(labels) == len(preds), "prediction result doesn't match to labels" print('data num: {}'.format(len(labels))) p, r, f1 = pre_recall_f1(preds, 
labels) print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1)) res_evaluate() ================================================ FILE: examples/train_with_eval/run.py ================================================ # coding=utf-8 import paddlepalm as palm import json if __name__ == '__main__': # configs max_seqlen = 256 batch_size = 8 num_epochs = 10 lr = 5e-5 weight_decay = 0.01 vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt' train_file = './data/train.tsv' predict_file = './data/test.tsv' config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json')) input_dim = config['hidden_size'] num_classes = 2 dropout_prob = 0.1 random_seed = 1 task_name = 'chnsenticorp' save_path = './outputs/' pred_output = './outputs/predict/' save_type = 'ckpt' print_steps = 20 pre_params = './pretrain/ERNIE-v1-zh-base/params' # ----------------------- for training ----------------------- # step 1-1: create readers for training cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed) # step 1-2: load the training data cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs) # step 2: create a backbone of the model to extract text features ernie = palm.backbone.ERNIE.from_config(config) # step 3: register the backbone in reader cls_reader.register_with(ernie) # step 4: create the task output head cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob) # step 5-1: create a task trainer trainer = palm.Trainer(task_name) # step 5-2: build forward graph with backbone and task head loss_var = trainer.build_forward(ernie, cls_head) # step 6-1*: use warmup n_steps = cls_reader.num_examples * num_epochs // batch_size warmup_steps = int(0.1 * n_steps) sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps) # step 6-2: create an optimizer adam = palm.optimizer.Adam(loss_var, lr, sched) # step 6-3: build backward trainer.build_backward(optimizer=adam, weight_decay=weight_decay) # step 7: fit prepared reader and data iterator = trainer.fit_reader(cls_reader) # step 8-1*: load pretrained parameters trainer.load_pretrain(pre_params) # step 8-2*: set saver to save model # save_steps = n_steps save_steps = 2396 trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type) # step 8-3: start training # you can repeatedly fetch one training batch with trainer.get_one_batch() # batch = trainer.get_one_batch() for step, batch in enumerate(iterator, start=1): trainer.train_one_step(batch) if step % 100 == 0: print('do evaluation.') # insert evaluation code here (see the sketch in the README above) ================================================ FILE: paddlepalm/__init__.py ================================================ from . import downloader # from mtl_controller import Controller #import controller from . import optimizer from . import lr_sched from . import backbone from . import reader from . import head from .trainer import Trainer from .multihead_trainer import MultiHeadTrainer #del interface #del task_instance #del default_settings #del utils ================================================ FILE: paddlepalm/_downloader.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from __future__ import print_function import os import tarfile import shutil from collections import OrderedDict import sys import urllib URLLIB=urllib if sys.version_info >= (3, 0): import urllib.request URLLIB=urllib.request __all__ = ["download", "ls"] _pretrain = (('RoBERTa-zh-base', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_ext_L-12_H-768_A-12.tar.gz'), ('RoBERTa-zh-large', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.tar.gz'), ('ERNIE-v2-en-base', 'https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz'), ('ERNIE-v2-en-large', 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz'), ('XLNet-cased-base','https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz'), ('XLNet-cased-large','https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz'), ('ERNIE-v1-zh-base','https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz'), ('ERNIE-v1-zh-base-max-len-512','https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz'), ('BERT-en-uncased-large-whole-word-masking','https://bert-models.bj.bcebos.com/wwm_uncased_L-24_H-1024_A-16.tar.gz'), ('BERT-en-cased-large-whole-word-masking','https://bert-models.bj.bcebos.com/wwm_cased_L-24_H-1024_A-16.tar.gz'), ('BERT-en-uncased-base', 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz'), ('BERT-en-uncased-large', 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz'), ('BERT-en-cased-base','https://bert-models.bj.bcebos.com/cased_L-12_H-768_A-12.tar.gz'), ('BERT-en-cased-large','https://bert-models.bj.bcebos.com/cased_L-24_H-1024_A-16.tar.gz'), ('BERT-multilingual-uncased-base','https://bert-models.bj.bcebos.com/multilingual_L-12_H-768_A-12.tar.gz'), ('BERT-multilingual-cased-base','https://bert-models.bj.bcebos.com/multi_cased_L-12_H-768_A-12.tar.gz'), ('BERT-zh-base','https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz'), ('utils', None)) _vocab = (('utils', None),('utils', None)) _backbone =(('utils', None),('utils', None)) _head = (('utils', None),('utils', None)) _reader = (('utils', None),('utils', None)) _items = (('pretrain', OrderedDict(_pretrain)), ('vocab', OrderedDict(_vocab)), ('backbone', OrderedDict(_backbone)), ('head', OrderedDict(_head)), ('reader', OrderedDict(_reader)) ) _items = OrderedDict(_items) def _download(item, scope, path, silent=False, convert=False): data_url = _items[item][scope] if data_url == None: return if not silent: print('Downloading {}: {} from {}...'.format(item, scope, data_url)) data_dir = path + '/' + item + '/' + scope if not os.path.exists(data_dir): os.makedirs(os.path.join(data_dir)) data_name = data_url.split('/')[-1] filename = data_dir + '/' + data_name # print process def _reporthook(count, chunk_size, total_size): bytes_so_far = count * chunk_size percent = float(bytes_so_far) / float(total_size) if percent > 1: percent = 1 if not silent: print('\r>> Downloading... 
{:.1%}'.format(percent), end = "") URLLIB.urlretrieve(data_url, filename, reporthook=_reporthook) if not silent: print(' done!') if item == 'pretrain': if not silent: print ('Extracting {}...'.format(data_name), end=" ") if os.path.exists(filename): tar = tarfile.open(filename, 'r') tar.extractall(path = data_dir) tar.close() os.remove(filename) if len(os.listdir(data_dir))==1: source_path = data_dir + '/' + data_name.split('.')[0] fileList = os.listdir(source_path) for file in fileList: filePath = os.path.join(source_path, file) shutil.move(filePath, data_dir) os.removedirs(source_path) if not silent: print ('done!') if convert: if not silent: print ('Converting params...', end=" ") _convert(data_dir, silent) if not silent: print ('done!') def _convert(path, silent=False): if os.path.isfile(path + '/params/__palminfo__'): if not silent: print ('already converted.') else: if os.path.exists(path + '/params/'): os.rename(path + '/params/', path + '/params1/') os.mkdir(path + '/params/') tar_model = tarfile.open(path + '/params/' + '__palmmodel__', 'w') tar_info = open(path + '/params/'+ '__palminfo__', 'w') for root, dirs, files in os.walk(path + '/params1/'): for file in files: src_file = os.path.join(root, file) tar_model.add(src_file, '__paddlepalm_' + file) tar_info.write('__paddlepalm_' + file) os.remove(src_file) tar_model.close() tar_info.close() os.removedirs(path + '/params1/') def download(item, scope='all', path='.'): """download an item. The available scopes and contained items can be showed with `paddlepalm.downloader.ls`. Args: item: the item to download. scope: the scope of the item to download. path: the target dir to download to. Default is `.`, means current dir. """ # item = item.lower() # scope = scope.lower() assert item in _items, '{} is not found. Support list: {}'.format(item, list(_items.keys())) if _items[item]['utils'] is not None: _download(item, 'utils', path, silent=True) if scope != 'all': assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys())) _download(item, scope, path) else: for s in _items[item].keys(): _download(item, s, path) def _ls(item, scope, l = 10): if scope != 'all': assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys())) print ('{}'.format(scope)) else: for s in _items[item].keys(): if s == 'utils': continue print (' => '+s) def ls(item='all', scope='all'): if scope == 'utils': return if item != 'all': assert item in _items, '{} is not found. Support scopes: {}'.format(item, list(_items.keys())) print ('Available {} items:'.format(item)) _ls(item, scope) else: l = max(map(len, _items.keys())) for i in _items.keys(): print ('Available {} items: '.format(i)) _ls(i, scope, l) ================================================ FILE: paddlepalm/backbone/README.md ================================================ ================================================ FILE: paddlepalm/backbone/__init__.py ================================================ from .ernie import ERNIE from .bert import BERT ================================================ FILE: paddlepalm/backbone/base_backbone.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. class Backbone(object): """interface of backbone model.""" def __init__(self, phase): """Constructs a backbone network. The constructor must take at least a `phase` argument. Note: a subclass constructor must call this base constructor so that the framework's built-in member variables are created. Args: phase: str. Distinguishes the running stage the backbone is invoked in; currently the training stage 'train' and the prediction stage 'predict' are supported. """ assert phase in ['train', 'predict'] @property def inputs_attr(self): """Declares the attributes (name, shape and dtype) of every input object this backbone expects from the reader. For objects of scalar type (str, int, float, ...) the shape is an empty list []; dimensions of variable length are set to -1. Return: dict. The attribute description of each input object. For example, for text classification and matching tasks, the reader objects a BERT backbone depends on mainly include {"token_ids": ([-1, max_len], 'int64'), "input_ids": ([-1, max_len], 'int64'), "segment_ids": ([-1, max_len], 'int64'), "input_mask": ([-1, max_len], 'float32')}""" raise NotImplementedError() @property def outputs_attr(self): """Declares the attributes (name, shape and dtype) of every object this backbone outputs. For objects of scalar type (str, int, float, ...) the shape is an empty list []; dimensions of variable length are set to -1. Return: dict. The attribute description of each output object. For example, for text classification and matching tasks, the outputs of a BERT backbone may include {"word_emb": ([-1, max_seqlen, word_emb_size], 'float32'), "sentence_emb": ([-1, hidden_size], 'float32'), "sim_vec": ([-1, hidden_size], 'float32')}""" raise NotImplementedError() def build(self, inputs): """Builds the computation graph of the backbone, mapping static-graph Variables that match inputs_attr into output Variables that match outputs_attr. Args: inputs: dict. Maps the object names in inputs_attr to computation-graph Variables; inputs contains at least the objects defined in inputs_attr. Return: the computation-graph variables to output. The output objects are added to the fetch_list, so their runtime values are computed at every training/inference step and passed to the postprocess method for user handling. """ raise NotImplementedError() ================================================ FILE: paddlepalm/backbone/bert.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """v1.1 BERT model.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function from paddle import fluid from paddle.fluid import layers from paddlepalm.backbone.utils.transformer import pre_process_layer, encoder from paddlepalm.backbone.base_backbone import Backbone class BERT(Backbone): def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \ max_position_embeddings, type_vocab_size, hidden_act, hidden_dropout_prob, \ attention_probs_dropout_prob, initializer_range, is_pairwise=False, phase='train'): self._emb_size = hidden_size self._n_layer = num_hidden_layers self._n_head = num_attention_heads self._voc_size = vocab_size self._max_position_seq_len = max_position_embeddings self._sent_types = type_vocab_size self._hidden_act = hidden_act self._prepostprocess_dropout = 0.
if phase == 'predict' else hidden_dropout_prob self._attention_dropout = 0. if phase == 'predict' else attention_probs_dropout_prob self._word_emb_name = "word_embedding" self._pos_emb_name = "pos_embedding" self._sent_emb_name = "sent_embedding" self._task_emb_name = "task_embedding" self._emb_dtype = "float32" self._phase = phase self._is_pairwise = is_pairwise self._param_initializer = fluid.initializer.TruncatedNormal( scale=initializer_range) @classmethod def from_config(cls, config, phase='train'): assert 'hidden_size' in config, "{} is required to initialize BERT".format('hidden_size') assert 'num_hidden_layers' in config, "{} is required to initialize BERT".format('num_hidden_layers') assert 'num_attention_heads' in config, "{} is required to initialize BERT".format('num_attention_heads') assert 'vocab_size' in config, "{} is required to initialize BERT".format('vocab_size') assert 'max_position_embeddings' in config, "{} is required to initialize BERT".format('max_position_embeddings') assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, \ "{} is required to initialize BERT".format('type_vocab_size') assert 'hidden_act' in config, "{} is required to initialize BERT".format('hidden_act') assert 'hidden_dropout_prob' in config, "{} is required to initialize BERT".format('hidden_dropout_prob') assert 'attention_probs_dropout_prob' in config, \ "{} is required to initialize BERT".format('attention_probs_dropout_prob') assert 'initializer_range' in config, "{} is required to initialize BERT".format('initializer_range') hidden_size = config['hidden_size'] num_hidden_layers = config['num_hidden_layers'] num_attention_heads = config['num_attention_heads'] vocab_size = config['vocab_size'] max_position_embeddings = config['max_position_embeddings'] if 'sent_type_vocab_size' in config: sent_type_vocab_size = config['sent_type_vocab_size'] else: sent_type_vocab_size = config['type_vocab_size'] hidden_act = config['hidden_act'] hidden_dropout_prob = config['hidden_dropout_prob'] attention_probs_dropout_prob = config['attention_probs_dropout_prob'] initializer_range = config['initializer_range'] if 'is_pairwise' in config: is_pairwise = config['is_pairwise'] else: is_pairwise = False return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \ max_position_embeddings, sent_type_vocab_size, \ hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, phase) @property def inputs_attr(self): ret = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], } if self._is_pairwise and self._phase=='train': ret.update({"token_ids_neg": [[-1, -1], 'int64'], "position_ids_neg": [[-1, -1], 'int64'], "segment_ids_neg": [[-1, -1], 'int64'], "input_mask_neg": [[-1, -1, 1], 'float32'], }) return ret @property def outputs_attr(self): ret = {"word_embedding": [[-1, -1, self._emb_size], 'float32'], "embedding_table": [[-1, self._voc_size, self._emb_size], 'float32'], "encoder_outputs": [[-1, -1, self._emb_size], 'float32'], "sentence_embedding": [[-1, self._emb_size], 'float32'], "sentence_pair_embedding": [[-1, self._emb_size], 'float32']} if self._is_pairwise and self._phase == 'train': ret.update({"word_embedding_neg": [[-1, -1, self._emb_size], 'float32'], "encoder_outputs_neg": [[-1, -1, self._emb_size], 'float32'], "sentence_embedding_neg": [[-1, self._emb_size], 'float32'], "sentence_pair_embedding_neg": [[-1, self._emb_size], 'float32']})
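# note: in pairwise training the backbone encodes the negative sample in a second pass and exposes its features through the *_neg outputs declared above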
return ret def build(self, inputs, scope_name=""): src_ids = inputs['token_ids'] pos_ids = inputs['position_ids'] sent_ids = inputs['segment_ids'] input_mask = inputs['input_mask'] self._emb_dtype = 'float32' input_buffer = {} output_buffer = {} input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask] output_buffer['base'] = {} if self._is_pairwise and self._phase =='train': src_ids = inputs['token_ids_neg'] pos_ids = inputs['position_ids_neg'] sent_ids = inputs['segment_ids_neg'] input_mask = inputs['input_mask_neg'] input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask] output_buffer['neg'] = {} for key, (src_ids, pos_ids, sent_ids, input_mask) in input_buffer.items(): # padding id in vocabulary must be set to 0 emb_out = fluid.embedding( input=src_ids, size=[self._voc_size, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._word_emb_name, initializer=self._param_initializer), is_sparse=False) # fluid.global_scope().find_var('backbone-word_embedding').get_tensor() embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name) position_emb_out = fluid.embedding( input=pos_ids, size=[self._max_position_seq_len, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._pos_emb_name, initializer=self._param_initializer)) sent_emb_out = fluid.embedding( sent_ids, size=[self._sent_types, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._sent_emb_name, initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out emb_out = pre_process_layer( emb_out, 'nd', self._prepostprocess_dropout, name=scope_name+'pre_encoder') self_attn_mask = fluid.layers.matmul( x=input_mask, y=input_mask, transpose_y=True) self_attn_mask = fluid.layers.scale( x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) n_head_self_attn_mask = fluid.layers.stack( x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True enc_out = encoder( enc_input=emb_out, attn_bias=n_head_self_attn_mask, n_layer=self._n_layer, n_head=self._n_head, d_key=self._emb_size // self._n_head, d_value=self._emb_size // self._n_head, d_model=self._emb_size, d_inner_hid=self._emb_size * 4, prepostprocess_dropout=self._prepostprocess_dropout, attention_dropout=self._attention_dropout, relu_dropout=0, hidden_act=self._hidden_act, preprocess_cmd="", postprocess_cmd="dan", param_initializer=self._param_initializer, name=scope_name+'encoder') next_sent_feat = fluid.layers.slice( input=enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]]) next_sent_feat = fluid.layers.fc( input=next_sent_feat, size=self._emb_size, act="tanh", param_attr=fluid.ParamAttr( name=scope_name+"pooled_fc.w_0", initializer=self._param_initializer), bias_attr=scope_name+"pooled_fc.b_0") output_buffer[key]['word_embedding'] = emb_out output_buffer[key]['encoder_outputs'] = enc_out output_buffer[key]['sentence_embedding'] = next_sent_feat output_buffer[key]['sentence_pair_embedding'] = next_sent_feat ret = {} ret['embedding_table'] = embedding_table ret['word_embedding'] = output_buffer['base']['word_embedding'] ret['encoder_outputs'] = output_buffer['base']['encoder_outputs'] ret['sentence_embedding'] = output_buffer['base']['sentence_embedding'] ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding'] if self._is_pairwise and self._phase == 
'train': ret['word_embedding_neg'] = output_buffer['neg']['word_embedding'] ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs'] ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding'] ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding'] return ret def postprocess(self, rt_outputs): pass class Model(BERT): """BERT wrapper for ConfigController""" def __init__(self, config, phase): BERT.from_config(config, phase=phase) ================================================ FILE: paddlepalm/backbone/ernie.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Ernie model.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function from __future__ import unicode_literals from paddle import fluid from paddle.fluid import layers from paddlepalm.backbone.utils.transformer import pre_process_layer, encoder from paddlepalm.backbone.base_backbone import Backbone class ERNIE(Backbone): def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \ max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \ hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise=False, use_task_emb=True, phase='train'): # self._is_training = phase == 'train' # the backbone usually doesn't need to care about the running phase, since its outputs barely change across phases self._emb_size = hidden_size self._n_layer = num_hidden_layers self._n_head = num_attention_heads self._voc_size = vocab_size self._max_position_seq_len = max_position_embeddings self._sent_types = sent_type_vocab_size self._task_types = task_type_vocab_size self._hidden_act = hidden_act self._prepostprocess_dropout = 0. if phase == 'predict' else hidden_dropout_prob self._attention_dropout = 0.
if phase == 'predict' else attention_probs_dropout_prob self._word_emb_name = "word_embedding" self._pos_emb_name = "pos_embedding" self._sent_emb_name = "sent_embedding" self._task_emb_name = "task_embedding" self._emb_dtype = "float32" self._is_pairwise = is_pairwise self._use_task_emb = use_task_emb self._phase=phase self._param_initializer = fluid.initializer.TruncatedNormal( scale=initializer_range) @classmethod def from_config(cls, config, phase='train'): assert 'hidden_size' in config, "{} is required to initialize ERNIE".format('hidden_size') assert 'num_hidden_layers' in config, "{} is required to initialize ERNIE".format('num_hidden_layers') assert 'num_attention_heads' in config, "{} is required to initialize ERNIE".format('num_attention_heads') assert 'vocab_size' in config, "{} is required to initialize ERNIE".format('vocab_size') assert 'max_position_embeddings' in config, "{} is required to initialize ERNIE".format('max_position_embeddings') assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, "{} is required to initialize ERNIE".format('sent_type_vocab_size') # assert 'task_type_vocab_size' in config, "{} is required to initialize ERNIE".format('task_type_vocab_size') assert 'hidden_act' in config, "{} is required to initialize ERNIE".format('hidden_act') assert 'hidden_dropout_prob' in config, "{} is required to initialize ERNIE".format('hidden_dropout_prob') assert 'attention_probs_dropout_prob' in config, "{} is required to initialize ERNIE".format('attention_probs_dropout_prob') assert 'initializer_range' in config, "{} is required to initialize ERNIE".format('initializer_range') hidden_size = config['hidden_size'] num_hidden_layers = config['num_hidden_layers'] num_attention_heads = config['num_attention_heads'] vocab_size = config['vocab_size'] max_position_embeddings = config['max_position_embeddings'] if 'sent_type_vocab_size' in config: sent_type_vocab_size = config['sent_type_vocab_size'] else: sent_type_vocab_size = config['type_vocab_size'] if 'task_type_vocab_size' in config: task_type_vocab_size = config['task_type_vocab_size'] else: task_type_vocab_size = config['type_vocab_size'] if 'use_task_emb' in config: use_task_emb = config['use_task_emb'] else: use_task_emb = True hidden_act = config['hidden_act'] hidden_dropout_prob = config['hidden_dropout_prob'] attention_probs_dropout_prob = config['attention_probs_dropout_prob'] initializer_range = config['initializer_range'] if 'is_pairwise' in config: is_pairwise = config['is_pairwise'] else: is_pairwise = False return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \ max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \ hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, use_task_emb=use_task_emb, phase=phase) @property def inputs_attr(self): ret = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], "task_ids": [[-1,-1], 'int64']} if self._is_pairwise and self._phase=='train': ret.update({"token_ids_neg": [[-1, -1], 'int64'], "position_ids_neg": [[-1, -1], 'int64'], "segment_ids_neg": [[-1, -1], 'int64'], "input_mask_neg": [[-1, -1, 1], 'float32'], "task_ids_neg": [[-1,-1], 'int64'] }) return ret @property def outputs_attr(self): ret = {"word_embedding": [[-1, -1, self._emb_size], 'float32'], "embedding_table": [[-1, self._voc_size, self._emb_size], 'float32'], "encoder_outputs": [[-1, -1, self._emb_size], 'float32'], 
"sentence_embedding": [[-1, self._emb_size], 'float32'], "sentence_pair_embedding": [[-1, self._emb_size], 'float32']} if self._is_pairwise and self._phase == 'train': ret.update({"word_embedding_neg": [[-1, -1, self._emb_size], 'float32'], "encoder_outputs_neg": [[-1, -1, self._emb_size], 'float32'], "sentence_embedding_neg": [[-1, self._emb_size], 'float32'], "sentence_pair_embedding_neg": [[-1, self._emb_size], 'float32']}) return ret def build(self, inputs, scope_name=""): src_ids = inputs['token_ids'] pos_ids = inputs['position_ids'] sent_ids = inputs['segment_ids'] input_mask = inputs['input_mask'] task_ids = inputs['task_ids'] input_buffer = {} output_buffer = {} input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids] output_buffer['base'] = {} if self._is_pairwise and self._phase =='train': src_ids = inputs['token_ids_neg'] pos_ids = inputs['position_ids_neg'] sent_ids = inputs['segment_ids_neg'] input_mask = inputs['input_mask_neg'] task_ids = inputs['task_ids_neg'] input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids] output_buffer['neg'] = {} for key, (src_ids, pos_ids, sent_ids, input_mask, task_ids) in input_buffer.items(): # padding id in vocabulary must be set to 0 emb_out = fluid.embedding( input=src_ids, size=[self._voc_size, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._word_emb_name, initializer=self._param_initializer), is_sparse=False) # fluid.global_scope().find_var('backbone-word_embedding').get_tensor() embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name) position_emb_out = fluid.embedding( input=pos_ids, size=[self._max_position_seq_len, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._pos_emb_name, initializer=self._param_initializer)) sent_emb_out = fluid.embedding( sent_ids, size=[self._sent_types, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._sent_emb_name, initializer=self._param_initializer)) emb_out = emb_out + position_emb_out emb_out = emb_out + sent_emb_out if self._use_task_emb: task_emb_out = fluid.embedding( task_ids, size=[self._task_types, self._emb_size], dtype=self._emb_dtype, param_attr=fluid.ParamAttr( name=scope_name+self._task_emb_name, initializer=self._param_initializer)) emb_out = emb_out + task_emb_out emb_out = pre_process_layer( emb_out, 'nd', self._prepostprocess_dropout, name=scope_name+'pre_encoder') self_attn_mask = fluid.layers.matmul( x=input_mask, y=input_mask, transpose_y=True) self_attn_mask = fluid.layers.scale( x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) n_head_self_attn_mask = fluid.layers.stack( x=[self_attn_mask] * self._n_head, axis=1) n_head_self_attn_mask.stop_gradient = True enc_out = encoder( enc_input=emb_out, attn_bias=n_head_self_attn_mask, n_layer=self._n_layer, n_head=self._n_head, d_key=self._emb_size // self._n_head, d_value=self._emb_size // self._n_head, d_model=self._emb_size, d_inner_hid=self._emb_size * 4, prepostprocess_dropout=self._prepostprocess_dropout, attention_dropout=self._attention_dropout, relu_dropout=0, hidden_act=self._hidden_act, preprocess_cmd="", postprocess_cmd="dan", param_initializer=self._param_initializer, name=scope_name+'encoder') next_sent_feat = fluid.layers.slice( input=enc_out, axes=[1], starts=[0], ends=[1]) next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]]) next_sent_feat = fluid.layers.fc( input=next_sent_feat, 
size=self._emb_size, act="tanh", param_attr=fluid.ParamAttr( name=scope_name+"pooled_fc.w_0", initializer=self._param_initializer), bias_attr=scope_name+"pooled_fc.b_0") output_buffer[key]['word_embedding'] = emb_out output_buffer[key]['encoder_outputs'] = enc_out output_buffer[key]['sentence_embedding'] = next_sent_feat output_buffer[key]['sentence_pair_embedding'] = next_sent_feat ret = {} ret['embedding_table'] = embedding_table ret['word_embedding'] = output_buffer['base']['word_embedding'] ret['encoder_outputs'] = output_buffer['base']['encoder_outputs'] ret['sentence_embedding'] = output_buffer['base']['sentence_embedding'] ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding'] if self._is_pairwise and self._phase == 'train': ret['word_embedding_neg'] = output_buffer['neg']['word_embedding'] ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs'] ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding'] ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding'] return ret def postprocess(self, rt_outputs): pass class Model(ERNIE): def __init__(self, config, phase): ERNIE.from_config(config, phase=phase) ================================================ FILE: paddlepalm/backbone/utils/__init__.py ================================================ ================================================ FILE: paddlepalm/backbone/utils/transformer.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
"""Transformer encoder.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function from functools import partial import paddle.fluid as fluid import paddle.fluid.layers as layers from paddle.fluid.layer_helper import LayerHelper as LayerHelper from functools import reduce # py3 def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None): helper = LayerHelper('layer_norm', **locals()) mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True) shift_x = layers.elementwise_sub(x=x, y=mean, axis=0) variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True) r_stdev = layers.rsqrt(variance + epsilon) norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0) param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])] param_dtype = norm_x.dtype scale = helper.create_parameter( attr=param_attr, shape=param_shape, dtype=param_dtype, default_initializer=fluid.initializer.Constant(1.)) bias = helper.create_parameter( attr=bias_attr, shape=param_shape, dtype=param_dtype, is_bias=True, default_initializer=fluid.initializer.Constant(0.)) out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1) out = layers.elementwise_add(x=out, y=bias, axis=-1) return out def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0., cache=None, param_initializer=None, name='multi_head_att'): """ Multi-Head Attention. Note that attn_bias is added to the logit before computing softmax activiation to mask certain selected positions so that they will not considered in attention weights. """ keys = queries if keys is None else keys values = keys if values is None else values if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): raise ValueError( "Inputs: quries, keys and values should all be 3-D tensors.") def __compute_qkv(queries, keys, values, n_head, d_key, d_value): """ Add linear projection to queries, keys, and values. """ q = layers.fc(input=queries, size=d_key * n_head, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=name + '_query_fc.w_0', initializer=param_initializer), bias_attr=name + '_query_fc.b_0') k = layers.fc(input=keys, size=d_key * n_head, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=name + '_key_fc.w_0', initializer=param_initializer), bias_attr=name + '_key_fc.b_0') v = layers.fc(input=values, size=d_value * n_head, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=name + '_value_fc.w_0', initializer=param_initializer), bias_attr=name + '_value_fc.b_0') return q, k, v def __split_heads(x, n_head): """ Reshape the last dimension of inpunt tensor x so that it becomes two dimensions and then transpose. Specifically, input a tensor with shape [bs, max_sequence_length, n_head * hidden_dim] then output a tensor with shape [bs, n_head, max_sequence_length, hidden_dim]. """ hidden_size = x.shape[-1] # The value 0 in shape attr means copying the corresponding dimension # size of the input as the output dimension size. reshaped = layers.reshape( x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) # permuate the dimensions into: # [batch_size, n_head, max_sequence_len, hidden_size_per_head] return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) def __combine_heads(x): """ Transpose and then reshape the last two dimensions of inpunt tensor x so that it becomes one dimension, which is reverse to __split_heads. 
""" if len(x.shape) == 3: return x if len(x.shape) != 4: raise ValueError("Input(x) should be a 4-D Tensor.") trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) # The value 0 in shape attr means copying the corresponding dimension # size of the input as the output dimension size. return layers.reshape( x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): """ Scaled Dot-Product Attention """ scaled_q = layers.scale(x=q, scale=d_key**-0.5) product = layers.matmul(x=scaled_q, y=k, transpose_y=True) if attn_bias: product += attn_bias weights = layers.softmax(product) if dropout_rate: weights = layers.dropout( weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) out = layers.matmul(weights, v) return out q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) if cache is not None: # use cache and concat time steps # Since the inplace reshape in __split_heads changes the shape of k and # v, which is the cache input for next time step, reshape the cache # input from the previous time step first. k = cache["k"] = layers.concat( [layers.reshape( cache["k"], shape=[0, 0, d_model]), k], axis=1) v = cache["v"] = layers.concat( [layers.reshape( cache["v"], shape=[0, 0, d_model]), v], axis=1) q = __split_heads(q, n_head) k = __split_heads(k, n_head) v = __split_heads(v, n_head) ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) out = __combine_heads(ctx_multiheads) # Project back to the model size. proj_out = layers.fc(input=out, size=d_model, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=name + '_output_fc.w_0', initializer=param_initializer), bias_attr=name + '_output_fc.b_0') return proj_out def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'): """ Position-wise Feed-Forward Networks. This module consists of two linear transformations with a ReLU activation in between, which is applied to each position separately and identically. """ hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act, param_attr=fluid.ParamAttr( name=name + '_fc_0.w_0', initializer=param_initializer), bias_attr=name + '_fc_0.b_0') if dropout_rate: hidden = layers.dropout( hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=name + '_fc_1.w_0', initializer=param_initializer), bias_attr=name + '_fc_1.b_0') return out def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''): """ Add residual connection, layer normalization and droput to the out tensor optionally according to the value of process_cmd. This will be used before or after multi-head attention and position-wise feed-forward networks. 
""" for cmd in process_cmd: if cmd == "a": # add residual connection out = out + prev_out if prev_out else out elif cmd == "n": # add layer normalization out_dtype = out.dtype if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float32") out = layer_norm( out, begin_norm_axis=len(out.shape) - 1, param_attr=fluid.ParamAttr( name=name + '_layer_norm_scale', initializer=fluid.initializer.Constant(1.)), bias_attr=fluid.ParamAttr( name=name + '_layer_norm_bias', initializer=fluid.initializer.Constant(0.))) if out_dtype == fluid.core.VarDesc.VarType.FP16: out = layers.cast(x=out, dtype="float16") elif cmd == "d": # add dropout if dropout_rate: out = layers.dropout( out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False) return out pre_process_layer = partial(pre_post_process_layer, None) post_process_layer = pre_post_process_layer def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da", param_initializer=None, name=''): """The encoder layers that can be stacked to form a deep encoder. This module consits of a multi-head (self) attention followed by position-wise feed-forward networks and both the two components companied with the post_process_layer to add residual connection, layer normalization and droput. """ attn_output = multi_head_attention( pre_process_layer( enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'), None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout, param_initializer=param_initializer, name=name + '_multi_head_att') attn_output = post_process_layer( enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att') ffd_output = positionwise_feed_forward( pre_process_layer( attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'), d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer, name=name + '_ffn') return post_process_layer( attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn') def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da", param_initializer=None, name=''): """ The encoder is composed of a stack of identical layers returned by calling encoder_layer. """ for i in range(n_layer): enc_output = encoder_layer( enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd, postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i)) enc_input = enc_output enc_output = pre_process_layer( enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") return enc_output ================================================ FILE: paddlepalm/distribute/__init__.py ================================================ from paddle import fluid import os import multiprocessing gpu_dev_count = int(fluid.core.get_cuda_device_count()) cpu_dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) from .reader import yield_pieces, data_feeder, decode_fake ================================================ FILE: paddlepalm/distribute/reader.py ================================================ from . 
import gpu_dev_count, cpu_dev_count try: import queue as Queue except ImportError: import Queue from threading import Thread dev_count = gpu_dev_count if gpu_dev_count > 0 else cpu_dev_count def yield_pieces(data, distribute_strategy, batch_size): """ Args: distribute_strategy: supported strategies: s=split, c=copy, u=unstack """ assert batch_size % dev_count == 0, "batch_size must be an integer multiple of dev_count." # print('data in yield pieces') # print(len(data)) assert type(data) == type(distribute_strategy), [type(data), type(distribute_strategy)] assert len(data) == len(distribute_strategy), [len(data), len(distribute_strategy)] if isinstance(data, dict): keys = list(data.keys()) data_list = [data[i] for i in keys] ds_list = [distribute_strategy[i] for i in keys] else: assert isinstance(data, list), "the input data must be a list or a dict containing multiple tensors." data_list = data ds_list = distribute_strategy stride = batch_size // dev_count p = stride # while p < len(data_list) + stride: while p <= batch_size: temp = [] for d, s in zip(data_list, ds_list): s = s.strip().lower() if s == 's' or s == 'split': if p - stride >= len(d): # print('WARNING: no more examples to feed empty devices') temp = [] return temp.append(d[p-stride:p]) elif s == 'u' or s == 'unstack': assert len(d) <= dev_count, 'Tensor size on dim 0 must be less than or equal to dev_count when unstack is applied.' if p//stride > len(d): # print('WARNING: no more examples to feed empty devices') return temp.append(d[p//stride-1]) elif s == 'c' or s == 'copy': temp.append(d) else: raise NotImplementedError() p += stride if type(data) == dict: yield dict(zip(*[keys, temp])) else: # print('yielded pieces') # print(len(temp)) yield temp def data_feeder(reader, postprocess_fn=None, prefetch_steps=2, phase='train', is_multi=False): if postprocess_fn is None: def postprocess_fn(batch, id=-1, phase='train', is_multi=False): return batch def worker(reader, dev_count, queue): dev_batches = [] for index, data in enumerate(reader()): if len(dev_batches) < dev_count: dev_batches.append(data) if len(dev_batches) == dev_count: queue.put((dev_batches, 0)) dev_batches = [] # For the remaining batches at prediction time, pad up to the number of # devices; the padded samples are removed from the prediction outputs.
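# e.g. (illustrative numbers, not from the source): with dev_count=4 and 2 batches
# left over, the last batch is duplicated twice so all 4 devices can run; decode_fake
# below computes how many of the resulting predictions are padding to be discarded.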
if len(dev_batches) > 0: num_pad = dev_count - len(dev_batches) for i in range(len(dev_batches), dev_count): dev_batches.append(dev_batches[-1]) queue.put((dev_batches, num_pad)) queue.put(None) queue = Queue.Queue(dev_count*prefetch_steps) p = Thread( target=worker, args=(reader, dev_count, queue)) p.daemon = True p.start() while True: ret = queue.get() queue.task_done() if ret is not None: batches, num_pad = ret if dev_count > 1 and phase == 'train' and is_multi: id = batches[0]['__task_id'][0] else: id = -1 batch_buf = [] flag_buf = [] for idx, batch in enumerate(batches): # flag marks whether this device's batch holds real data (True) or padding (False) # flag = num_pad == 0 flag = idx-len(batches) < -num_pad # if num_pad > 0: # num_pad -= 1 batch = postprocess_fn(batch, id, phase, is_multi=is_multi) # batch = postprocess_fn(batch) batch_buf.append(batch) flag_buf.append(flag) yield batch_buf, flag_buf else: break queue.join() def decode_fake(nums, mask, bs): # given the total number of decoded samples (nums), the per-device validity flags (mask) # and the global batch size (bs), return how many trailing samples are padding and # should be stripped from the prediction outputs bs //= dev_count n_t = 0 for flag in mask: if not flag: break n_t = n_t + 1 n_f = len(mask) - n_t p1 = nums - (n_t-1) * bs assert p1 % (n_f+1) == 0 each_f = p1 // (n_f+1) return each_f * n_f ================================================ FILE: paddlepalm/downloader.py ================================================ from ._downloader import * ================================================ FILE: paddlepalm/head/__init__.py ================================================ from .cls import Classify from .match import Match from .ner import SequenceLabel from .mrc import MRC from .mlm import MaskLM ================================================ FILE: paddlepalm/head/base_head.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License.
import os import json import copy class Head(object): def __init__(self, phase='train'): """Constructs a task head. The constructor must take at least a `phase` argument. Note: a subclass constructor must call this base constructor so that the framework's built-in member variables are created. Args: phase: str. Distinguishes the running stage the head is invoked in; currently the training stage 'train' and the prediction stage 'predict' are supported. """ self._stop_gradient = {} self._phase = phase self._prog = None self._results_buffer = [] @property def inputs_attrs(self): """Step-level declaration of the task inputs. Describes the output objects from the reader, the backbone and other task heads that this head depends on (fetched once per step). Described with a dict whose keys are the components the objects come from (e.g. 'reader', 'backbone') and whose values are the sets of objects this head needs from that component. Each object set is itself a dict, whose keys are the object names (each name must exist among the outputs of the corresponding component) and whose values are the shape and dtype of the object. Dimensions of variable length are set to -1. Return: dict. The step-level inputs this head depends on, i.e. the output objects of the other components.""" raise NotImplementedError() @property def outputs_attr(self): """Step-level declaration of the task outputs. Describes the objects this head outputs once per step, including each object's name, shape and dtype. The output objects are added to the fetch_list, so their runtime values are available at every training/inference step and can be passed to the batch_postprocess method for per-step postprocessing. For objects of scalar type (str, int, float, ...) the shape is an empty list []; dimensions of variable length are set to -1. Return: dict. The objects this head produces. Note that in the training phase an output object named loss is mandatory. """ raise NotImplementedError() @property def epoch_inputs_attrs(self): """Epoch-level declaration of the task inputs. Describes the output objects from the reader, the backbone and other task heads that this head depends on once per epoch (e.g. the complete sample set, the number of valid samples). Described with a dict whose keys are the components the objects come from (e.g. 'reader', 'backbone') and whose values are the sets of objects this head needs from that component. Each object set is itself a dict, whose keys are the object names (each name must exist among the outputs of the corresponding component) and whose values are the shape and dtype of the object. Dimensions of variable length are set to -1. Return: dict. The epoch-level inputs this head depends on. """ return {} def build(self, inputs, scope_name=""): """Builds the computation graph of the task head, mapping static-graph Variables from the object sets described by inputs_attrs into output Variables that match outputs_attr. Args: inputs: dict. Maps the object names in inputs_attrs to computation-graph Variables; inputs contains at least the objects defined in inputs_attrs. Return: the computation-graph variables to output. The output objects are added to the fetch_list, so their runtime values are computed at every training/inference step and passed to the postprocess method for user handling. """ raise NotImplementedError() def batch_postprocess(self, rt_outputs): """Batch/step-level postprocessing. Called after every training or inference step with the runtime values of this head's output objects for the current batch. By default the results are stored into the buffer self._results_buffer.""" if isinstance(rt_outputs, dict): keys = rt_outputs.keys() vals = [rt_outputs[k] for k in keys] lens = [len(v) for v in vals] if len(set(lens)) == 1: results = [dict(zip(*[keys, i])) for i in zip(*vals)] self._results_buffer.extend(results) return results else: print('WARNING: irregular output results. visualize failed.') self._results_buffer.append(rt_outputs) return None def reset(self): """Clears this head's buffer of results accumulated during training or inference.""" self._results_buffer = [] def get_results(self): """Returns the results this head has accumulated so far.""" return copy.deepcopy(self._results_buffer) def epoch_postprocess(self, post_inputs=None, output_dir=None): """Epoch-level postprocessing. Called after every training or inference epoch to postprocess the accumulated per-sample results. By default, when output_dir is None the accumulated results are returned directly; when output_dir is given, the results are stored in that folder, with the phase of this head as the file name. Args: post_inputs: carries the contents of the corresponding input variables when the declared epoch_inputs_attrs is non-empty. output_dir: path for saving the accumulated results. """ if output_dir is not None: if not os.path.exists(output_dir): os.makedirs(output_dir) with open(os.path.join(output_dir, self._phase), 'w') as writer: for i in self._results_buffer: writer.write(json.dumps(i)+'\n') else: return self._results_buffer ================================================ FILE: paddlepalm/head/cls.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import paddle.fluid as fluid from paddle.fluid import layers from paddlepalm.head.base_head import Head import numpy as np import os import json class Classify(Head): """ classification """ def __init__(self, num_classes, input_dim, dropout_prob=0.0, \ param_initializer_range=0.02, phase='train'): self._is_training = phase == 'train' self._hidden_size = input_dim self.num_classes = num_classes self._dropout_prob = dropout_prob if phase == 'train' else 0.0 self._param_initializer = fluid.initializer.TruncatedNormal( scale=param_initializer_range) self._preds = [] self._probs = [] @property def inputs_attrs(self): reader = {} bb = {"sentence_embedding": [[-1, self._hidden_size], 'float32']} if self._is_training: reader["label_ids"] = [[-1], 'int64'] return {'reader': reader, 'backbone': bb} @property def outputs_attrs(self): if self._is_training: return {'loss': [[1], 'float32']} else: return {'logits': [[-1, self.num_classes], 'float32'], 'probs': [[-1, self.num_classes], 'float32']} def build(self, inputs, scope_name=''): sent_emb = inputs['backbone']['sentence_embedding'] if self._is_training: label_ids = inputs['reader']['label_ids'] # apply dropout to the sentence embedding during training sent_emb = fluid.layers.dropout( x=sent_emb, dropout_prob=self._dropout_prob, dropout_implementation="upscale_in_train") logits = fluid.layers.fc( input=sent_emb, size=self.num_classes, param_attr=fluid.ParamAttr( name=scope_name+"cls_out_w", initializer=self._param_initializer), bias_attr=fluid.ParamAttr( name=scope_name+"cls_out_b", initializer=fluid.initializer.Constant(0.))) probs = fluid.layers.softmax(logits) if self._is_training: loss = fluid.layers.cross_entropy( input=probs, label=label_ids) loss = layers.mean(loss) return {"loss": loss} else: return {"logits":logits, "probs":probs} def batch_postprocess(self, rt_outputs): if not self._is_training: logits = rt_outputs['logits'] probs = rt_outputs['probs'] self._preds.extend(logits.tolist()) self._probs.extend(probs.tolist()) def epoch_postprocess(self, post_inputs, output_dir=None): # no epoch-level inputs are declared in epoch_inputs_attrs, so post_inputs is empty here if not self._is_training: results = [] for i in range(len(self._preds)): label = int(np.argmax(np.array(self._preds[i]))) result = {'index': i, 'label': label, 'logits': self._preds[i], 'probs': self._probs[i]} results.append(result) if output_dir is not None: with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer: for result in results: result = json.dumps(result) writer.write(result+'\n') print('Predictions saved at '+os.path.join(output_dir, 'predictions.json')) return results ================================================ FILE: paddlepalm/head/match.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import paddle.fluid as fluid from paddle.fluid import layers from paddlepalm.head.base_head import Head import numpy as np import os import json def computeHingeLoss(pos, neg, margin): loss_part1 = fluid.layers.elementwise_sub( fluid.layers.fill_constant_batch_size_like( input=pos, shape=[-1, 1], value=margin, dtype='float32'), pos) loss_part2 = fluid.layers.elementwise_add(loss_part1, neg) loss_part3 = fluid.layers.elementwise_max( fluid.layers.fill_constant_batch_size_like( input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'), loss_part2) return loss_part3 class Match(Head): ''' matching ''' def __init__(self, num_classes, input_dim, dropout_prob=0.0, param_initializer_range=0.02, \ learning_strategy='pointwise', margin=0.5, phase='train'): """ Args: phase: train, eval, pred learning_strategy: pointwise, pairwise """ self._is_training = phase == 'train' self._hidden_size = input_dim self._num_classes = num_classes self._dropout_prob = dropout_prob if phase == 'train' else 0.0 self._param_initializer = fluid.initializer.TruncatedNormal( scale=param_initializer_range) self._learning_strategy = learning_strategy self._margin = margin self._preds = [] self._preds_logits = [] @property def inputs_attrs(self): reader = {} bb = {"sentence_pair_embedding": [[-1, self._hidden_size], 'float32']} if self._is_training: if self._learning_strategy == 'pointwise': reader["label_ids"] = [[-1], 'int64'] elif self._learning_strategy == 'pairwise': bb["sentence_pair_embedding_neg"] = [[-1, self._hidden_size], 'float32'] return {'reader': reader, 'backbone': bb} @property def outputs_attrs(self): if self._is_training: return {"loss": [[1], 'float32']} else: if self._learning_strategy=='pairwise': return {"probs": [[-1, 1], 'float32']} else: return {"logits": [[-1, self._num_classes], 'float32'], "probs": [[-1, self._num_classes], 'float32']} def build(self, inputs, scope_name=""): # inputs cls_feats = inputs["backbone"]["sentence_pair_embedding"] if self._is_training: cls_feats = fluid.layers.dropout( x=cls_feats, dropout_prob=self._dropout_prob, dropout_implementation="upscale_in_train") if self._learning_strategy == 'pairwise': cls_feats_neg = inputs["backbone"]["sentence_pair_embedding_neg"] cls_feats_neg = fluid.layers.dropout( x=cls_feats_neg, dropout_prob=self._dropout_prob, dropout_implementation="upscale_in_train") elif self._learning_strategy == 'pointwise': labels = inputs["reader"]["label_ids"] # loss # for pointwise if self._learning_strategy == 'pointwise': logits = fluid.layers.fc( input=cls_feats, size=self._num_classes, param_attr=fluid.ParamAttr( name=scope_name+"cls_out_w", initializer=self._param_initializer), bias_attr=fluid.ParamAttr( name=scope_name+"cls_out_b", initializer=fluid.initializer.Constant(0.))) probs = fluid.layers.softmax(logits) if self._is_training: ce_loss = fluid.layers.cross_entropy( input=probs, label=labels) loss = fluid.layers.mean(x=ce_loss) return {'loss': loss} # for pred else: return {'logits': logits, 'probs': probs} # for pairwise elif self._learning_strategy == 'pairwise': pos_score = fluid.layers.fc( input=cls_feats, size=1, act =
"sigmoid", param_attr=fluid.ParamAttr( name=scope_name+"cls_out_w_pr", initializer=self._param_initializer), bias_attr=fluid.ParamAttr( name=scope_name+"cls_out_b_pr", initializer=fluid.initializer.Constant(0.))) pos_score = fluid.layers.reshape(x=pos_score, shape=[-1, 1], inplace=True) if self._is_training: neg_score = fluid.layers.fc( input=cls_feats_neg, size=1, act = "sigmoid", param_attr=fluid.ParamAttr( name=scope_name+"cls_out_w_pr", initializer=self._param_initializer), bias_attr=fluid.ParamAttr( name=scope_name+"cls_out_b_pr", initializer=fluid.initializer.Constant(0.))) neg_score = fluid.layers.reshape(x=neg_score, shape=[-1, 1], inplace=True) loss = fluid.layers.mean(computeHingeLoss(pos_score, neg_score, self._margin)) return {'loss': loss} # for pred else: return {'probs': pos_score} def batch_postprocess(self, rt_outputs): if not self._is_training: probs = [] logits = [] probs = rt_outputs['probs'] self._preds.extend(probs.tolist()) if self._learning_strategy == 'pointwise': logits = rt_outputs['logits'] self._preds_logits.extend(logits.tolist()) def reset(self): self._preds_logits = [] self._preds = [] def epoch_postprocess(self, post_inputs, output_dir=None): # there is no post_inputs needed and not declared in epoch_inputs_attrs, hence no elements exist in post_inputs if not self._is_training: results = [] for i in range(len(self._preds)): if self._learning_strategy == 'pointwise': label = int(np.argmax(np.array(self._preds[i]))) result = {'index': i, 'label': label, 'logits': self._preds_logits[i], 'probs': self._preds[i]} elif self._learning_strategy == 'pairwise': result = {'index': i, 'probs': self._preds[i][0]} results.append(result) if output_dir is not None: with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer: for result in results: result = json.dumps(result, ensure_ascii=False) writer.write(result+'\n') print('Predictions saved at '+os.path.join(output_dir, 'predictions.json')) return results ================================================ FILE: paddlepalm/head/mlm.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
import paddle.fluid as fluid
from paddlepalm.head.base_head import Head
from paddle.fluid import layers
import numpy as np
import os
import json  # required by epoch_postprocess below
from paddlepalm.backbone.utils.transformer import pre_process_layer

class MaskLM(Head):
    '''
    mlm
    '''
    def __init__(self, input_dim, vocab_size, hidden_act, dropout_prob=0.0, \
                 param_initializer_range=0.02, phase='train'):
        self._is_training = phase == 'train'
        self._emb_size = input_dim
        self._hidden_size = input_dim
        self._dropout_prob = dropout_prob if phase == 'train' else 0.0
        self._preds = []
        self._vocab_size = vocab_size
        self._hidden_act = hidden_act
        self._initializer_range = param_initializer_range

    @property
    def inputs_attrs(self):
        reader = {
            "mask_label": [[-1], 'int64'],
            "mask_pos": [[-1], 'int64'],
        }
        if not self._is_training:
            del reader['mask_label']
        bb = {
            "encoder_outputs": [[-1, -1, self._hidden_size], 'float32'],
            "embedding_table": [[-1, self._vocab_size, self._emb_size], 'float32']}
        return {'reader': reader, 'backbone': bb}

    @property
    def outputs_attrs(self):
        if self._is_training:
            return {"loss": [[1], 'float32']}
        else:
            return {"logits": [[-1], 'float32']}

    def build(self, inputs, scope_name=""):
        mask_pos = inputs["reader"]["mask_pos"]
        word_emb = inputs["backbone"]["embedding_table"]
        enc_out = inputs["backbone"]["encoder_outputs"]
        if self._is_training:
            mask_label = inputs["reader"]["mask_label"]
            l1 = enc_out.shape[0]
            l2 = enc_out.shape[1]
            bxs = fluid.layers.fill_constant(shape=[1], value=l1*l2, dtype='int64')
            max_position = bxs - 1
            mask_pos = fluid.layers.elementwise_min(mask_pos, max_position)
            mask_pos.stop_gradient = True

        emb_size = word_emb.shape[-1]
        _param_initializer = fluid.initializer.TruncatedNormal(
            scale=self._initializer_range)
        reshaped_emb_out = fluid.layers.reshape(
            x=enc_out, shape=[-1, emb_size])

        # extract masked tokens' feature
        mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)

        # transform: fc
        mask_trans_feat = fluid.layers.fc(
            input=mask_feat,
            size=emb_size,
            act=self._hidden_act,
            param_attr=fluid.ParamAttr(
                name=scope_name+'mask_lm_trans_fc.w_0',
                initializer=_param_initializer),
            bias_attr=fluid.ParamAttr(name=scope_name+'mask_lm_trans_fc.b_0'))
        # transform: layer norm
        mask_trans_feat = pre_process_layer(
            mask_trans_feat, 'n', name=scope_name+'mask_lm_trans')

        mask_lm_out_bias_attr = fluid.ParamAttr(
            name=scope_name+"mask_lm_out_fc.b_0",
            initializer=fluid.initializer.Constant(value=0.0))

        fc_out = fluid.layers.matmul(
            x=mask_trans_feat,
            y=word_emb,
            transpose_y=True)
        fc_out += fluid.layers.create_parameter(
            shape=[self._vocab_size],
            dtype='float32',
            attr=mask_lm_out_bias_attr,
            is_bias=True)

        if self._is_training:
            probs = fluid.layers.softmax(fc_out)  # renamed to avoid shadowing the `inputs` argument
            mask_lm_loss = fluid.layers.cross_entropy(
                input=probs, label=mask_label)
            loss = fluid.layers.mean(mask_lm_loss)
            return {'loss': loss}
        else:
            return {'logits': fc_out}

    def batch_postprocess(self, rt_outputs):
        if not self._is_training:
            logits = rt_outputs['logits']
            preds = np.argmax(logits, -1)
            self._preds.extend(preds.tolist())
            return preds

    def epoch_postprocess(self, post_inputs, output_dir=None):
        # nothing is declared in epoch_inputs_attrs, so post_inputs arrives empty here
        if not self._is_training:
            results = []
            for i in range(len(self._preds)):
                result = {'index': i, 'word_id': self._preds[i]}
                results.append(result)
            if output_dir is not None:
                with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:
                    for result in results:
                        result = json.dumps(result)
                        writer.write(result+'\n')
                print('Predictions saved at
'+os.path.join(output_dir, 'predictions.json')) return results ================================================ FILE: paddlepalm/head/mrc.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import paddle.fluid as fluid from paddlepalm.head.base_head import Head import collections import numpy as np import os import math import six import paddlepalm.tokenizer.ernie_tokenizer as tokenization import json import io RawResult = collections.namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"]) class MRC(Head): """ Machine Reading Comprehension """ def __init__(self, max_query_len, input_dim, pred_output_path=None, verbose=False, with_negative=False, do_lower_case=False, max_ans_len=None, null_score_diff_threshold=0.0, n_best_size=20, phase='train'): self._is_training = phase == 'train' self._hidden_size = input_dim self._max_sequence_length = max_query_len self._pred_results = [] output_dir = pred_output_path self._max_answer_length = max_ans_len self._null_score_diff_threshold = null_score_diff_threshold self._n_best_size = n_best_size output_dir = pred_output_path self._verbose = verbose self._with_negative = with_negative self._do_lower_case = do_lower_case @property def inputs_attrs(self): if self._is_training: reader = {"start_positions": [[-1], 'int64'], "end_positions": [[-1], 'int64'], } else: reader = {'unique_ids': [[-1], 'int64']} bb = {"encoder_outputs": [[-1, -1, self._hidden_size], 'float32']} return {'reader': reader, 'backbone': bb} @property def epoch_inputs_attrs(self): if not self._is_training: from_reader = {'examples': None, 'features': None} return {'reader': from_reader} @property def outputs_attr(self): if self._is_training: return {'loss': [[1], 'float32']} else: return {'start_logits': [[-1, -1, 1], 'float32'], 'end_logits': [[-1, -1, 1], 'float32'], 'unique_ids': [[-1], 'int64']} def build(self, inputs, scope_name=""): if self._is_training: start_positions = inputs['reader']['start_positions'] end_positions = inputs['reader']['end_positions'] # max_position = inputs["reader"]["seqlen"] - 1 # start_positions = fluid.layers.elementwise_min(start_positions, max_position) # end_positions = fluid.layers.elementwise_min(end_positions, max_position) start_positions.stop_gradient = True end_positions.stop_gradient = True else: unique_id = inputs['reader']['unique_ids'] # It's used to help fetch variable 'unique_ids' that will be removed in the future helper_constant = fluid.layers.fill_constant(shape=[1], value=1, dtype='int64') fluid.layers.elementwise_mul(unique_id, helper_constant) enc_out = inputs['backbone']['encoder_outputs'] logits = fluid.layers.fc( input=enc_out, size=2, num_flatten_dims=2, param_attr=fluid.ParamAttr( name=scope_name+"cls_squad_out_w", initializer=fluid.initializer.TruncatedNormal(scale=0.02)), bias_attr=fluid.ParamAttr( name=scope_name+"cls_squad_out_b", initializer=fluid.initializer.Constant(0.))) logits = 
fluid.layers.transpose(x=logits, perm=[2, 0, 1])
        start_logits, end_logits = fluid.layers.unstack(x=logits, axis=0)

        def _compute_single_loss(logits, positions):
            """Compute start/end loss for mrc model"""
            inputs = fluid.layers.softmax(logits)
            loss = fluid.layers.cross_entropy(
                input=inputs, label=positions)
            loss = fluid.layers.mean(x=loss)
            return loss

        if self._is_training:
            start_loss = _compute_single_loss(start_logits, start_positions)
            end_loss = _compute_single_loss(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2.0
            return {'loss': total_loss}
        else:
            return {'start_logits': start_logits,
                    'end_logits': end_logits,
                    'unique_ids': unique_id}

    def batch_postprocess(self, rt_outputs):
        """this func will be called after each step(batch) of training/evaluating/predicting process."""
        if not self._is_training:
            unique_ids = rt_outputs['unique_ids']
            start_logits = rt_outputs['start_logits']
            end_logits = rt_outputs['end_logits']
            for idx in range(len(unique_ids)):
                if unique_ids[idx] < 0:
                    continue
                if len(self._pred_results) % 1000 == 0:
                    print("Predicting example: {}".format(len(self._pred_results)))
                uid = int(unique_ids[idx])
                s = [float(x) for x in start_logits[idx].flat]
                e = [float(x) for x in end_logits[idx].flat]
                self._pred_results.append(
                    RawResult(
                        unique_id=uid,
                        start_logits=s,
                        end_logits=e))

    def epoch_postprocess(self, post_inputs, output_dir=None):
        """(optional interface) this func will be called after evaluation/predicting process and each epoch during training process."""
        if not self._is_training:
            if output_dir is not None:
                examples = post_inputs['reader']['examples']
                features = post_inputs['reader']['features']
                if not os.path.exists(output_dir):
                    os.makedirs(output_dir)
                output_prediction_file = os.path.join(output_dir, "predictions.json")
                output_nbest_file = os.path.join(output_dir, "nbest_predictions.json")
                output_null_log_odds_file = os.path.join(output_dir, "null_odds.json")
                _write_predictions(examples, features, self._pred_results,
                                   self._n_best_size, self._max_answer_length,
                                   self._do_lower_case, output_prediction_file,
                                   output_nbest_file, output_null_log_odds_file,
                                   self._with_negative,
                                   self._null_score_diff_threshold, self._verbose)
            return self._pred_results

def _write_predictions(all_examples, all_features, all_results, n_best_size,
                       max_answer_length, do_lower_case, output_prediction_file,
                       output_nbest_file, output_null_log_odds_file,
                       with_negative, null_score_diff_threshold, verbose):
    """Write final predictions to the json file and log-odds of null if needed."""
    print("Writing predictions to: %s" % (output_prediction_file))
    print("Writing nbest to: %s" % (output_nbest_file))

    example_index_to_features = collections.defaultdict(list)
    for feature in all_features:
        example_index_to_features[feature.example_index].append(feature)

    unique_id_to_result = {}
    for result in all_results:
        unique_id_to_result[result.unique_id] = result

    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
        "PrelimPrediction", [
            "feature_index", "start_index", "end_index", "start_logit",
            "end_logit"
        ])

    all_predictions = collections.OrderedDict()
    all_nbest_json = collections.OrderedDict()
    scores_diff_json = collections.OrderedDict()

    for (example_index, example) in enumerate(all_examples):
        features = example_index_to_features[example_index]

        prelim_predictions = []
        # keep track of the minimum score of null start+end of position 0
        score_null = 1000000  # large and positive
        min_null_feature_index = 0  # the paragraph slice with min null score
        null_start_logit = 0  # the start logit at the slice with min null score
        null_end_logit = 0  # the end logit at the slice with min null score
        for (feature_index, feature) in enumerate(features):
            result = unique_id_to_result[feature.unique_id]
            start_indexes = _get_best_indexes(result.start_logits, n_best_size)
            end_indexes = _get_best_indexes(result.end_logits, n_best_size)
            # if we could have irrelevant answers, get the min score of irrelevant
            if with_negative:
                feature_null_score = result.start_logits[0] + result.end_logits[0]
                if feature_null_score < score_null:
                    score_null = feature_null_score
                    min_null_feature_index = feature_index
                    null_start_logit = result.start_logits[0]
                    null_end_logit = result.end_logits[0]
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # We could hypothetically create invalid predictions, e.g., predict
                    # that the start of the span is in the question. We throw out all
                    # invalid predictions.
                    if start_index >= len(feature.tokens):
                        continue
                    if end_index >= len(feature.tokens):
                        continue
                    if start_index not in feature.token_to_orig_map:
                        continue
                    if end_index not in feature.token_to_orig_map:
                        continue
                    if not feature.token_is_max_context.get(start_index, False):
                        continue
                    if end_index < start_index:
                        continue
                    length = end_index - start_index + 1
                    if length > max_answer_length:
                        continue
                    prelim_predictions.append(
                        _PrelimPrediction(
                            feature_index=feature_index,
                            start_index=start_index,
                            end_index=end_index,
                            start_logit=result.start_logits[start_index],
                            end_logit=result.end_logits[end_index]))

        if with_negative:
            prelim_predictions.append(
                _PrelimPrediction(
                    feature_index=min_null_feature_index,
                    start_index=0,
                    end_index=0,
                    start_logit=null_start_logit,
                    end_logit=null_end_logit))
        prelim_predictions = sorted(
            prelim_predictions,
            key=lambda x: (x.start_logit + x.end_logit),
            reverse=True)

        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
            "NbestPrediction", ["text", "start_logit", "end_logit"])

        seen_predictions = {}
        nbest = []
        for pred in prelim_predictions:
            if len(nbest) >= n_best_size:
                break
            feature = features[pred.feature_index]
            if pred.start_index > 0:  # this is a non-null prediction
                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1)]
                orig_doc_start = feature.token_to_orig_map[pred.start_index]
                orig_doc_end = feature.token_to_orig_map[pred.end_index]
                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + 1)]
                tok_text = " ".join(tok_tokens)

                # De-tokenize WordPieces that have been split off.
                tok_text = tok_text.replace(" ##", "")
                tok_text = tok_text.replace("##", "")

                # Clean whitespace
                tok_text = tok_text.strip()
                tok_text = " ".join(tok_text.split())
                orig_text = " ".join(orig_tokens)

                final_text = _get_final_text(tok_text, orig_text, do_lower_case,
                                             verbose)
                if final_text in seen_predictions:
                    continue

                seen_predictions[final_text] = True
            else:
                final_text = ""
                seen_predictions[final_text] = True

            nbest.append(
                _NbestPrediction(
                    text=final_text,
                    start_logit=pred.start_logit,
                    end_logit=pred.end_logit))

        # if we didn't include the empty option in the n-best, include it
        if with_negative:
            if "" not in seen_predictions:
                nbest.append(
                    _NbestPrediction(
                        text="",
                        start_logit=null_start_logit,
                        end_logit=null_end_logit))
        # In very rare edge cases we could have no valid predictions. So we
        # just create a nonce prediction in this case to avoid failure.
        if not nbest:
            nbest.append(
                _NbestPrediction(
                    text="empty", start_logit=0.0, end_logit=0.0))

        assert len(nbest) >= 1

        total_scores = []
        best_non_null_entry = None
        for entry in nbest:
            total_scores.append(entry.start_logit + entry.end_logit)
            if not best_non_null_entry:
                if entry.text:
                    best_non_null_entry = entry
        # debug
        if best_non_null_entry is None:
            print("Warning: no non-null best entry was found for this example.")

        probs = _compute_softmax(total_scores)

        nbest_json = []
        for (i, entry) in enumerate(nbest):
            output = collections.OrderedDict()
            output["text"] = entry.text.encode('utf-8').decode('utf-8')
            output["probability"] = probs[i]
            output["start_logit"] = entry.start_logit
            output["end_logit"] = entry.end_logit
            nbest_json.append(output)

        assert len(nbest_json) >= 1

        if not with_negative:
            all_predictions[example.qas_id] = nbest_json[0]["text"]
        else:
            # predict "" iff the null score - the score of best non-null > threshold
            score_diff = score_null - best_non_null_entry.start_logit - (
                best_non_null_entry.end_logit)
            scores_diff_json[example.qas_id] = score_diff
            if score_diff > null_score_diff_threshold:
                all_predictions[example.qas_id] = ""
            else:
                all_predictions[example.qas_id] = best_non_null_entry.text

        all_nbest_json[example.qas_id] = nbest_json

    with io.open(output_prediction_file, "w", encoding='utf-8') as writer:
        writer.write(json.dumps(all_predictions, indent=4, ensure_ascii=False) + "\n")

    with io.open(output_nbest_file, "w", encoding='utf-8') as writer:
        writer.write(json.dumps(all_nbest_json, indent=4, ensure_ascii=False) + "\n")

    if with_negative:
        with io.open(output_null_log_odds_file, "w", encoding='utf-8') as writer:
            writer.write(json.dumps(scores_diff_json, indent=4, ensure_ascii=False) + "\n")

def _get_final_text(pred_text, orig_text, do_lower_case, verbose):
    """Project the tokenized prediction back to the original text."""

    # When we created the data, we kept track of the alignment between original
    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So
    # now `orig_text` contains the span of our original text corresponding to the
    # span that we predicted.
    #
    # However, `orig_text` may contain extra characters that we don't want in
    # our prediction.
    #
    # For example, let's say:
    #   pred_text = steve smith
    #   orig_text = Steve Smith's
    #
    # We don't want to return `orig_text` because it contains the extra "'s".
    #
    # We don't want to return `pred_text` because it's already been normalized
    # (the MRQA eval script also does punctuation stripping/lower casing but
    # our tokenizer does additional normalization like stripping accent
    # characters).
    #
    # What we really want to return is "Steve Smith".
    #
    # Therefore, we have to apply a semi-complicated alignment heuristic between
    # `pred_text` and `orig_text` to get a character-to-character alignment. This
    # can fail in certain cases in which case we just return `orig_text`.

    def _strip_spaces(text):
        ns_chars = []
        ns_to_s_map = collections.OrderedDict()
        for (i, c) in enumerate(text):
            if c == " ":
                continue
            ns_to_s_map[len(ns_chars)] = i
            ns_chars.append(c)
        ns_text = "".join(ns_chars)
        return (ns_text, ns_to_s_map)

    # We first tokenize `orig_text`, strip whitespace from the result
    # and `pred_text`, and check if they are the same length. If they are
    # NOT the same length, the heuristic has failed. If they are the same
    # length, we assume the characters are one-to-one aligned.
tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case) tok_text = " ".join(tokenizer.tokenize(orig_text)) start_position = tok_text.find(pred_text) if start_position == -1: if verbose: print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text)) return orig_text end_position = start_position + len(pred_text) - 1 (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) if len(orig_ns_text) != len(tok_ns_text): if verbose: print("Length not equal after stripping spaces: '%s' vs '%s'", orig_ns_text, tok_ns_text) return orig_text # We then project the characters in `pred_text` back to `orig_text` using # the character-to-character alignment. tok_s_to_ns_map = {} for (i, tok_index) in six.iteritems(tok_ns_to_s_map): tok_s_to_ns_map[tok_index] = i orig_start_position = None if start_position in tok_s_to_ns_map: ns_start_position = tok_s_to_ns_map[start_position] if ns_start_position in orig_ns_to_s_map: orig_start_position = orig_ns_to_s_map[ns_start_position] if orig_start_position is None: if verbose: print("Couldn't map start position") return orig_text orig_end_position = None if end_position in tok_s_to_ns_map: ns_end_position = tok_s_to_ns_map[end_position] if ns_end_position in orig_ns_to_s_map: orig_end_position = orig_ns_to_s_map[ns_end_position] if orig_end_position is None: if verbose: print("Couldn't map end position") return orig_text output_text = orig_text[orig_start_position:(orig_end_position + 1)] return output_text def _get_best_indexes(logits, n_best_size): """Get the n-best logits from a list.""" index_and_score = sorted( enumerate(logits), key=lambda x: x[1], reverse=True) best_indexes = [] for i in range(len(index_and_score)): if i >= n_best_size: break best_indexes.append(index_and_score[i][0]) return best_indexes def _compute_softmax(scores): """Compute softmax probability over raw logits.""" if not scores: return [] max_score = None for score in scores: if max_score is None or score > max_score: max_score = score exp_scores = [] total_sum = 0.0 for score in scores: x = math.exp(score - max_score) exp_scores.append(x) total_sum += x probs = [] for score in exp_scores: probs.append(score / total_sum) return probs ================================================ FILE: paddlepalm/head/ner.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import paddle.fluid as fluid from paddle.fluid import layers from paddlepalm.head.base_head import Head import numpy as np import os import math class SequenceLabel(Head): ''' Sequence label ''' def __init__(self, num_classes, input_dim, dropout_prob=0.0, learning_rate=1e-3, \ param_initializer_range=0.02, phase='train'): """ Args: phase: train, eval, pred lang: en, ch, ... 
""" self._is_training = phase == 'train' self._hidden_size = input_dim self.num_classes = num_classes self._dropout_prob = dropout_prob if phase == 'train' else 0.0 self._param_initializer = fluid.initializer.TruncatedNormal( scale=param_initializer_range) self.learning_rate = learning_rate self._preds = [] @property def inputs_attrs(self): reader = {} bb = {"encoder_outputs": [[-1, -1, -1], 'float32']} if self._is_training: reader["label_ids"] = [[-1, -1], 'int64'] reader["seq_lens"] = [[-1], 'int64'] return {'reader': reader, 'backbone': bb} @property def outputs_attrs(self): if self._is_training: return {'loss': [[1], 'float32']} else: return {'logits': [[-1, -1, self.num_classes], 'float32']} def build(self, inputs, scope_name=''): token_emb = inputs['backbone']['encoder_outputs'] if self._is_training: label_ids = inputs['reader']['label_ids'] seq_lens = inputs['reader']['seq_lens'] emission = fluid.layers.fc( size=self.num_classes, input=token_emb, param_attr=fluid.ParamAttr( initializer=self._param_initializer, regularizer=fluid.regularizer.L2DecayRegularizer( regularization_coeff=1e-4)), bias_attr=fluid.ParamAttr( name=scope_name+"cls_out_b", initializer=fluid.initializer.Constant(0.)), num_flatten_dims=2) if self._is_training: # compute loss crf_cost = fluid.layers.linear_chain_crf( input=emission, label=label_ids, param_attr=fluid.ParamAttr( name=scope_name+'crfw', learning_rate=self.learning_rate), length=seq_lens) avg_cost = fluid.layers.mean(x=crf_cost) crf_decode = fluid.layers.crf_decoding( input=emission, param_attr=fluid.ParamAttr(name=scope_name+'crfw'), length=seq_lens) (precision, recall, f1_score, num_infer_chunks, num_label_chunks, num_correct_chunks) = fluid.layers.chunk_eval( input=crf_decode, label=label_ids, chunk_scheme="IOB", num_chunk_types=int(math.ceil((self.num_classes - 1) / 2.0)), seq_length=seq_lens) chunk_evaluator = fluid.metrics.ChunkEvaluator() chunk_evaluator.reset() return {"loss": avg_cost} else: return {"logits": emission} def batch_postprocess(self, rt_outputs): if not self._is_training: emission = rt_outputs['emission'] preds = np.argmax(emission, -1) self._preds.extend(preds.tolist()) def epoch_postprocess(self, post_inputs, output_dir=None): # there is no post_inputs needed and not declared in epoch_inputs_attrs, hence no elements exist in post_inputs if not self._is_training: if output_dir is not None: with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer: for p in self._preds: writer.write(str(p)+'\n') print('Predictions saved at '+os.path.join(output_dir, 'predictions.json')) return self._preds ================================================ FILE: paddlepalm/lr_sched/__init__.py ================================================ from .slanted_triangular_schedualer import TriangularSchedualer from .warmup_schedualer import WarmupSchedualer ================================================ FILE: paddlepalm/lr_sched/base_schedualer.py ================================================ class Schedualer(): def __init__(self): self._prog = None def _set_prog(self, prog): self._prog = prog def _build(self, learning_rate): raise NotImplementedError() ================================================ FILE: paddlepalm/lr_sched/slanted_triangular_schedualer.py ================================================ from paddlepalm.lr_sched.base_schedualer import Schedualer from paddle import fluid class TriangularSchedualer(Schedualer): """ Implementation of Slanted Triangular learning rate schedual method, more details refer to 
    https://arxiv.org/pdf/1801.06146.pdf . Applies linear warmup of the learning rate from 0 to
    learning_rate until warmup_steps, and then decays it to 0 linearly until num_train_steps."""

    def __init__(self, warmup_steps, num_train_steps):
        """Create a new TriangularSchedualer object.

        Args:
            warmup_steps: the learning rate will grow from 0 to max_learning_rate over `warmup_steps` steps.
            num_train_steps: the number of train steps.
        """
        Schedualer.__init__(self)
        assert num_train_steps > warmup_steps > 0
        self.warmup_steps = warmup_steps
        self.num_train_steps = num_train_steps

    def _build(self, learning_rate):
        with self._prog._lr_schedule_guard():
            lr = fluid.layers.tensor.create_global_var(
                shape=[1],
                value=0.0,
                dtype='float32',
                persistable=True,
                name="scheduled_learning_rate")

            global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()

            with fluid.layers.control_flow.Switch() as switch:
                with switch.case(global_step < self.warmup_steps):
                    warmup_lr = learning_rate * (global_step / self.warmup_steps)
                    fluid.layers.tensor.assign(warmup_lr, lr)
                with switch.default():
                    decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
                        learning_rate=learning_rate,
                        decay_steps=self.num_train_steps,
                        end_learning_rate=0.0,
                        power=1.0,
                        cycle=False)
                    fluid.layers.tensor.assign(decayed_lr, lr)

            return lr

================================================
FILE: paddlepalm/lr_sched/warmup_schedualer.py
================================================
from paddlepalm.lr_sched.base_schedualer import Schedualer
import paddle.fluid as fluid

class WarmupSchedualer(Schedualer):
    """ Applies linear warmup of the learning rate from 0 to learning_rate until warmup_steps,
    and then keeps it constant (see the default switch branch below)."""

    def __init__(self, warmup_steps):
        Schedualer.__init__(self)
        self.warmup_steps = warmup_steps

    def _build(self, learning_rate):
        with self._prog._lr_schedule_guard():
            lr = fluid.layers.tensor.create_global_var(
                shape=[1],
                value=0.0,
                dtype='float32',
                persistable=True,
                name="scheduled_learning_rate")

            global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()

            with fluid.layers.control_flow.Switch() as switch:
                with switch.case(global_step < self.warmup_steps):
                    warmup_lr = learning_rate * (global_step / self.warmup_steps)
                    fluid.layers.tensor.assign(warmup_lr, lr)
                with switch.default():
                    fluid.layers.tensor.assign(learning_rate, lr)

            return lr

================================================
FILE: paddlepalm/multihead_trainer.py
================================================
from paddle import fluid
from paddle.fluid import layers
from paddlepalm.distribute import gpu_dev_count, cpu_dev_count, data_feeder, decode_fake
from paddlepalm import Trainer
from paddlepalm.utils import reader_helper
import numpy as np
import time
import sys

dev_count = 1 if gpu_dev_count <= 1 else gpu_dev_count
VERBOSE=False

class MultiHeadTrainer(Trainer):
    """
    The core unit to start a multi-task training/predicting session. A MultiHeadTrainer is built based on several Trainers. Beyond the inheritance of Trainer, it additionally achieves model backbone reuse across tasks, trainer sampling for multi-task learning, and multi-head inference for effective evaluation and prediction.
    """

    def __init__(self, trainers):
        """Create a new multi_head_trainer.

        Args:
            trainers: a list of Trainer objects.
""" Trainer.__init__(self, '') self._trainers = trainers name_maxlen = max([len(i.name) for i in self._trainers]) self._name_pads = {i.name: name_maxlen-len(i.name) for i in self._trainers} self._train_init = False self._dist_train_init = False self._predict_init = False self._feeded_var_names = None self._cur_train_step = 0 self._target_vars = None self._inputname_to_varname = {} self._pred_input_name_list = [] self._pred_input_varname_list = [] self._pred_fetch_name_list = [] self._pred_fetch_var_list = [] self._exe = None self._save_protocol = { 'input_names': 'self._pred_input_name_list', 'input_varnames': 'self._pred_input_varname_list', 'fetch_list': 'self._pred_fetch_name_list'} self._check_save = lambda: False for t in self._trainers: t._set_multitask() def build_forward(self): """ Build forward computation graph for training, which usually built from input layer to loss node. Return: - loss_var: a Variable object. The computational graph variable(node) of loss. """ head_dict = {} backbone = self._trainers[0]._backbone for i in self._trainers: assert i._task_head is not None and i._backbone is not None, "You should build forward for the {} task".format(i._name) assert i._backbone == backbone, "The backbone for each task must be the same" head_dict[i._name] = i._task_head train_prog = fluid.Program() train_init_prog = fluid.Program() self._train_prog = train_prog self._train_init_prog = train_init_prog def get_loss(i): head = head_dict[self._trainers[i].name] self._trainers[i]._lock_prog = True loss_var = self._trainers[i].build_forward(backbone, head) self._trainers[i]._lock_prog = False return loss_var task_fns = {i: lambda i=i: get_loss(i) for i in range(len(self._trainers))} with fluid.program_guard(train_prog, train_init_prog): task_id_var = fluid.data(name="__task_id",shape=[1],dtype='int64') loss_var = layers.switch_case( branch_index=task_id_var, branch_fns=task_fns ) self._task_id_var = task_id_var self._loss_var = loss_var self._fetch_list = [loss_var.name] if not self._multi_task: self._init_exe_prog(for_train=True) return loss_var def build_predict_forward(self): head_dict = {} backbone = self._trainers[0]._pred_backbone for i in self._trainers: assert i._pred_head is not None and i._pred_backbone is not None, "You should build_predict_forward for the {} task".format(i._name) assert i._pred_backbone == backbone, "The backbone for each task must be the same" head_dict[i._name] = i._pred_head pred_prog = fluid.Program() pred_init_prog = fluid.Program() self._pred_prog = pred_prog self._pred_init_prog = pred_init_prog def get_loss(i): head = head_dict[self._trainers[i].name] self._trainers[i]._lock_prog = True pred_vars = self._trainers[i].build_predict_forward(backbone, head) self._trainers[i]._lock_prog = False # return loss_var task_fns = {i: lambda i=i: get_loss(i) for i in range(len(self._trainers))} with fluid.program_guard(pred_prog, pred_init_prog): task_id_var = fluid.data(name="__task_id",shape=[1],dtype='int64') loss_var = layers.switch_case( branch_index=task_id_var, branch_fns=task_fns ) if not self._multi_task: self._init_exe_prog(for_train=False) def merge_inference_readers(self, readers): for r in readers: assert r._phase == 'predict' if isinstance(readers, list): reader_dict = {k.name: v for k,v in zip(self._trainers, readers)} elif isinstance(readers, dict): reader_dict = readers else: raise ValueError() num_heads = len(self._trainers) assert len(reader_dict) == num_heads, "received number of readers is not consistent with trainers." 
trainer_dict = {t.name: t for t in self._trainers} task_name2id = {t.name: idx for idx, t in enumerate(self._trainers)} self._task_name2id = task_name2id self._finish_steps = {} self._finish = {} input_names = [] name_to_pos = [] joint_shape_and_dtypes = [] iterators = [] prefixes = [] mrs = [] net_inputs = [] global_steps = 0 for t in self._trainers: assert t.name in reader_dict assert reader_dict[t.name].num_epochs is None, "{}: num_epochs is not None. \ To run with multi-head mode, num_epochs of each Trainer should be set as None.".format(t.name) # print(num_epochs, t.mix_ratio, base_steps_pur_epoch) self._finish_steps[t.name] = 9999999999 self._finish[t.name] = True # t._set_task_id(self._task_id_var) t.fit_reader(reader_dict[t.name], phase='predict') net_inputs.append(t._pred_net_inputs) prefixes.append(t.name) iterators.append(t._raw_iterator_fn()) input_names.append(t._pred_input_names) name_to_pos.append(t._pred_name_to_position) joint_shape_and_dtypes.append(t._pred_shape_and_dtypes) iterator_fn = reader_helper.create_multihead_inference_fn(iterators, prefixes, joint_shape_and_dtypes, \ input_names, name_to_pos, task_name2id, dev_count=dev_count) feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs) if gpu_dev_count > 1: raise NotImplementedError('currently only single-gpu mode has been supported running with multi-task mode.') # distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase, is_multi=True, with_arg=True) else: distribute_feeder_fn = iterator_fn self._predict_iterator_fn = distribute_feeder_fn self._pred_feed_batch_process_fn = feed_batch_process_fn return distribute_feeder_fn def fit_readers_with_mixratio(self, readers, sampling_reference, num_epochs, phase='train'): """ Bind readers and loaded train/predict data to trainers. The `num_epochs` argument only works on `sampling_reference` task(trainer), and num_epochs of other tasks are infered from their `mix_ratio`. Args: readers: a dict or list of Reader objects. For dict case, each key is a trainer's name, and the mapped value is the reader object to bind to the trainer. For list case, each sampling_reference: a trainer name. The task(trainer) selected as baseline for task sampling. num_epochs: training epochs of the sampling_reference task (trainer). """ self._check_phase(phase) if isinstance(readers, list): reader_dict = {k.name: v for k,v in zip(self._trainers, readers)} elif isinstance(readers, dict): reader_dict = readers else: raise ValueError() num_heads = len(self._trainers) assert len(reader_dict) == num_heads, "received number of readers is not consistent with trainers." trainer_dict = {t.name: t for t in self._trainers} assert sampling_reference in trainer_dict trainer_dict[sampling_reference]._set_task_id(self._task_id_var) trainer_dict[sampling_reference].fit_reader(reader_dict[sampling_reference]) base_steps_pur_epoch = trainer_dict[sampling_reference]._steps_pur_epoch self._finish_steps = {} self._finish = {} input_names = [] name_to_pos = [] joint_shape_and_dtypes = [] iterators = [] prefixes = [] mrs = [] net_inputs = [] global_steps = 0 for t in self._trainers: assert t.name in reader_dict assert reader_dict[t.name].num_epochs is None, "{}: num_epochs is not None. 
\ To run with multi-head mode, num_epochs of each Trainer should be set as None.".format(t.name) # print(num_epochs, t.mix_ratio, base_steps_pur_epoch) max_train_steps = int(num_epochs * t.mix_ratio * base_steps_pur_epoch) if not t._as_auxilary: print('{}: expected train steps {}.'.format(t.name, max_train_steps)) sys.stdout.flush() self._finish_steps[t.name] = max_train_steps self._finish[t.name] = False else: self._finish_steps[t.name] = 9999999999 self._finish[t.name] = True global_steps += max_train_steps if t.name != sampling_reference: t._set_task_id(self._task_id_var) t.fit_reader(reader_dict[t.name]) net_inputs.append(t._net_inputs) prefixes.append(t.name) mrs.append(t.mix_ratio) iterators.append(t._raw_iterator_fn()) input_names.append(t._input_names) name_to_pos.append(t._name_to_position) joint_shape_and_dtypes.append(t._shape_and_dtypes) print('Estimated overall train steps {}.'.format(global_steps)) sys.stdout.flush() self._overall_train_steps = global_steps iterator_fn = reader_helper.create_multihead_iterator_fn(iterators, prefixes, joint_shape_and_dtypes, \ mrs, input_names, name_to_pos, dev_count=dev_count) feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs) if gpu_dev_count > 1: distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase, is_multi=True) else: distribute_feeder_fn = iterator_fn() if phase == 'train': self._train_reader = distribute_feeder_fn self._feed_batch_process_fn = feed_batch_process_fn elif phase == 'predict': self._predict_reader = distribute_feeder_fn self._pred_feed_batch_process_fn = feed_batch_process_fn return distribute_feeder_fn def _check_finish(self, task_name, silent=False): trainers = {t.name:t for t in self._trainers} if trainers[task_name]._cur_train_step == self._finish_steps[task_name]: if not silent: print(task_name+' train finish!') sys.stdout.flush() self._finish[task_name]=True flags = list(set(self._finish.values())) return len(flags) == 1 and flags[0] == True def train(self, print_steps=5): """ start training. Args: print_steps: int. Logging frequency of training message, e.g., current step, loss and speed. 
""" iterator = self._train_reader self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name) for t in self._trainers: t._dist_train_init = True t._set_exe(self._exe) t._set_dist_train(self._distribute_train_prog) t._set_fetch_list(self._fetch_list) time_begin = time.time() for feed in iterator: # batch, task_id = feed rt_outputs, task_id = self.train_one_step(feed) task_rt_outputs = {k[len(self._trainers[task_id].name+'.'):]: v for k,v in rt_outputs.items() if k.startswith(self._trainers[task_id].name+'.')} self._trainers[task_id]._task_head.batch_postprocess(task_rt_outputs) if print_steps > 0 and self._cur_train_step % print_steps == 0: loss = rt_outputs[self._trainers[task_id].name+'.loss'] loss = np.mean(np.squeeze(loss)).tolist() time_end = time.time() time_cost = time_end - time_begin print("global step: {}, {}: step {}/{} (epoch {}), loss: {:.3f}, speed: {:.2f} steps/s".format( self._cur_train_step, ' '*self._name_pads[self._trainers[task_id].name]+self._trainers[task_id].name, \ (self._trainers[task_id]._cur_train_step-1) % self._trainers[task_id]._steps_pur_epoch + 1, \ self._trainers[task_id]._steps_pur_epoch, self._trainers[task_id]._cur_train_epoch, \ loss, print_steps / time_cost)) sys.stdout.flush() time_begin = time.time() self._check_save() finish = self._check_finish(self._trainers[task_id].name) if finish: break def train_one_step(self, batch): if not self._dist_train_init: self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name) for t in self._trainers: t._dist_train_init = True t._set_exe(self._exe) t._set_dist_train(self._distribute_train_prog) t._set_fetch_list(self._fetch_list) self._dist_train_init = True if dev_count > 1: assert isinstance(batch, tuple) task_id = batch[0][0]['__task_id'][0] else: assert isinstance(batch, dict) task_id = batch['__task_id'][0] rt_outputs = self._trainers[task_id].train_one_step(batch) self._cur_train_step += 1 self._check_save() return rt_outputs, task_id def predict_one_batch(self, task_name, batch): if dev_count > 1: raise NotImplementedError() # batch = next(self._predict_iterator_fn(task_name)) t = self._trainers[self._task_name2id[task_name]] # t._set_exe(self._exe) t._set_dist_pred(self._trainers[self._task_name2id[task_name]]._pred_prog) rt_outputs = t.predict_one_batch(batch) return rt_outputs def predict(self, output_dir=None, print_steps=1000): raise NotImplementedError() # iterator = self._predict_iterator # self._distribute_pred_prog = fluid.CompiledProgram(self._pred_prog).with_data_parallel() @property def overall_train_steps(self): return self._overall_train_steps ================================================ FILE: paddlepalm/optimizer/__init__.py ================================================ from .adam import Adam ================================================ FILE: paddlepalm/optimizer/adam.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. """Optimization and learning rate scheduling.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy as np import paddle.fluid as fluid from paddlepalm.optimizer.base_optimizer import Optimizer class Adam(Optimizer): def __init__(self, loss_var, lr, lr_schedualer=None): Optimizer.__init__(self, loss_var, lr, lr_schedualer=None) self._loss = loss_var self._lr = lr self._lr_schedualer = lr_schedualer def _build(self, grad_clip=None): if self._lr_schedualer is not None: self._lr = self._lr_schedualer._build(self._lr) optimizer = fluid.optimizer.Adam(learning_rate=self._lr) if grad_clip is not None: clip_norm_thres = grad_clip # When using mixed precision training, scale the gradient clip threshold # by loss_scaling fluid.clip.set_gradient_clip( clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm_thres)) _, param_grads = optimizer.minimize(self._loss) return param_grads def get_cur_learning_rate(self): return self._lr ================================================ FILE: paddlepalm/optimizer/base_optimizer.py ================================================ class Optimizer(object): def __init__(self, loss_var, lr, lr_schedualer=None): self._prog = None self._lr_schedualer = lr_schedualer def _build(self, grad_clip=None): raise NotImplementedError() def _set_prog(self, prog, init_prog): self._prog = prog self._init_prog = prog if self._lr_schedualer is not None: self._lr_schedualer._set_prog(prog) def get_cur_learning_rate(self): pass ================================================ FILE: paddlepalm/reader/__init__.py ================================================ from .cls import ClassifyReader from .match import MatchReader from .seq_label import SequenceLabelReader from .mrc import MRCReader from .mlm import MaskLMReader ================================================ FILE: paddlepalm/reader/base_reader.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
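The Adam optimizer and the LR schedualers above are normally wired together through a Trainer. A hedged sketch of that wiring, based only on the constructors shown in this dump; `trainer`, its forward graph, and the `build_backward(optimizer=..., weight_decay=...)` call follow the repo's examples and are assumptions here, as is the `palm.optimizer` / `palm.lr_sched` exposure on the top-level package:

```python
import paddlepalm as palm

# assumed to exist from earlier in the script:
#   trainer = palm.Trainer('senti_cls')
#   loss_var = trainer.build_forward(backbone, cls_head)
sched = palm.lr_sched.TriangularSchedualer(warmup_steps=1000, num_train_steps=10000)
adam = palm.optimizer.Adam(loss_var, lr=5e-5, lr_schedualer=sched)
trainer.build_backward(optimizer=adam, weight_decay=0.01)
```

Note that TriangularSchedualer asserts `num_train_steps > warmup_steps > 0`, so the schedule can only be built once the total number of training steps is known.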
from copy import copy

class Reader(object):
    """interface of data reader."""

    def __init__(self, phase='train'):
        """Construct a Reader; at least a `phase` argument is required.
        Note: an implementation of this constructor must call the base-class constructor,
        so that the necessary framework built-in member variables are created.

        Args:
            phase: str. The running phase this reader serves; currently the training phase (train) and the prediction phase (predict) are supported.
        """
        self._phase = phase
        self._batch_size = None
        self._num_epochs = 1
        self._register = set()
        self._registered_backbone = None

    @classmethod
    def create_register(cls):
        return set()

    def clone(self, phase='train'):
        """Clone a new reader object."""
        if phase == self._phase:
            return copy(self)
        else:
            ret = copy(self)
            ret._phase = phase
            return ret

    def require_attr(self, attr_name):
        """Add an attribute to produce to the register.

        Args:
            attr_name: the name of the attribute to produce, e.g., 'segment_ids'.
        """
        self._register.add(attr_name)

    def register_with(self, backbone):
        """Register every input attribute that the given backbone depends on.

        Args:
            backbone: the backbone to attach to.
        """
        for attr in backbone.inputs_attr:
            self.require_attr(attr)
        self._registered_backbone = backbone

    def get_registered_backbone(self):
        """Return the backbone registered with this reader."""
        return self._registered_backbone

    def _get_registed_attrs(self, attrs):
        ret = {}
        for i in self._register:
            if i not in attrs:
                raise NotImplementedError('output attr {} is not found in this reader.'.format(i))
            ret[i] = attrs[i]
        return ret

    def load_data(self, input_file, batch_size, num_epochs=None, \
                  file_format='tsv', shuffle_train=True):
        """Load the on-disk dataset into the reader.
        Note: an implementation of this method must also create self._batch_size and self._num_epochs.

        Args:
            input_file: the dataset file path. The file format must satisfy the `file_format` argument.
            batch_size: the number of examples yielded by the iterator per step. Note: when multiple GPUs exist in the environment, batch_size must be divisible by the number of GPU cards.
            num_epochs: the number of dataset traversals. Default is None, which means one traversal in single-task mode; in multi-task mode this argument is assigned automatically by the upper Trainer. This argument only works in the training phase.
            file_format: the format of the input file. Currently supported format: tsv. Default is tsv.
            shuffle_train: whether to shuffle the examples of the training set. Default is True. This argument only works in the training phase.
        """
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Describe the attributes of the reader's output objects (the yielded objects), including each object's name, shape and dtype. For an object of a scalar type (e.g., str, int, float), the shape should be set to an empty list [], and for an object with a variable-length dimension, the corresponding dimension of the shape should be set to -1.
        Note: when the mini-batch gradient descent strategy is used, a batch_size dimension (usually -1) should be set for regular input objects.

        Return:
            dict. The attribute description of each output object. For example, for text classification and matching tasks, the yielded outputs may contain the following objects (downstream backbones and task heads access them on demand):
            {"token_ids": ([-1, max_len], 'int64'),
             "input_ids": ([-1, max_len], 'int64'),
             "segment_ids": ([-1, max_len], 'int64'),
             "input_mask": ([-1, max_len], 'float32'),
             "label": ([-1], 'int')}
        """
        raise NotImplementedError()

    def _iterator(self):
        """The dataset traversal interface. Note: when the traversal reaches the tail of the dataset, this interface should automatically reset the pointer, i.e., restart a new traversal from the head of the dataset.

        Yield:
            dict. The output objects of the current step, consistent with outputs_attr.
        """
        raise NotImplementedError()

    def get_epoch_outputs(self):
        """Return the outputs collected after each epoch traversal of the dataset."""
        raise NotImplementedError()

    @property
    def num_examples(self):
        """The number of examples in the dataset, i.e., the number of examples the iterator generates per epoch. Note: when strategies that may change the number of examples (e.g., sliding window) are used, this interface should return the actual number of examples at runtime."""
        raise NotImplementedError()

    @property
    def num_epochs(self):
        """The number of dataset traversals."""
        return self._num_epochs

================================================
FILE: paddlepalm/reader/cls.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
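With the base-class contract above in mind, a toy subclass makes the interface concrete. This is a hedged sketch rather than PaddlePALM API: `ToyReader` and its in-memory sample list are invented for illustration; real readers like `ClassifyReader` below delegate tokenization and batching to helper readers.

```python
from paddlepalm.reader.base_reader import Reader

class ToyReader(Reader):
    """A minimal reader that yields already-tokenized examples from memory."""

    def __init__(self, samples, phase='train'):
        Reader.__init__(self, phase)   # creates _register and friends
        self._samples = samples        # list of (token_ids, label) pairs
        self._register.add('token_ids')
        if phase == 'train':
            self._register.add('label_ids')

    @property
    def outputs_attr(self):
        attrs = {'token_ids': [[-1, -1], 'int64'],
                 'label_ids': [[-1], 'int64']}
        return self._get_registed_attrs(attrs)

    def load_data(self, input_file, batch_size, num_epochs=None,
                  file_format='tsv', shuffle_train=True):
        # data already lives in memory; just record the required fields
        self._batch_size = batch_size
        self._num_epochs = num_epochs

    def _iterator(self):
        # batching is left out to keep the sketch short
        for token_ids, label in self._samples:
            yield {'token_ids': [token_ids], 'label_ids': [label]}

    @property
    def num_examples(self):
        return len(self._samples)
```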
from paddlepalm.reader.base_reader import Reader from paddlepalm.reader.utils.reader4ernie import ClassifyReader as CLSReader class ClassifyReader(Reader): """ The reader completes the loading and processing of text classification dataset. Supported file format: tsv. For tsv format, training dataset file should have two header areas, i.e., `label` and `text`, and test set only requires `text` area. For example, ``` label [TAB] text 1 [TAB] Today is a good day. 0 [TAB] Such a terriable day! 1 [TAB] I feel lucky to meet you, dear. 1 [TAB] He likes sunshine and I like him :). 0 [TAB] JUST! GO! OUT! ``` CAUTIOUS: The first line of the file must be header! And areas are splited by tab (\\t). """ def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \ lang='en', seed=None, do_lower_case=False, phase='train'): """Create a new Reader for loading and processing classification task data. Args: vocab_path: the vocab file path to do tokenization and token_ids generation. max_len: The maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from right. tokenizer: string type. The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Avaliable tokenizers: wordpiece. lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). seed: int type. The random seed to shuffle dataset. Default is None, means no use of random seed. do_lower_case: bool type. Whether to do lowercase on English text. Default is False. This argument only works on English text. phase: the running phase of this reader. Supported phase: train, predict. Default is train. Return: a Reader object for classification task. """ Reader.__init__(self, phase) assert lang.lower() in ['en', 'cn', 'english', 'chinese'], "supported language: en (English), cn (Chinese)." assert phase in ['train', 'predict'], "supported phase: train, predict." for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese' self._register.add('token_ids') if phase == 'train': self._register.add('label_ids') self._is_training = phase == 'train' cls_reader = CLSReader(vocab_path, max_seq_len=max_len, do_lower_case=do_lower_case, for_cn=for_cn, random_seed=seed) self._reader = cls_reader self._phase = phase # self._batch_size = # self._print_first_n = config.get('print_first_n', 0) @property def outputs_attr(self): """The contained output items (input features) of this reader.""" attrs = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], "label_ids": [[-1], 'int64'], "task_ids": [[-1, -1], 'int64'] } return self._get_registed_attrs(attrs) def load_data(self, input_file, batch_size, num_epochs=None, \ file_format='tsv', shuffle_train=True): """Load classification data into reader. Args: input_file: the dataset file path. File format should keep consistent with `file_format` argument. batch_size: number of examples for once yield. CAUSIOUS! If your environment exists multiple GPU devices (marked as dev_count), the batch_size should be divided by dev_count with no remainder! num_epochs: the travelsal times of input examples. Default is None, means once for single-task learning and automatically calculated for multi-task learning. This argument only works on train phase. file_format: the file format of input file. Supported format: tsv. Default is tsv. shuffle_train: whether to shuffle training dataset. Default is True. This argument only works on training phase. 
""" self._batch_size = batch_size self._num_epochs = num_epochs self._data_generator = self._reader.data_generator( \ input_file, batch_size, num_epochs if self._phase == 'train' else 1, \ shuffle=shuffle_train if self._phase == 'train' else False, \ phase=self._phase) def _iterator(self): names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', 'label_ids', 'unique_ids'] for batch in self._data_generator(): outputs = {n: i for n,i in zip(names, batch)} ret = {} # TODO: move runtime shape check here for attr in self.outputs_attr.keys(): ret[attr] = outputs[attr] yield ret def get_epoch_outputs(self): return {'examples': self._reader.get_examples(self._phase), 'features': self._reader.get_features(self._phase)} @property def num_examples(self): return self._reader.get_num_examples(phase=self._phase) @property def num_epochs(self): return self._num_epochs ================================================ FILE: paddlepalm/reader/match.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from paddlepalm.reader.base_reader import Reader from paddlepalm.reader.utils.reader4ernie import ClassifyReader as CLSReader class MatchReader(Reader): """ The reader completes the loading and processing of matching-like task (e.g, query-query, question-answer, text similarity, natural language inference) dataset. Supported file format: tsv. For pointwise learning strategy, there should be two fields in training dataset file, i.e., `text_a`, `text_b` and `label`. For pairwise learning, there should exist three fields, i.e., `text_a`, `text_b` and `text_b_neg`. For predicting, only `text_a` and `text_b` are required. A pointwise learning case shows as follows: ``` label [TAB] text_a [TAB] text_b 1 [TAB] Today is a good day. [TAB] what a nice day! 0 [TAB] Such a terriable day! [TAB] There is a dog. 1 [TAB] I feel lucky to meet you, dear. [TAB] You are my lucky, darling. 1 [TAB] He likes sunshine and I like him :). [TAB] I like him. He like sunshine. 0 [TAB] JUST! GO! OUT! [TAB] Come in please. ``` A pairwise learning case shows as follows: text_a [TAB] text_b [TAB] text_b_neg Today is a good day. [TAB] what a nice day! [TAB] terriable day! Such a terriable day! [TAB] So terriable today! [TAB] There is a dog. I feel lucky to meet you, dear. [TAB] You are my lucky, darling. [TAB] Buy some bananas, okey? He likes sunshine and I like him :). [TAB] I like him. He like sunshine. [TAB] He has a dog. JUST! GO! OUT! [TAB] go out now! [TAB] Come in please. CAUTIOUS: the HEADER is required for each dataset file! And fields (columns) should be splited by Tab (\\t). """ def __init__(self, vocab_path, max_len, tokenizer='wordpiece', lang='en', seed=None, \ do_lower_case=False, learning_strategy='pointwise', phase='train', dev_count=1, print_prefix=''): """Create a new Reader for classification task data. Args: vocab_path: the vocab file path to do tokenization and token_ids generation. 
max_len: The maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from right. tokenizer: string type. The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Avaliable tokenizers: wordpiece. lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). seed: int type. The random seed to shuffle dataset. Default is None, means no use of random seed. do_lower_case: bool type. Whether to do lowercase on English text. Default is False. This argument only works on English text. learning_strategy: string type. This only works for training phase. Available strategies: pointwise, pairwise. phase: the running phase of this reader. Supported phase: train, predict. Default is train. Return: a Reader object for matching-like task. """ Reader.__init__(self, phase) assert lang.lower() in ['en', 'cn', 'english', 'chinese'], "supported language: en (English), cn (Chinese)." assert phase in ['train', 'predict'], "supported phase: train, predict." for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese' self._register.add('token_ids') if phase == 'train': if learning_strategy == 'pointwise': self._register.add('label_ids') if learning_strategy == 'pairwise': self._register.add('token_ids_neg') self._register.add('position_ids_neg') self._register.add('segment_ids_neg') self._register.add('input_mask_neg') self._register.add('task_ids_neg') self._is_training = phase == 'train' self._learning_strategy = learning_strategy match_reader = CLSReader(vocab_path, max_seq_len=max_len, do_lower_case=do_lower_case, for_cn=for_cn, random_seed=seed, learning_strategy = learning_strategy) self._reader = match_reader self._dev_count = dev_count self._phase = phase @property def outputs_attr(self): attrs = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], "task_ids": [[-1, -1], 'int64'], "label_ids": [[-1], 'int64'], "token_ids_neg": [[-1, -1], 'int64'], "position_ids_neg": [[-1, -1], 'int64'], "segment_ids_neg": [[-1, -1], 'int64'], "input_mask_neg": [[-1, -1, 1], 'float32'], "task_ids_neg": [[-1, -1], 'int64'] } return self._get_registed_attrs(attrs) def load_data(self, input_file, batch_size, num_epochs=None, \ file_format='tsv', shuffle_train=True): """Load matching data into reader. Args: input_file: the dataset file path. File format should keep consistent with `file_format` argument. batch_size: number of examples for once yield. CAUSIOUS! If your environment exists multiple GPU devices (marked as dev_count), the batch_size should be divided by dev_count with no remainder! num_epochs: the travelsal times of input examples. Default is None, means once for single-task learning and automatically calculated for multi-task learning. This argument only works on train phase. file_format: the file format of input file. Supported format: tsv. Default is tsv. shuffle_train: whether to shuffle training dataset. Default is True. This argument only works on training phase. 
""" self._batch_size = batch_size self._num_epochs = num_epochs self._data_generator = self._reader.data_generator( \ input_file, batch_size, num_epochs if self._phase == 'train' else 1, \ shuffle=shuffle_train if self._phase == 'train' else False, \ phase=self._phase) def _iterator(self): names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', 'label_ids', \ 'token_ids_neg', 'segment_ids_neg', 'position_ids_neg', 'task_ids_neg', 'input_mask_neg'] if self._learning_strategy == 'pairwise': names.remove('label_ids') for batch in self._data_generator(): outputs = {n: i for n,i in zip(names, batch)} ret = {} # TODO: move runtime shape check here for attr in self.outputs_attr.keys(): ret[attr] = outputs[attr] yield ret @property def num_examples(self): return self._reader.get_num_examples(phase=self._phase) @property def num_epochs(self): return self._num_epochs ================================================ FILE: paddlepalm/reader/mlm.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from paddlepalm.reader.base_reader import Reader from paddlepalm.reader.utils.reader4ernie import MaskLMReader as MLMReader import numpy as np class MaskLMReader(Reader): def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \ lang='en', seed=None, do_lower_case=False, phase='train', dev_count=1, print_prefix=''): """ Args: phase: train, eval, pred """ Reader.__init__(self, phase) assert lang.lower() in ['en', 'cn', 'english', 'chinese'], "supported language: en (English), cn (Chinese)." assert phase in ['train', 'predict'], "supported phase: train, predict." 
        for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese'

        self._register.add('mask_pos')
        if phase == 'train':
            self._register.add('mask_label')
        self._is_training = phase == 'train'

        mlm_reader = MLMReader(vocab_path,
                               max_seq_len=max_len,
                               do_lower_case=do_lower_case,
                               for_cn=for_cn,
                               random_seed=seed)
        self._reader = mlm_reader
        self._phase = phase
        self._dev_count = dev_count

    @property
    def outputs_attr(self):
        attrs = {"token_ids": [[-1, -1], 'int64'],
                 "position_ids": [[-1, -1], 'int64'],
                 "segment_ids": [[-1, -1], 'int64'],
                 "input_mask": [[-1, -1, 1], 'float32'],
                 "task_ids": [[-1, -1], 'int64'],
                 "mask_label": [[-1], 'int64'],
                 "mask_pos": [[-1], 'int64']
                 }
        return self._get_registed_attrs(attrs)

    def load_data(self, input_file, batch_size, num_epochs=None, \
                  file_format='csv', shuffle_train=True):
        self._batch_size = batch_size
        self._num_epochs = num_epochs
        self._data_generator = self._reader.data_generator( \
                input_file, batch_size, num_epochs if self._phase == 'train' else 1, \
                shuffle=shuffle_train if self._phase == 'train' else False, \
                phase=self._phase)

    def _iterator(self):
        names = ['token_ids', 'position_ids', 'segment_ids', 'input_mask', 'task_ids',
                 'mask_label', 'mask_pos']
        for batch in self._data_generator():
            outputs = {n: i for n, i in zip(names, batch)}
            ret = {}
            # TODO: move runtime shape check here
            for attr in self.outputs_attr.keys():
                ret[attr] = outputs[attr]
            yield ret

    def get_epoch_outputs(self):
        return {'examples': self._reader.get_examples(self._phase),
                'features': self._reader.get_features(self._phase)}

    @property
    def num_examples(self):
        return self._reader.get_num_examples(phase=self._phase)

    @property
    def num_epochs(self):
        return self._num_epochs

================================================
FILE: paddlepalm/reader/mrc.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from paddlepalm.reader.base_reader import Reader
from paddlepalm.reader.utils.reader4ernie import MRCReader as MRCReader_t
import numpy as np


class MRCReader(Reader):
    """
    The reader completes the loading and processing of SQuAD-like machine reading
    comprehension datasets. Supported file format: json.

    The outermost structure of a dataset is a dictionary containing a version field and a
    data field. In the data field, each example contains the title of an article and several
    paragraphs. Each paragraph contains a paragraph context and the corresponding
    question-answer pairs. Each q-a pair contains a question with a globally unique ID, as
    well as (several) answers. Each answer item contains the answer text itself and its
    starting position in the context. Note that the starting position is at the character
    level. In addition, for the test set, the answers field is not necessary.

    A typical case is shown as follows.

    {"version": "1.0",
     "data": [
        {"title": "...",
         "paragraphs": [
            {"context": "...",
             "qas": [
                {"question": "..."
                 "id": "..."
"answers": [ {"text": "...", "answer_start": ...} {...} ... ] } {...} ... ] } {...}, ... ] } {...} ... ] } """ def __init__(self, vocab_path, max_len, max_query_len, doc_stride, \ tokenizer='wordpiece', lang='en', seed=None, do_lower_case=False, \ remove_noanswer=True, phase='train'): """Create a new Reader for loading and processing machine reading comprehension task data. Args: vocab_path: the vocab file path to do tokenization and token_ids generation. max_len: the maximum length of the sequence (after word segmentation). The part exceeding max_len will be removed from right. max_query_len: the maximum length of query/question (after word segmentation). doc_stride: the slice stride of context window. tokenizer: string type. The name of the used tokenizer. A tokenizer is to convert raw text into tokens. Avaliable tokenizers: wordpiece. lang: the language of dataset. Supported language: en (English), cn (Chinese). Default is en (English). seed: int type. The random seed to shuffle dataset. Default is None, means no use of random seed. do_lower_case: bool type. Whether to do lowercase on English text. Default is False. This argument only works on English text. remove_noanswer: bool type. Whether to remove no answer question and invalid answer. phase: the running phase of this reader. Supported phase: train, predict. Default is train. Return: a Reader object for classification task. """ Reader.__init__(self, phase) assert lang.lower() in ['en', 'cn', 'english', 'chinese'], "supported language: en (English), cn (Chinese)." assert phase in ['train', 'predict'], "supported phase: train, predict." for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese' self._register.add('token_ids') if phase == 'train': self._register.add("start_positions") self._register.add("end_positions") else: self._register.add("unique_ids") self._is_training = phase == 'train' mrc_reader = MRCReader_t(vocab_path, max_seq_len=max_len, do_lower_case=do_lower_case, tokenizer=tokenizer, doc_stride=doc_stride, remove_noanswer=remove_noanswer, max_query_length=max_query_len, for_cn=for_cn, random_seed=seed) self._reader = mrc_reader self._phase = phase @property def outputs_attr(self): attrs = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], "start_positions": [[-1], 'int64'], "end_positions": [[-1], 'int64'], "task_ids": [[-1, -1], 'int64'], "unique_ids": [[-1], 'int64'] } return self._get_registed_attrs(attrs) @property def epoch_outputs_attr(self): if not self._is_training: return {"examples": None, "features": None} def load_data(self, input_file, batch_size, num_epochs=None, file_format='csv', shuffle_train=True): """Load mrc data into reader. Args: input_file: the dataset file path. File format should keep consistent with `file_format` argument. batch_size: number of examples for once yield. CAUSIOUS! If your environment exists multiple GPU devices (marked as dev_count), the batch_size should be divided by dev_count with no remainder! num_epochs: the travelsal times of input examples. Default is None, means once for single-task learning and automatically calculated for multi-task learning. This argument only works on train phase. file_format: the file format of input file. Supported format: tsv. Default is tsv. shuffle_train: whether to shuffle training dataset. Default is True. This argument only works on training phase. 
""" self._batch_size = batch_size self._num_epochs = num_epochs self._data_generator = self._reader.data_generator( \ input_file, batch_size, num_epochs if self._phase == 'train' else 1, \ shuffle=shuffle_train if self._phase == 'train' else False, \ phase=self._phase) def _iterator(self): names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids', 'input_mask', 'start_positions', 'end_positions', 'unique_ids'] if self._is_training: names.remove('unique_ids') for batch in self._data_generator(): outputs = {n: i for n,i in zip(names, batch)} ret = {} # TODO: move runtime shape check here for attr in self.outputs_attr.keys(): ret[attr] = outputs[attr] if not self._is_training: assert 'unique_ids' in ret, ret yield ret def get_epoch_outputs(self): return {'examples': self._reader.get_examples(self._phase), 'features': self._reader.get_features(self._phase)} @property def num_examples(self): return self._reader.get_num_examples(phase=self._phase) @property def num_epochs(self): return self._num_epochs ================================================ FILE: paddlepalm/reader/seq_label.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from paddlepalm.reader.base_reader import Reader from paddlepalm.reader.utils.reader4ernie import SequenceLabelReader as SLReader class SequenceLabelReader(Reader): """ The reader completes the loading and processing of sequence labeling type task (e.g, pos tagging, named entity recognition) dataset. Supported file format: tsv. """ def __init__(self, vocab_path, max_len, label_map_config, tokenizer='wordpiece', \ lang='en', seed=None, do_lower_case=False, phase='train', dev_count=1, print_prefix=''): """ Args: phase: train, eval, pred lang: en, ch, ... """ Reader.__init__(self, phase) assert lang.lower() in ['en', 'cn', 'english', 'chinese'], "supported language: en (English), cn (Chinese)." assert phase in ['train', 'predict'], "supported phase: train, predict." for_cn = lang.lower() == 'cn' or lang.lower() == 'chinese' self._register.add('token_ids') self._register.add('seq_lens') if phase == 'train': self._register.add('label_ids') self._is_training = phase == 'train' ner_reader = SLReader(vocab_path, max_seq_len=max_len, do_lower_case=do_lower_case, for_cn=for_cn, random_seed=seed, label_map_config=label_map_config) self._reader = ner_reader self._phase = phase self._dev_count = dev_count @property def outputs_attr(self): attrs = {"token_ids": [[-1, -1], 'int64'], "position_ids": [[-1, -1], 'int64'], "segment_ids": [[-1, -1], 'int64'], "task_ids": [[-1, -1], 'int64'], "input_mask": [[-1, -1, 1], 'float32'], "seq_lens": [[-1], 'int64'], "label_ids": [[-1, -1], 'int64']} return self._get_registed_attrs(attrs) def load_data(self, input_file, batch_size, num_epochs=None, \ file_format='tsv', shuffle_train=True): """Load sequence labeling data into reader. Args: input_file: the dataset file path. 
                The file format should be consistent with the `file_format` argument.
            batch_size: the number of examples per yield. CAUTION! If multiple GPU devices exist in your environment (the count marked as dev_count), batch_size must be divisible by dev_count with no remainder!
            num_epochs: the number of traversals over the input examples. Default is None, which means one epoch for single-task learning and an automatically calculated number for multi-task learning. This argument only works in the train phase.
            file_format: the file format of the input file. Supported format: tsv. Default is tsv.
            shuffle_train: whether to shuffle the training dataset. Default is True. This argument only works in the training phase.
        """

        self._batch_size = batch_size
        self._num_epochs = num_epochs
        self._data_generator = self._reader.data_generator( \
                input_file, batch_size, num_epochs if self._phase == 'train' else 1, \
                shuffle=shuffle_train if self._phase == 'train' else False, \
                phase=self._phase)

    def _iterator(self):
        # the name list must match the field order produced by
        # SLReader._pad_batch_records (see reader4ernie.py)
        names = ['token_ids', 'segment_ids', 'position_ids', 'task_ids',
                 'input_mask', 'label_ids', 'seq_lens']
        for batch in self._data_generator():
            outputs = {n: i for n, i in zip(names, batch)}
            ret = {}
            # TODO: move runtime shape check here
            for attr in self.outputs_attr.keys():
                ret[attr] = outputs[attr]
            yield ret

    def get_epoch_outputs(self):
        return {'examples': self._reader.get_examples(self._phase),
                'features': self._reader.get_features(self._phase)}

    @property
    def num_examples(self):
        return self._reader.get_num_examples(phase=self._phase)

    @property
    def num_epochs(self):
        return self._num_epochs

================================================
FILE: paddlepalm/reader/utils/__init__.py
================================================

================================================
FILE: paddlepalm/reader/utils/batching4bert.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
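# Summary of the masking scheme implemented in mask() below: every token draws a
# uniform probability in [0, 1). Tokens with prob > 0.15 stay untouched; prob in
# (0.03, 0.15] is replaced by [MASK] (80% of the selected 15%); prob in
# (0.015, 0.03] is replaced by a random vocab id (10% of the 15%); prob <= 0.015
# keeps the original token but is still added to the prediction targets (the
# remaining 10%) -- the standard BERT 80/10/10 masking strategy.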
"""Mask, padding and batching.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy as np def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3): """ Add mask for batch_tokens, return out, mask_label, mask_pos; Note: mask_pos responding the batch_tokens after padded; """ max_len = max([len(sent) for sent in batch_tokens]) mask_label = [] mask_pos = [] prob_mask = np.random.rand(total_token_num) # Note: the first token is [CLS], so [low=1] replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) pre_sent_len = 0 prob_index = 0 for sent_index, sent in enumerate(batch_tokens): mask_flag = False prob_index += pre_sent_len for token_index, token in enumerate(sent): prob = prob_mask[prob_index + token_index] if prob > 0.15: continue elif 0.03 < prob <= 0.15: # mask if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + token_index) elif 0.015 < prob <= 0.03: # random replace if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = replace_ids[prob_index + token_index] mask_flag = True mask_pos.append(sent_index * max_len + token_index) else: # keep the original token if token != SEP and token != CLS: mask_label.append(sent[token_index]) mask_pos.append(sent_index * max_len + token_index) pre_sent_len = len(sent) # ensure at least mask one word in a sentence while not mask_flag: token_index = int(np.random.randint(1, high=len(sent) - 1, size=1)) if sent[token_index] != SEP and sent[token_index] != CLS: mask_label.append(sent[token_index]) sent[token_index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + token_index) mask_label = np.array(mask_label).astype("int64").reshape([-1]) mask_pos = np.array(mask_pos).astype("int64").reshape([-1]) return batch_tokens, mask_label, mask_pos def prepare_batch_data(insts, total_token_num, max_len=None, voc_size=0, pad_id=None, cls_id=None, sep_id=None, mask_id=None, return_input_mask=True, return_max_len=True, return_num_token=False): """ 1. generate Tensor of data 2. generate Tensor of position 3. 
generate self attention mask, [shape: batch_size * max_len * max_len] """ batch_src_ids = [inst[0] for inst in insts] batch_sent_ids = [inst[1] for inst in insts] batch_pos_ids = [inst[2] for inst in insts] labels_list = [] # compatible with mrqa, whose example includes start/end positions, # or unique id for i in range(3, len(insts[0]), 1): labels = [inst[i] for inst in insts] labels = np.array(labels).astype("int64").reshape([-1]) labels_list.append(labels) # First step: do mask without padding if mask_id >= 0: out, mask_label, mask_pos = mask( batch_src_ids, total_token_num, vocab_size=voc_size, CLS=cls_id, SEP=sep_id, MASK=mask_id) else: out = batch_src_ids # Second step: padding src_id, self_input_mask = pad_batch_data( out, max_len=max_len, pad_idx=pad_id, return_input_mask=True) pos_id = pad_batch_data( batch_pos_ids, max_len=max_len, pad_idx=pad_id, return_pos=False, return_input_mask=False) sent_id = pad_batch_data( batch_sent_ids, max_len=max_len, pad_idx=pad_id, return_pos=False, return_input_mask=False) if mask_id >= 0: return_list = [ src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos ] + labels_list else: return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list return return_list if len(return_list) > 1 else return_list[0] def pad_batch_data(insts, max_len=None, pad_idx=0, return_pos=False, return_input_mask=False, return_max_len=False, return_num_token=False): """ Pad the instances to the max sequence length in batch, and generate the corresponding position data and input mask. """ return_list = [] if max_len is None: max_len = max(len(inst) for inst in insts) # Any token included in dict can be used to pad, since the paddings' loss # will be masked out by weights and make no effect on parameter gradients. inst_data = np.array([ list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts ]) return_list += [inst_data.astype("int64").reshape([-1, max_len])] # position data if return_pos: inst_pos = np.array([ list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts ]) return_list += [inst_pos.astype("int64").reshape([-1, max_len])] if return_input_mask: # This is used to avoid attention on paddings. input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) input_mask_data = np.expand_dims(input_mask_data, axis=-1) return_list += [input_mask_data.astype("float32")] if return_max_len: return_list += [max_len] if return_num_token: num_token = 0 for inst in insts: num_token += len(inst) return_list += [num_token] return return_list if len(return_list) > 1 else return_list[0] if __name__ == "__main__": pass ================================================ FILE: paddlepalm/reader/utils/batching4ernie.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
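# Unlike batching4bert, mask() here also supports ERNIE-style whole-word masking:
# when mask_word_tags[sent_index] is set, the contiguous sub-tokens of one word
# (tracked through seg_labels, where 1 appears to mark a continuation piece and
# -1 a special token) are masked, replaced, or kept together as a unit;
# otherwise it falls back to the per-token 80/10/10 scheme.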
"""Mask, padding and batching.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy as np from six.moves import xrange def mask(batch_tokens, seg_labels, mask_word_tags, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3): """ Add mask for batch_tokens, return out, mask_label, mask_pos; Note: mask_pos responding the batch_tokens after padded; """ max_len = max([len(sent) for sent in batch_tokens]) mask_label = [] mask_pos = [] prob_mask = np.random.rand(total_token_num) # Note: the first token is [CLS], so [low=1] replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) pre_sent_len = 0 prob_index = 0 for sent_index, sent in enumerate(batch_tokens): mask_flag = False mask_word = mask_word_tags[sent_index] prob_index += pre_sent_len if mask_word: beg = 0 for token_index, token in enumerate(sent): seg_label = seg_labels[sent_index][token_index] if seg_label == 1: continue if beg == 0: if seg_label != -1: beg = token_index continue prob = prob_mask[prob_index + beg] if prob > 0.15: pass else: for index in xrange(beg, token_index): prob = prob_mask[prob_index + index] base_prob = 1.0 if index == beg: base_prob = 0.15 if base_prob * 0.2 < prob <= base_prob: mask_label.append(sent[index]) sent[index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + index) elif base_prob * 0.1 < prob <= base_prob * 0.2: mask_label.append(sent[index]) sent[index] = replace_ids[prob_index + index] mask_flag = True mask_pos.append(sent_index * max_len + index) else: mask_label.append(sent[index]) mask_pos.append(sent_index * max_len + index) if seg_label == -1: beg = 0 else: beg = token_index else: for token_index, token in enumerate(sent): prob = prob_mask[prob_index + token_index] if prob > 0.15: continue elif 0.03 < prob <= 0.15: # mask if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + token_index) elif 0.015 < prob <= 0.03: # random replace if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = replace_ids[prob_index + token_index] mask_flag = True mask_pos.append(sent_index * max_len + token_index) else: # keep the original token if token != SEP and token != CLS: mask_label.append(sent[token_index]) mask_pos.append(sent_index * max_len + token_index) pre_sent_len = len(sent) mask_label = np.array(mask_label).astype("int64").reshape([-1]) mask_pos = np.array(mask_pos).astype("int64").reshape([-1]) return batch_tokens, mask_label, mask_pos def pad_batch_data(insts, pad_idx=0, return_pos=False, return_input_mask=False, return_max_len=False, return_num_token=False, return_seq_lens=False): """ Pad the instances to the max sequence length in batch, and generate the corresponding position data and attention bias. """ return_list = [] max_len = max(len(inst) for inst in insts) # Any token included in dict can be used to pad, since the paddings' loss # will be masked out by weights and make no effect on parameter gradients. inst_data = np.array( [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) return_list += [inst_data.astype("int64").reshape([-1, max_len])] # position data if return_pos: inst_pos = np.array([ list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts ]) return_list += [inst_pos.astype("int64").reshape([-1, max_len])] if return_input_mask: # This is used to avoid attention on paddings. 
input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) input_mask_data = np.expand_dims(input_mask_data, axis=-1) return_list += [input_mask_data.astype("float32")] if return_max_len: return_list += [max_len] if return_num_token: num_token = 0 for inst in insts: num_token += len(inst) return_list += [num_token] if return_seq_lens: seq_lens = np.array([len(inst) for inst in insts]) return_list += [seq_lens.astype("int64").reshape([-1])] return return_list if len(return_list) > 1 else return_list[0] if __name__ == "__main__": pass ================================================ FILE: paddlepalm/reader/utils/mlm_batching.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Mask, padding and batching.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy as np def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3, dev_count=1): """ Add mask for batch_tokens, return out, mask_label, mask_pos; Note: mask_pos responding the batch_tokens after padded; """ max_len = max([len(sent) for sent in batch_tokens]) multidev_batch_tokens = [] multidev_mask_label = [] multidev_mask_pos = [] big_batch_tokens = batch_tokens stride = len(batch_tokens) // dev_count if stride == 0: return None, None, None p = stride for i in range(dev_count): batch_tokens = big_batch_tokens[p-stride:p] p += stride mask_label = [] mask_pos = [] prob_mask = np.random.rand(total_token_num) # Note: the first token is [CLS], so [low=1] replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) pre_sent_len = 0 prob_index = 0 for sent_index, sent in enumerate(batch_tokens): mask_flag = False prob_index += pre_sent_len for token_index, token in enumerate(sent): prob = prob_mask[prob_index + token_index] if prob > 0.15: continue elif 0.03 < prob <= 0.15: # mask if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + token_index) elif 0.015 < prob <= 0.03: # random replace if token != SEP and token != CLS: mask_label.append(sent[token_index]) sent[token_index] = replace_ids[prob_index + token_index] mask_flag = True mask_pos.append(sent_index * max_len + token_index) else: # keep the original token if token != SEP and token != CLS: mask_label.append(sent[token_index]) mask_pos.append(sent_index * max_len + token_index) pre_sent_len = len(sent) # ensure at least mask one word in a sentence while not mask_flag: token_index = int(np.random.randint(1, high=len(sent) - 1, size=1)) if sent[token_index] != SEP and sent[token_index] != CLS: mask_label.append(sent[token_index]) sent[token_index] = MASK mask_flag = True mask_pos.append(sent_index * max_len + token_index) mask_label = np.array(mask_label).astype("int64").reshape([-1]) mask_pos = np.array(mask_pos).astype("int64").reshape([-1]) 
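            # Per-shard bookkeeping: token lists of all device shards are
            # concatenated into one flat list for joint padding, while mask
            # labels/positions are appended per shard, since each mask_pos is
            # computed as sent_index * max_len + token_index within its shard.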
            multidev_batch_tokens.extend(batch_tokens)
            multidev_mask_label.append(mask_label)
            multidev_mask_pos.append(mask_pos)

    return multidev_batch_tokens, multidev_mask_label, multidev_mask_pos


def prepare_batch_data(insts,
                       total_token_num,
                       max_len=None,
                       voc_size=0,
                       pad_id=None,
                       cls_id=None,
                       sep_id=None,
                       mask_id=None,
                       task_id=0,
                       return_input_mask=True,
                       return_max_len=True,
                       return_num_token=False,
                       dev_count=1):
    """
    1. generate Tensor of data
    2. generate Tensor of position
    3. generate self attention mask, [shape: batch_size * max_len * max_len]
    """
    batch_src_ids = [inst[0] for inst in insts]
    batch_sent_ids = [inst[1] for inst in insts]
    batch_pos_ids = [inst[2] for inst in insts]

    # TODO: should the order of these two steps be reversed? Otherwise the word
    # embeddings unfolded in the task layer are based on the padded batch, and
    # the word indices no longer match those of the unpadded sequences.

    # First step: do mask without padding
    out, mask_label, mask_pos = mask(
        batch_src_ids,
        total_token_num,
        vocab_size=voc_size,
        CLS=cls_id,
        SEP=sep_id,
        MASK=mask_id,
        dev_count=dev_count)

    # Second step: padding
    src_id, self_input_mask = pad_batch_data(
        out, max_len=max_len, pad_idx=pad_id, return_input_mask=True)
    pos_id = pad_batch_data(
        batch_pos_ids,
        max_len=max_len,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)
    sent_id = pad_batch_data(
        batch_sent_ids,
        max_len=max_len,
        pad_idx=pad_id,
        return_pos=False,
        return_input_mask=False)

    task_ids = np.ones_like(src_id, dtype="int64") * task_id
    return_list = [
        src_id, pos_id, sent_id, self_input_mask, task_ids, mask_label, mask_pos
    ]

    return return_list


def pad_batch_data(insts,
                   max_len=None,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    if max_len is None:
        max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.
    inst_data = np.array([
        list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts
    ])
    return_list += [inst_data.astype("int64").reshape([-1, max_len])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])
        return_list += [inst_pos.astype("int64").reshape([-1, max_len])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array(
            [[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    return return_list if len(return_list) > 1 else return_list[0]


if __name__ == "__main__":
    pass

================================================
FILE: paddlepalm/reader/utils/mrqa_helper.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# printable_text is provided by the tokenizer module; it is imported here so
# that MRQAExample.__repr__ can use it.
import paddlepalm.tokenizer.ernie_tokenizer as tokenization


class MRQAExample(object):
    """A single training/test example for simple sequence classification.

    For examples without an answer, the start and end position are -1.
    """

    def __init__(self,
                 qas_id,
                 question_text,
                 doc_tokens,
                 orig_answer_text=None,
                 start_position=None,
                 end_position=None,
                 is_impossible=False):
        self.qas_id = qas_id
        self.question_text = question_text
        self.doc_tokens = doc_tokens
        self.orig_answer_text = orig_answer_text
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
        s += ", question_text: %s" % (
            tokenization.printable_text(self.question_text))
        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
        if self.start_position:
            s += ", start_position: %d" % (self.start_position)
        if self.end_position:
            s += ", end_position: %d" % (self.end_position)
        if self.is_impossible:
            s += ", is_impossible: %r" % (self.is_impossible)
        return s


class MRQAFeature(object):
    """A single set of features of data."""

    def __init__(self,
                 unique_id,
                 example_index,
                 doc_span_index,
                 tokens,
                 token_to_orig_map,
                 token_is_max_context,
                 input_ids,
                 input_mask,
                 segment_ids,
                 start_position=None,
                 end_position=None,
                 is_impossible=None):
        self.unique_id = unique_id
        self.example_index = example_index
        self.doc_span_index = doc_span_index
        self.tokens = tokens
        self.token_to_orig_map = token_to_orig_map
        self.token_is_max_context = token_is_max_context
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

================================================
FILE: paddlepalm/reader/utils/reader4ernie.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
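# Module overview: this file implements the concrete example-to-batch pipeline
# behind the high-level readers in paddlepalm.reader. The base Reader tokenizes
# tsv rows into Record namedtuples, _prepare_batch_data groups and pads them
# into batches, and data_generator wraps everything into an epoch-aware python
# generator that the trainer consumes; the subclasses below specialize record
# conversion and padding per task type.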
from __future__ import absolute_import from __future__ import division from __future__ import print_function from __future__ import unicode_literals from __future__ import absolute_import import sys import os import json import random import logging import numpy as np import six from io import open from collections import namedtuple import paddlepalm as palm import paddlepalm.tokenizer.ernie_tokenizer as tokenization from paddlepalm.reader.utils.batching4ernie import pad_batch_data from paddlepalm.reader.utils.mlm_batching import prepare_batch_data log = logging.getLogger(__name__) if six.PY3 and hasattr(sys.stdout, 'buffer'): import io sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8') sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8') if sys.version[0] == '2': reload(sys) sys.setdefaultencoding('utf-8') else: import importlib importlib.reload(sys) def csv_reader(fd, delimiter='\t'): def gen(): for i in fd: yield i.rstrip('\n').split(delimiter) return gen() class Reader(object): def __init__(self, vocab_path, label_map_config=None, max_seq_len=512, do_lower_case=True, in_tokens=False, is_inference=False, learning_strategy='pointwise', random_seed=None, tokenizer="FullTokenizer", phase='train', is_classify=True, is_regression=False, for_cn=True, task_id=0): assert phase in ['train', 'predict'], "supported phase: train, predict." self.max_seq_len = max_seq_len self.tokenizer = tokenization.FullTokenizer( vocab_file=vocab_path, do_lower_case=do_lower_case) self.vocab = self.tokenizer.vocab self.pad_id = self.vocab["[PAD]"] self.cls_id = self.vocab["[CLS]"] self.sep_id = self.vocab["[SEP]"] self.mask_id = self.vocab["[MASK]"] self.in_tokens = in_tokens self.phase = phase self.is_inference = is_inference self.learning_strategy = learning_strategy self.for_cn = for_cn self.task_id = task_id np.random.seed(random_seed) self.is_classify = is_classify self.is_regression = is_regression self.current_example = 0 self.current_epoch = 0 self.num_examples = 0 self.examples = {} if label_map_config: with open(label_map_config, encoding='utf8') as f: self.label_map = json.load(f) else: self.label_map = None def get_train_progress(self): """Gets progress for training phase.""" return self.current_example, self.current_epoch def _read_tsv(self, input_file, quotechar=None): """Reads a tab separated value file.""" with open(input_file, 'r', encoding='utf8') as f: reader = csv_reader(f) headers = next(reader) Example = namedtuple('Example', headers) examples = [] for line in reader: example = Example(*line) examples.append(example) return examples def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): """Truncates a sequence pair in place to the maximum length.""" # This is a simple heuristic which will always truncate the longer sequence # one token at a time. This makes more sense than truncating an equal percent # of tokens from each, since if one sequence is very short then each token # that's truncated likely contains more information than a longer sequence. 
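        # Worked example: with max_length=5, tokens_a=[t1, t2, t3, t4] and
        # tokens_b=[u1, u2, u3], the loop pops t4 first (a is longer), then u3
        # (on a tie the pop goes to b), leaving 3 + 2 = 5 tokens in total.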
while True: total_length = len(tokens_a) + len(tokens_b) if total_length <= max_length: break if len(tokens_a) > len(tokens_b): tokens_a.pop() else: tokens_b.pop() def _convert_example_to_record(self, example, max_seq_length, tokenizer): """Converts a single `Example` into a single `Record`.""" text_a = tokenization.convert_to_unicode(example.text_a) tokens_a = tokenizer.tokenize(text_a) tokens_b = None has_text_b = False has_text_b_neg = False if isinstance(example, dict): has_text_b = "text_b" in example.keys() has_text_b_neg = "text_b_neg" in example.keys() else: has_text_b = "text_b" in example._fields has_text_b_neg = "text_b_neg" in example._fields if has_text_b: text_b = tokenization.convert_to_unicode(example.text_b) tokens_b = tokenizer.tokenize(text_b) # Modifies `tokens_a` and `tokens_b` in place so that the total # length is less than the specified length. # Account for [CLS], [SEP], [SEP] with "- 3" self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) if has_text_b_neg and self.phase == 'train': tokens_a_neg = tokenizer.tokenize(text_a) text_b_neg = tokenization.convert_to_unicode(example.text_b_neg) tokens_b_neg = tokenizer.tokenize(text_b_neg) self._truncate_seq_pair(tokens_a_neg, tokens_b_neg, max_seq_length - 3) else: # Account for [CLS] and [SEP] with "- 2" if len(tokens_a) > max_seq_length - 2: tokens_a = tokens_a[0:(max_seq_length - 2)] # The convention in BERT/ERNIE is: # (a) For sequence pairs: # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 # (b) For single sequences: # tokens: [CLS] the dog is hairy . [SEP] # type_ids: 0 0 0 0 0 0 0 # # Where "type_ids" are used to indicate whether this is the first # sequence or the second sequence. The embedding vectors for `type=0` and # `type=1` were learned during pre-training and are added to the wordpiece # embedding vector (and position vector). This is not *strictly* necessary # since the [SEP] token unambiguously separates the sequences, but it makes # it easier for the model to learn the concept of sequences. # # For classification tasks, the first vector (corresponding to [CLS]) is # used as as the "sentence vector". Note that this only makes sense because # the entire model is fine-tuned. 
        tokens = []
        text_type_ids = []
        tokens.append("[CLS]")
        text_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            text_type_ids.append(0)
        tokens.append("[SEP]")
        text_type_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                text_type_ids.append(1)
            tokens.append("[SEP]")
            text_type_ids.append(1)

        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))

        if has_text_b_neg and self.phase == 'train':
            tokens_neg = []
            text_type_ids_neg = []
            tokens_neg.append("[CLS]")
            text_type_ids_neg.append(0)
            for token in tokens_a_neg:
                tokens_neg.append(token)
                text_type_ids_neg.append(0)
            tokens_neg.append("[SEP]")
            text_type_ids_neg.append(0)
            if tokens_b_neg:
                for token in tokens_b_neg:
                    tokens_neg.append(token)
                    text_type_ids_neg.append(1)
                tokens_neg.append("[SEP]")
                text_type_ids_neg.append(1)

            token_ids_neg = tokenizer.convert_tokens_to_ids(tokens_neg)
            position_ids_neg = list(range(len(token_ids_neg)))

        if self.is_inference:
            Record = namedtuple('Record',
                                ['token_ids', 'text_type_ids', 'position_ids'])
            record = Record(
                token_ids=token_ids,
                text_type_ids=text_type_ids,
                position_ids=position_ids)
        else:
            qid = None
            if "qid" in example._fields:
                qid = example.qid
            if self.learning_strategy == 'pairwise' and self.phase == 'train':
                Record = namedtuple('Record',
                                    ['token_ids', 'text_type_ids', 'position_ids',
                                     'token_ids_neg', 'text_type_ids_neg',
                                     'position_ids_neg', 'qid'])
                record = Record(
                    token_ids=token_ids,
                    text_type_ids=text_type_ids,
                    position_ids=position_ids,
                    token_ids_neg=token_ids_neg,
                    text_type_ids_neg=text_type_ids_neg,
                    position_ids_neg=position_ids_neg,
                    qid=qid)
            else:
                if self.label_map:
                    label_id = self.label_map[example.label]
                else:
                    label_id = example.label
                Record = namedtuple('Record', [
                    'token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'
                ])
                record = Record(
                    token_ids=token_ids,
                    text_type_ids=text_type_ids,
                    position_ids=position_ids,
                    label_id=label_id,
                    qid=qid)
        return record

    def _prepare_batch_data(self, examples, batch_size, phase='train'):
        """generate batch records"""
        batch_records, max_len = [], 0
        if len(examples) < batch_size:
            raise Exception('CLS dataset contains too few samples. Expect more than ' + str(batch_size))

        for index, example in enumerate(examples):
            if phase == "train":
                self.current_example = index
            record = self._convert_example_to_record(example, self.max_seq_len,
                                                     self.tokenizer)
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                batch_pad_records = self._pad_batch_records(batch_records)
                ds = ['s'] * len(batch_pad_records)
                for piece in palm.distribute.yield_pieces(batch_pad_records, ds, batch_size):
                    yield piece
                batch_records, max_len = [record], len(record.token_ids)

        if phase == 'predict' and batch_records:
            # pad the final (possibly incomplete) batch; ds is rebuilt here since
            # the else-branch above may never run when the dataset fits one batch
            batch_pad_records = self._pad_batch_records(batch_records)
            ds = ['s'] * len(batch_pad_records)
            for piece in palm.distribute.yield_pieces(batch_pad_records, ds, batch_size):
                yield piece

    def get_num_examples(self, input_file=None, phase='train'):
        if input_file is None:
            return len(self.examples.get(phase, []))
        else:
            # assert input_file is not None, "Argument input_file should be given or the data_generator should be created when this func is called."
examples = self._read_tsv(input_file) return len(examples) def data_generator(self, input_file, batch_size, epoch, dev_count=1, shuffle=True, phase=None): examples = self._read_tsv(input_file) if phase is None: phase = 'all' self.examples[phase] = examples def wrapper(): all_dev_batches = [] if epoch is None: num_epochs = 99999999 else: num_epochs = epoch for epoch_index in range(num_epochs): if phase == "train": self.current_example = 0 self.current_epoch = epoch_index if shuffle: np.random.shuffle(examples) for batch_data in self._prepare_batch_data( examples, batch_size, phase=phase): if len(all_dev_batches) < dev_count: all_dev_batches.append(batch_data) if len(all_dev_batches) == dev_count: for batch in all_dev_batches: yield batch all_dev_batches = [] def f(): for i in wrapper(): yield i return f # return wrapper class MaskLMReader(Reader): def _convert_example_to_record(self, example, max_seq_length, tokenizer): """Converts a single `Example` into a single `Record`.""" text_a = tokenization.convert_to_unicode(example.text_a) tokens_a = tokenizer.tokenize(text_a) tokens_b = None has_text_b = False if isinstance(example, dict): has_text_b = "text_b" in example.keys() else: has_text_b = "text_b" in example._fields if has_text_b: text_b = tokenization.convert_to_unicode(example.text_b) tokens_b = tokenizer.tokenize(text_b) if tokens_b: # Modifies `tokens_a` and `tokens_b` in place so that the total # length is less than the specified length. # Account for [CLS], [SEP], [SEP] with "- 3" self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) else: # Account for [CLS] and [SEP] with "- 2" if len(tokens_a) > max_seq_length - 2: tokens_a = tokens_a[0:(max_seq_length - 2)] # The convention in BERT/ERNIE is: # (a) For sequence pairs: # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 # (b) For single sequences: # tokens: [CLS] the dog is hairy . [SEP] # type_ids: 0 0 0 0 0 0 0 # # Where "type_ids" are used to indicate whether this is the first # sequence or the second sequence. The embedding vectors for `type=0` and # `type=1` were learned during pre-training and are added to the wordpiece # embedding vector (and position vector). This is not *strictly* necessary # since the [SEP] token unambiguously separates the sequences, but it makes # it easier for the model to learn the concept of sequences. # # For classification tasks, the first vector (corresponding to [CLS]) is # used as as the "sentence vector". Note that this only makes sense because # the entire model is fine-tuned. tokens = [] text_type_ids = [] tokens.append("[CLS]") text_type_ids.append(0) for token in tokens_a: tokens.append(token) text_type_ids.append(0) tokens.append("[SEP]") text_type_ids.append(0) if tokens_b: for token in tokens_b: tokens.append(token) text_type_ids.append(1) tokens.append("[SEP]") text_type_ids.append(1) token_ids = tokenizer.convert_tokens_to_ids(tokens) position_ids = list(range(len(token_ids))) return [token_ids, text_type_ids, position_ids] def batch_reader(self, examples, batch_size, in_tokens, phase): batch = [] total_token_num = 0 if len(examples) < batch_size: raise Exception('MaskLM dataset contains too few samples. 
Expect more than '+str(batch_size))

        for e in examples:
            parsed_line = self._convert_example_to_record(e, self.max_seq_len,
                                                          self.tokenizer)
            to_append = len(batch) < batch_size
            if to_append:
                batch.append(parsed_line)
                total_token_num += len(parsed_line[0])
            else:
                yield batch, total_token_num
                batch = [parsed_line]
                total_token_num = len(parsed_line[0])

        if len(batch) > 0 and phase == 'predict':
            yield batch, total_token_num

    def data_generator(self,
                       input_file,
                       batch_size,
                       epoch,
                       dev_count=1,
                       shuffle=True,
                       phase=None):
        examples = self._read_tsv(input_file)
        if phase is None:
            phase = 'all'
        self.examples[phase] = examples

        def wrapper():
            all_dev_batches = []
            if epoch is None:
                num_epochs = 99999999
            else:
                num_epochs = epoch
            for epoch_index in range(num_epochs):
                if phase == "train":
                    self.current_example = 0
                    self.current_epoch = epoch_index
                if shuffle:
                    np.random.shuffle(examples)

                all_dev_batches = []
                for batch_data, num_tokens in self.batch_reader(
                        examples, batch_size, self.in_tokens, phase=phase):
                    batch_data = prepare_batch_data(
                        batch_data,
                        num_tokens,
                        voc_size=len(self.vocab),
                        pad_id=self.pad_id,
                        cls_id=self.cls_id,
                        sep_id=self.sep_id,
                        mask_id=self.mask_id,
                        # max_len=self.max_seq_len,  # note: padding to the global max
                        # length would make mask_pos disagree with the actual token
                        # positions, because mask_pos is computed from the max length
                        # within the batch.
                        return_input_mask=True,
                        return_max_len=False,
                        return_num_token=False,
                        dev_count=dev_count)

                    # yield batch
                    for piece in palm.distribute.yield_pieces(
                            batch_data, ['s', 's', 's', 's', 's', 'u', 'u'], batch_size):
                        yield piece
                    # # ds = ['s'] * len(batch_data)
                    # for piece in palm.distribute.yield_pieces(batch_data, ['s'] * 7, batch_size):
                    #     yield piece

        return wrapper


class ClassifyReader(Reader):

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, 'r', encoding='utf8') as f:
            reader = csv_reader(f)
            headers = next(reader)
            text_indices = [
                index for index, h in enumerate(headers) if h != "label"
            ]
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
                        if self.for_cn:
                            line[index] = text.replace(' ', '')
                        else:
                            line[index] = text
                example = Example(*line)
                examples.append(example)
            return examples

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        if self.phase == 'train' and self.learning_strategy == 'pairwise':
            batch_token_ids_neg = [record.token_ids_neg for record in batch_records]
            batch_text_type_ids_neg = [record.text_type_ids_neg for record in batch_records]
            batch_position_ids_neg = [record.position_ids_neg for record in batch_records]

        if not self.is_inference:
            if not self.learning_strategy == 'pairwise':
                batch_labels = [record.label_id for record in batch_records]
                if self.is_classify:
                    batch_labels = np.array(batch_labels).astype("int64").reshape([-1])
                elif self.is_regression:
                    batch_labels = np.array(batch_labels).astype("float32").reshape([-1])

            if batch_records[0].qid:
                batch_qids = [record.qid for record in batch_records]
                batch_qids = np.array(batch_qids).astype("int64").reshape([-1])
            else:
                batch_qids = np.array([]).astype("int64").reshape([-1])

        # padding
        padded_token_ids, input_mask = pad_batch_data(
            batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
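        # task_ids broadcasts one constant id over the whole [batch, max_len]
        # grid; it selects an entry of the backbone's task embedding table
        # (an ERNIE-style input), so single-task fine-tuning simply uses id 0.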
        padded_task_ids = np.ones_like(
            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            padded_task_ids, input_mask
        ]
        if self.phase == 'train':
            if self.learning_strategy == 'pairwise':
                padded_token_ids_neg, input_mask_neg = pad_batch_data(
                    batch_token_ids_neg, pad_idx=self.pad_id, return_input_mask=True)
                padded_text_type_ids_neg = pad_batch_data(
                    batch_text_type_ids_neg, pad_idx=self.pad_id)
                padded_position_ids_neg = pad_batch_data(
                    batch_position_ids_neg, pad_idx=self.pad_id)
                padded_task_ids_neg = np.ones_like(
                    padded_token_ids_neg, dtype="int64") * self.task_id

                return_list += [padded_token_ids_neg, padded_text_type_ids_neg, \
                                padded_position_ids_neg, padded_task_ids_neg, input_mask_neg]

            elif self.learning_strategy == 'pointwise':
                return_list += [batch_labels]

        return return_list


class SequenceLabelReader(Reader):

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        batch_label_ids = [record.label_ids for record in batch_records]

        # padding
        padded_token_ids, input_mask, batch_seq_lens = pad_batch_data(
            batch_token_ids,
            pad_idx=self.pad_id,
            return_input_mask=True,
            return_seq_lens=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
        padded_label_ids = pad_batch_data(
            batch_label_ids, pad_idx=len(self.label_map) - 1)
        padded_task_ids = np.ones_like(
            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            padded_task_ids, input_mask, padded_label_ids, batch_seq_lens
        ]
        return return_list

    def _reseg_token_label(self, tokens, labels, tokenizer):
        assert len(tokens) == len(labels)
        ret_tokens = []
        ret_labels = []
        for token, label in zip(tokens, labels):
            sub_token = tokenizer.tokenize(token)
            if len(sub_token) == 0:
                continue
            ret_tokens.extend(sub_token)
            if len(sub_token) == 1:
                ret_labels.append(label)
                continue
            ret_labels.extend([label] * len(sub_token))

        assert len(ret_tokens) == len(ret_labels)
        return ret_tokens, ret_labels

    def _convert_example_to_record(self, example, max_seq_length, tokenizer):
        # tokens and labels are separated by the non-printing control character
        # \2 in ERNIE-style sequence labeling data
        tokens = tokenization.convert_to_unicode(example.text_a).split(u"\2")
        labels = tokenization.convert_to_unicode(example.label).split(u"\2")
        tokens, labels = self._reseg_token_label(tokens, labels, tokenizer)

        if len(tokens) > max_seq_length - 2:
            tokens = tokens[0:(max_seq_length - 2)]
            labels = labels[0:(max_seq_length - 2)]

        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))
        text_type_ids = [0] * len(token_ids)
        no_entity_id = len(self.label_map) - 1
        labels = [
            label if label in self.label_map else u"O" for label in labels
        ]
        label_ids = [no_entity_id] + [
            self.label_map[label] for label in labels
        ] + [no_entity_id]

        Record = namedtuple(
            'Record',
            ['token_ids', 'text_type_ids', 'position_ids', 'label_ids'])
        record = Record(
            token_ids=token_ids,
            text_type_ids=text_type_ids,
            position_ids=position_ids,
            label_ids=label_ids)
        return record


class ExtractEmbeddingReader(Reader):

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for
record in batch_records] # padding padded_token_ids, input_mask, seq_lens = pad_batch_data( batch_token_ids, pad_idx=self.pad_id, return_input_mask=True, return_seq_lens=True) padded_text_type_ids = pad_batch_data( batch_text_type_ids, pad_idx=self.pad_id) padded_position_ids = pad_batch_data( batch_position_ids, pad_idx=self.pad_id) padded_task_ids = np.ones_like( padded_token_ids, dtype="int64") * self.task_id return_list = [ padded_token_ids, padded_text_type_ids, padded_position_ids, padded_task_ids, input_mask, seq_lens ] return return_list class MRCReader(Reader): def __init__(self, vocab_path, label_map_config=None, max_seq_len=512, do_lower_case=True, in_tokens=False, random_seed=None, tokenizer="FullTokenizer", is_classify=True, is_regression=False, for_cn=True, task_id=0, doc_stride=128, max_query_length=64, remove_noanswer=True): self.max_seq_len = max_seq_len self.tokenizer = tokenization.FullTokenizer( vocab_file=vocab_path, do_lower_case=do_lower_case) self.vocab = self.tokenizer.vocab self.pad_id = self.vocab["[PAD]"] self.cls_id = self.vocab["[CLS]"] self.sep_id = self.vocab["[SEP]"] self.in_tokens = in_tokens self.for_cn = for_cn self.task_id = task_id self.doc_stride = doc_stride self.max_query_length = max_query_length self.examples = {} self.features = {} self.remove_noanswer = remove_noanswer if random_seed is not None: np.random.seed(random_seed) self.current_example = 0 self.current_epoch = 0 self.num_examples = 0 self.Example = namedtuple('Example', ['qas_id', 'question_text', 'doc_tokens', 'orig_answer_text', 'start_position', 'end_position']) self.Feature = namedtuple("Feature", ["unique_id", "example_index", "doc_span_index", "tokens", "token_to_orig_map", "token_is_max_context", "token_ids", "position_ids", "text_type_ids", "start_position", "end_position"]) self.DocSpan = namedtuple("DocSpan", ["start", "length"]) def _read_json(self, input_file, is_training): examples = [] with open(input_file, "r", encoding='utf-8') as f: # f = f.read().decode(encoding='gbk').encode(encoding='utf-8') input_data = json.load(f)["data"] for entry in input_data: for paragraph in entry["paragraphs"]: paragraph_text = paragraph["context"] for qa in paragraph["qas"]: qas_id = qa["id"] question_text = qa["question"] start_pos = None end_pos = None orig_answer_text = None if is_training: if len(qa["answers"]) != 1: raise ValueError( "For training, each question should have exactly 1 answer." ) answer = qa["answers"][0] orig_answer_text = answer["text"] answer_offset = answer["answer_start"] answer_length = len(orig_answer_text) doc_tokens = [ paragraph_text[:answer_offset], paragraph_text[answer_offset:answer_offset + answer_length], paragraph_text[answer_offset + answer_length:] ] start_pos = 1 end_pos = 1 actual_text = " ".join(doc_tokens[start_pos:(end_pos + 1)]) if actual_text.find(orig_answer_text) == -1: log.info("Could not find answer: '%s' vs. 
'%s'", actual_text, orig_answer_text) continue else: doc_tokens = tokenization.tokenize_chinese_chars( paragraph_text) example = self.Example( qas_id=qas_id, question_text=question_text, doc_tokens=doc_tokens, orig_answer_text=orig_answer_text, start_position=start_pos, end_position=end_pos) examples.append(example) return examples def _improve_answer_span(self, doc_tokens, input_start, input_end, tokenizer, orig_answer_text): tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text)) for new_start in range(input_start, input_end + 1): for new_end in range(input_end, new_start - 1, -1): text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) if text_span == tok_answer_text: return (new_start, new_end) return (input_start, input_end) def _check_is_max_context(self, doc_spans, cur_span_index, position): best_score = None best_span_index = None for (span_index, doc_span) in enumerate(doc_spans): end = doc_span.start + doc_span.length - 1 if position < doc_span.start: continue if position > end: continue num_left_context = position - doc_span.start num_right_context = end - position score = min(num_left_context, num_right_context) + 0.01 * doc_span.length if best_score is None or score > best_score: best_score = score best_span_index = span_index return cur_span_index == best_span_index def _convert_example_to_feature(self, examples, max_seq_length, tokenizer, is_training, remove_noanswer=True): features = [] unique_id = 1000000000 print('converting examples to features...') for (example_index, example) in enumerate(examples): if example_index % 1000 == 0: print('processing {}th example...'.format(example_index)) query_tokens = tokenizer.tokenize(example.question_text) if len(query_tokens) > self.max_query_length: query_tokens = query_tokens[0:self.max_query_length] tok_to_orig_index = [] orig_to_tok_index = [] all_doc_tokens = [] for (i, token) in enumerate(example.doc_tokens): orig_to_tok_index.append(len(all_doc_tokens)) sub_tokens = tokenizer.tokenize(token) for sub_token in sub_tokens: tok_to_orig_index.append(i) all_doc_tokens.append(sub_token) tok_start_position = None tok_end_position = None if is_training: tok_start_position = orig_to_tok_index[example.start_position] if example.end_position < len(example.doc_tokens) - 1: tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 else: tok_end_position = len(all_doc_tokens) - 1 (tok_start_position, tok_end_position) = self._improve_answer_span( all_doc_tokens, tok_start_position, tok_end_position, tokenizer, example.orig_answer_text) max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 doc_spans = [] start_offset = 0 while start_offset < len(all_doc_tokens): length = len(all_doc_tokens) - start_offset if length > max_tokens_for_doc: length = max_tokens_for_doc doc_spans.append(self.DocSpan(start=start_offset, length=length)) if start_offset + length == len(all_doc_tokens): break start_offset += min(length, self.doc_stride) for (doc_span_index, doc_span) in enumerate(doc_spans): tokens = [] token_to_orig_map = {} token_is_max_context = {} text_type_ids = [] tokens.append("[CLS]") text_type_ids.append(0) for token in query_tokens: tokens.append(token) text_type_ids.append(0) tokens.append("[SEP]") text_type_ids.append(0) for i in range(doc_span.length): split_token_index = doc_span.start + i token_to_orig_map[len(tokens)] = tok_to_orig_index[ split_token_index] is_max_context = self._check_is_max_context( doc_spans, doc_span_index, split_token_index) token_is_max_context[len(tokens)] = is_max_context 
                    tokens.append(all_doc_tokens[split_token_index])
                    text_type_ids.append(1)
                tokens.append("[SEP]")
                text_type_ids.append(1)

                token_ids = tokenizer.convert_tokens_to_ids(tokens)
                position_ids = list(range(len(token_ids)))
                start_position = None
                end_position = None
                if is_training:
                    doc_start = doc_span.start
                    doc_end = doc_span.start + doc_span.length - 1
                    out_of_span = False
                    if not (tok_start_position >= doc_start and
                            tok_end_position <= doc_end):
                        out_of_span = True
                    if out_of_span:
                        start_position = 0
                        end_position = 0
                        # optionally drop spans that do not contain the answer
                        if remove_noanswer:
                            continue
                    else:
                        doc_offset = len(query_tokens) + 2
                        start_position = tok_start_position - doc_start + doc_offset
                        end_position = tok_end_position - doc_start + doc_offset

                feature = self.Feature(
                    unique_id=unique_id,
                    example_index=example_index,
                    doc_span_index=doc_span_index,
                    tokens=tokens,
                    token_to_orig_map=token_to_orig_map,
                    token_is_max_context=token_is_max_context,
                    token_ids=token_ids,
                    position_ids=position_ids,
                    text_type_ids=text_type_ids,
                    start_position=start_position,
                    end_position=end_position)
                features.append(feature)
                unique_id += 1

        return features

    def _prepare_batch_data(self, records, batch_size, phase=None):
        """Generate batch records."""
        batch_records, max_len = [], 0
        if len(records) < batch_size:
            raise Exception('MRC dataset contains too few samples. Expected more than ' + str(batch_size))
        # slot descriptor consumed by palm.distribute.yield_pieces; defined
        # up front so that it is also available when flushing the final
        # (partial) batch after the loop, in the predict phase
        ds = ['s'] * 8
        for index, record in enumerate(records):
            if phase == "train":
                self.current_example = index
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                for piece in palm.distribute.yield_pieces(
                        self._pad_batch_records(batch_records, phase == 'train'),
                        ds, batch_size):
                    yield piece
                batch_records, max_len = [record], len(record.token_ids)

        if phase == 'predict' and batch_records:
            for piece in palm.distribute.yield_pieces(
                    self._pad_batch_records(batch_records, phase == 'train'),
                    ds, batch_size):
                yield piece

    def _pad_batch_records(self, batch_records, is_training):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]

        if is_training:
            batch_start_position = [
                record.start_position for record in batch_records
            ]
            batch_end_position = [
                record.end_position for record in batch_records
            ]
            batch_start_position = np.array(batch_start_position).astype(
                "int64").reshape([-1])
            batch_end_position = np.array(batch_end_position).astype(
                "int64").reshape([-1])
        else:
            batch_size = len(batch_token_ids)
            batch_start_position = np.zeros(shape=[batch_size], dtype="int64")
            batch_end_position = np.zeros(shape=[batch_size], dtype="int64")

        batch_unique_ids = [record.unique_id for record in batch_records]
        batch_unique_ids = np.array(batch_unique_ids).astype("int64").reshape(
            [-1])

        # padding
        padded_token_ids, input_mask = pad_batch_data(
            batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
        padded_task_ids = np.ones_like(
            padded_token_ids, dtype="int64") * self.task_id

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            padded_task_ids, input_mask, batch_start_position,
            batch_end_position, batch_unique_ids
        ]

        return return_list
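    # A minimal usage sketch of this reader (the vocab and data paths are
    # illustrative placeholders):
    #
    #   reader = MRCReader('vocab.txt', max_seq_len=384, doc_stride=128,
    #                      max_query_length=64)
    #   train_gen = reader.data_generator('train.json', batch_size=8,
    #                                     epoch=2, phase='train')
    #   for batch in train_gen():
    #       # each batch follows the 8-slot layout built by
    #       # _pad_batch_records: [token_ids, text_type_ids, position_ids,
    #       #  task_ids, input_mask, start_positions, end_positions,
    #       #  unique_ids]
    #       ...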
def get_num_examples(self, phase): return len(self.features[phase]) def get_features(self, phase): return self.features[phase] def get_examples(self, phase): return self.examples[phase] def data_generator(self, input_file, batch_size, epoch, dev_count=1, shuffle=True, phase=None): examples = self.examples.get(phase, None) features = self.features.get(phase, None) if not examples: examples = self._read_json(input_file, phase == "train") features = self._convert_example_to_feature( examples, self.max_seq_len, self.tokenizer, phase == "train", remove_noanswer=self.remove_noanswer) self.examples[phase] = examples self.features[phase] = features def wrapper(): all_dev_batches = [] if epoch is None: num_epochs = 99999999 else: num_epochs = epoch for epoch_index in range(num_epochs): if phase == "train": self.current_example = 0 self.current_epoch = epoch_index if phase == "train" and shuffle: np.random.shuffle(features) for batch_data in self._prepare_batch_data( features, batch_size, phase=phase): yield batch_data return wrapper if __name__ == '__main__': pass ================================================ FILE: paddlepalm/tokenizer/__init__.py ================================================ ================================================ FILE: paddlepalm/tokenizer/bert_tokenizer.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Tokenization classes.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import unicodedata import six def convert_to_unicode(text): """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text.decode("utf-8", "ignore") elif isinstance(text, unicode): return text else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def printable_text(text): """Returns text encoded in a way suitable for print or `tf.logging`.""" # These functions want `str` for both Python2 and Python3, but in one case # it's a Unicode string and in the other it's a byte string. 
if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text elif isinstance(text, unicode): return text.encode("utf-8") else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def load_vocab(vocab_file): """Loads a vocabulary file into a dictionary.""" vocab = collections.OrderedDict() fin = open(vocab_file) for num, line in enumerate(fin): items = convert_to_unicode(line.strip()).split("\t") if len(items) > 2: break token = items[0] index = items[1] if len(items) == 2 else num token = token.strip() vocab[token] = int(index) return vocab def convert_by_vocab(vocab, items): """Converts a sequence of [tokens|ids] using the vocab.""" output = [] for item in items: output.append(vocab[item]) return output def convert_tokens_to_ids(vocab, tokens): return convert_by_vocab(vocab, tokens) def convert_ids_to_tokens(inv_vocab, ids): return convert_by_vocab(inv_vocab, ids) def whitespace_tokenize(text): """Runs basic whitespace cleaning and splitting on a peice of text.""" text = text.strip() if not text: return [] tokens = text.split() return tokens class FullTokenizer(object): """Runs end-to-end tokenziation.""" def __init__(self, vocab_file, do_lower_case=True): self.vocab = load_vocab(vocab_file) self.inv_vocab = {v: k for k, v in self.vocab.items()} self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) def tokenize(self, text): split_tokens = [] for token in self.basic_tokenizer.tokenize(text): for sub_token in self.wordpiece_tokenizer.tokenize(token): split_tokens.append(sub_token) return split_tokens def convert_tokens_to_ids(self, tokens): return convert_by_vocab(self.vocab, tokens) def convert_ids_to_tokens(self, ids): return convert_by_vocab(self.inv_vocab, ids) class CharTokenizer(object): """Runs end-to-end tokenziation.""" def __init__(self, vocab_file, do_lower_case=True): self.vocab = load_vocab(vocab_file) self.inv_vocab = {v: k for k, v in self.vocab.items()} self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) def tokenize(self, text): split_tokens = [] for token in text.lower().split(" "): for sub_token in self.wordpiece_tokenizer.tokenize(token): split_tokens.append(sub_token) return split_tokens def convert_tokens_to_ids(self, tokens): return convert_by_vocab(self.vocab, tokens) def convert_ids_to_tokens(self, ids): return convert_by_vocab(self.inv_vocab, ids) class BasicTokenizer(object): """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" def __init__(self, do_lower_case=True): """Constructs a BasicTokenizer. Args: do_lower_case: Whether to lower case the input. """ self.do_lower_case = do_lower_case self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] def tokenize(self, text): """Tokenizes a piece of text.""" text = convert_to_unicode(text) text = self._clean_text(text) # This was added on November 1st, 2018 for the multilingual and Chinese # models. This is also applied to the English models now, but it doesn't # matter since the English models were not trained on any Chinese data # and generally don't have any Chinese data in them (there are Chinese # characters in the vocabulary because Wikipedia does have some Chinese # words in the English Wikipedia.). 
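        # NOTE: the call below wraps every CJK character in spaces, so each
        # Chinese character becomes a standalone token before WordPiece runs.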
text = self._tokenize_chinese_chars(text) orig_tokens = whitespace_tokenize(text) split_tokens = [] for token in orig_tokens: if self.do_lower_case and token not in self._never_lowercase: token = token.lower() token = self._run_strip_accents(token) if token in self._never_lowercase: split_tokens.extend([token]) else: split_tokens.extend(self._run_split_on_punc(token)) output_tokens = whitespace_tokenize(" ".join(split_tokens)) return output_tokens def _run_strip_accents(self, text): """Strips accents from a piece of text.""" text = unicodedata.normalize("NFD", text) output = [] for char in text: cat = unicodedata.category(char) if cat == "Mn": continue output.append(char) return "".join(output) def _run_split_on_punc(self, text): """Splits punctuation on a piece of text.""" chars = list(text) i = 0 start_new_word = True output = [] while i < len(chars): char = chars[i] if _is_punctuation(char): output.append([char]) start_new_word = True else: if start_new_word: output.append([]) start_new_word = False output[-1].append(char) i += 1 return ["".join(x) for x in output] def _tokenize_chinese_chars(self, text): """Adds whitespace around any CJK character.""" output = [] for char in text: cp = ord(char) if self._is_chinese_char(cp): output.append(" ") output.append(char) output.append(" ") else: output.append(char) return "".join(output) def _is_chinese_char(self, cp): """Checks whether CP is the codepoint of a CJK character.""" # This defines a "chinese character" as anything in the CJK Unicode block: # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) # # Note that the CJK Unicode block is NOT all Japanese and Korean characters, # despite its name. The modern Korean Hangul alphabet is a different block, # as is Japanese Hiragana and Katakana. Those alphabets are used to write # space-separated words, so they are not treated specially and handled # like the all of the other languages. if ((cp >= 0x4E00 and cp <= 0x9FFF) or # (cp >= 0x3400 and cp <= 0x4DBF) or # (cp >= 0x20000 and cp <= 0x2A6DF) or # (cp >= 0x2A700 and cp <= 0x2B73F) or # (cp >= 0x2B740 and cp <= 0x2B81F) or # (cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or # (cp >= 0x2F800 and cp <= 0x2FA1F)): # return True return False def _clean_text(self, text): """Performs invalid character removal and whitespace cleanup on text.""" output = [] for char in text: cp = ord(char) if cp == 0 or cp == 0xfffd or _is_control(char): continue if _is_whitespace(char): output.append(" ") else: output.append(char) return "".join(output) class WordpieceTokenizer(object): """Runs WordPiece tokenziation.""" def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): self.vocab = vocab self.unk_token = unk_token self.max_input_chars_per_word = max_input_chars_per_word def tokenize(self, text): """Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example: input = "unaffable" output = ["un", "##aff", "##able"] Args: text: A single token or whitespace separated tokens. This should have already been passed through `BasicTokenizer. Returns: A list of wordpiece tokens. 
""" text = convert_to_unicode(text) output_tokens = [] for token in whitespace_tokenize(text): chars = list(token) if len(chars) > self.max_input_chars_per_word: output_tokens.append(self.unk_token) continue is_bad = False start = 0 sub_tokens = [] while start < len(chars): end = len(chars) cur_substr = None while start < end: substr = "".join(chars[start:end]) if start > 0: substr = "##" + substr if substr in self.vocab: cur_substr = substr break end -= 1 if cur_substr is None: is_bad = True break sub_tokens.append(cur_substr) start = end if is_bad: output_tokens.append(self.unk_token) else: output_tokens.extend(sub_tokens) return output_tokens def _is_whitespace(char): """Checks whether `chars` is a whitespace character.""" # \t, \n, and \r are technically contorl characters but we treat them # as whitespace since they are generally considered as such. if char == " " or char == "\t" or char == "\n" or char == "\r": return True cat = unicodedata.category(char) if cat == "Zs": return True return False def _is_control(char): """Checks whether `chars` is a control character.""" # These are technically control characters but we count them as whitespace # characters. if char == "\t" or char == "\n" or char == "\r": return False cat = unicodedata.category(char) if cat.startswith("C"): return True return False def _is_punctuation(char): """Checks whether `chars` is a punctuation character.""" cp = ord(char) # We treat all non-letter/number ASCII as punctuation. # Characters such as "^", "$", and "`" are not in the Unicode # Punctuation class but we treat them as punctuation anyways, for # consistency. if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): return True cat = unicodedata.category(char) if cat.startswith("P"): return True return False ================================================ FILE: paddlepalm/tokenizer/ernie_tokenizer.py ================================================ # -*- coding: UTF-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
"""Tokenization classes.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function from __future__ import unicode_literals from __future__ import absolute_import from io import open import collections import unicodedata import six def convert_to_unicode(text): """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text.decode("utf-8", "ignore") elif isinstance(text, unicode): return text else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def printable_text(text): """Returns text encoded in a way suitable for print or `tf.logging`.""" # These functions want `str` for both Python2 and Python3, but in one case # it's a Unicode string and in the other it's a byte string. if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text elif isinstance(text, unicode): return text.encode("utf-8") else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def load_vocab(vocab_file): """Loads a vocabulary file into a dictionary.""" vocab = collections.OrderedDict() with open(vocab_file, encoding='utf8') as fin: for num, line in enumerate(fin): items = convert_to_unicode(line.strip()).split("\t") if len(items) > 2: break token = items[0] index = items[1] if len(items) == 2 else num token = token.strip() vocab[token] = int(index) return vocab def convert_by_vocab(vocab, items): """Converts a sequence of [tokens|ids] using the vocab.""" output = [] for item in items: output.append(vocab[item]) return output def convert_tokens_to_ids(vocab, tokens): return convert_by_vocab(vocab, tokens) def convert_ids_to_tokens(inv_vocab, ids): return convert_by_vocab(inv_vocab, ids) def whitespace_tokenize(text): """Runs basic whitespace cleaning and splitting on a peice of text.""" text = text.strip() if not text: return [] tokens = text.split() return tokens class FullTokenizer(object): """Runs end-to-end tokenziation.""" def __init__(self, vocab_file, do_lower_case=True): self.vocab = load_vocab(vocab_file) self.inv_vocab = {v: k for k, v in self.vocab.items()} self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) def tokenize(self, text): split_tokens = [] for token in self.basic_tokenizer.tokenize(text): for sub_token in self.wordpiece_tokenizer.tokenize(token): split_tokens.append(sub_token) return split_tokens def convert_tokens_to_ids(self, tokens): return convert_by_vocab(self.vocab, tokens) def convert_ids_to_tokens(self, ids): return convert_by_vocab(self.inv_vocab, ids) class CharTokenizer(object): """Runs end-to-end tokenziation.""" def __init__(self, vocab_file, do_lower_case=True): self.vocab = load_vocab(vocab_file) self.inv_vocab = {v: k for k, v in self.vocab.items()} self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) def tokenize(self, text): split_tokens = [] for token in text.lower().split(" "): for sub_token in self.wordpiece_tokenizer.tokenize(token): 
split_tokens.append(sub_token) return split_tokens def convert_tokens_to_ids(self, tokens): return convert_by_vocab(self.vocab, tokens) def convert_ids_to_tokens(self, ids): return convert_by_vocab(self.inv_vocab, ids) class BasicTokenizer(object): """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" def __init__(self, do_lower_case=True): """Constructs a BasicTokenizer. Args: do_lower_case: Whether to lower case the input. """ self.do_lower_case = do_lower_case self._never_lowercase = ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] def tokenize(self, text): """Tokenizes a piece of text.""" text = convert_to_unicode(text) text = self._clean_text(text) # This was added on November 1st, 2018 for the multilingual and Chinese # models. This is also applied to the English models now, but it doesn't # matter since the English models were not trained on any Chinese data # and generally don't have any Chinese data in them (there are Chinese # characters in the vocabulary because Wikipedia does have some Chinese # words in the English Wikipedia.). text = self._tokenize_chinese_chars(text) orig_tokens = whitespace_tokenize(text) split_tokens = [] for token in orig_tokens: if self.do_lower_case and token not in self._never_lowercase: token = token.lower() token = self._run_strip_accents(token) if token in self._never_lowercase: split_tokens.extend([token]) else: split_tokens.extend(self._run_split_on_punc(token)) output_tokens = whitespace_tokenize(" ".join(split_tokens)) return output_tokens def _run_strip_accents(self, text): """Strips accents from a piece of text.""" text = unicodedata.normalize("NFD", text) output = [] for char in text: cat = unicodedata.category(char) if cat == "Mn": continue output.append(char) return "".join(output) def _run_split_on_punc(self, text): """Splits punctuation on a piece of text.""" chars = list(text) i = 0 start_new_word = True output = [] while i < len(chars): char = chars[i] if _is_punctuation(char): output.append([char]) start_new_word = True else: if start_new_word: output.append([]) start_new_word = False output[-1].append(char) i += 1 return ["".join(x) for x in output] def _tokenize_chinese_chars(self, text): """Adds whitespace around any CJK character.""" output = [] for char in text: cp = ord(char) if self._is_chinese_char(cp): output.append(" ") output.append(char) output.append(" ") else: output.append(char) return "".join(output) def _is_chinese_char(self, cp): """Checks whether CP is the codepoint of a CJK character.""" # This defines a "chinese character" as anything in the CJK Unicode block: # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) # # Note that the CJK Unicode block is NOT all Japanese and Korean characters, # despite its name. The modern Korean Hangul alphabet is a different block, # as is Japanese Hiragana and Katakana. Those alphabets are used to write # space-separated words, so they are not treated specially and handled # like the all of the other languages. 
if ((cp >= 0x4E00 and cp <= 0x9FFF) or # (cp >= 0x3400 and cp <= 0x4DBF) or # (cp >= 0x20000 and cp <= 0x2A6DF) or # (cp >= 0x2A700 and cp <= 0x2B73F) or # (cp >= 0x2B740 and cp <= 0x2B81F) or # (cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or # (cp >= 0x2F800 and cp <= 0x2FA1F)): # return True return False def _clean_text(self, text): """Performs invalid character removal and whitespace cleanup on text.""" output = [] for char in text: cp = ord(char) if cp == 0 or cp == 0xfffd or _is_control(char): continue if _is_whitespace(char): output.append(" ") else: output.append(char) return "".join(output) class WordpieceTokenizer(object): """Runs WordPiece tokenziation.""" def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): self.vocab = vocab self.unk_token = unk_token self.max_input_chars_per_word = max_input_chars_per_word def tokenize(self, text): """Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example: input = "unaffable" output = ["un", "##aff", "##able"] Args: text: A single token or whitespace separated tokens. This should have already been passed through `BasicTokenizer. Returns: A list of wordpiece tokens. """ text = convert_to_unicode(text) output_tokens = [] for token in whitespace_tokenize(text): chars = list(token) if len(chars) > self.max_input_chars_per_word: output_tokens.append(self.unk_token) continue is_bad = False start = 0 sub_tokens = [] while start < len(chars): end = len(chars) cur_substr = None while start < end: substr = "".join(chars[start:end]) if start > 0: substr = "##" + substr if substr in self.vocab: cur_substr = substr break end -= 1 if cur_substr is None: is_bad = True break sub_tokens.append(cur_substr) start = end if is_bad: output_tokens.append(self.unk_token) else: output_tokens.extend(sub_tokens) return output_tokens def _is_whitespace(char): """Checks whether `chars` is a whitespace character.""" # \t, \n, and \r are technically contorl characters but we treat them # as whitespace since they are generally considered as such. if char == " " or char == "\t" or char == "\n" or char == "\r": return True cat = unicodedata.category(char) if cat == "Zs": return True return False def _is_control(char): """Checks whether `chars` is a control character.""" # These are technically control characters but we count them as whitespace # characters. if char == "\t" or char == "\n" or char == "\r": return False cat = unicodedata.category(char) if cat.startswith("C"): return True return False def _is_punctuation(char): """Checks whether `chars` is a punctuation character.""" cp = ord(char) # We treat all non-letter/number ASCII as punctuation. # Characters such as "^", "$", and "`" are not in the Unicode # Punctuation class but we treat them as punctuation anyways, for # consistency. if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): return True cat = unicodedata.category(char) if cat.startswith("P"): return True return False def tokenize_chinese_chars(text): """Adds whitespace around any CJK character.""" def _is_chinese_char(cp): """Checks whether CP is the codepoint of a CJK character.""" # This defines a "chinese character" as anything in the CJK Unicode block: # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) # # Note that the CJK Unicode block is NOT all Japanese and Korean characters, # despite its name. 
The modern Korean Hangul alphabet is a different block, # as is Japanese Hiragana and Katakana. Those alphabets are used to write # space-separated words, so they are not treated specially and handled # like the all of the other languages. if ((cp >= 0x4E00 and cp <= 0x9FFF) or # (cp >= 0x3400 and cp <= 0x4DBF) or # (cp >= 0x20000 and cp <= 0x2A6DF) or # (cp >= 0x2A700 and cp <= 0x2B73F) or # (cp >= 0x2B740 and cp <= 0x2B81F) or # (cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or # (cp >= 0x2F800 and cp <= 0x2FA1F)): # return True return False def _is_whitespace(c): if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: return True return False output = [] buff = "" for char in text: cp = ord(char) if _is_chinese_char(cp) or _is_whitespace(char): if buff != "": output.append(buff) buff = "" output.append(char) else: buff += char if buff != "": output.append(buff) return output ================================================ FILE: paddlepalm/trainer.py ================================================ # -*- coding: utf-8 -*- # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from __future__ import print_function import os import json from paddle import fluid import time import sys import numpy as np import paddlepalm.utils.basic_helper as helper from paddlepalm.utils import reader_helper, saver from paddlepalm.distribute import gpu_dev_count, data_feeder, decode_fake # from paddlepalm.default_settings import * DEBUG=False class Trainer(object): """ The core unit to start a training/predicting session for single task. A trainer is to build computation graph, manage training and evaluation process, achieve model/checkpoint saving and pretrain_model/checkpoint loading. """ def __init__(self, name, mix_ratio=1.0, reuse_head_with=None): """Create a new trainer. Args: name: string. The name of the trainer(training task). mix_ratio: sampling weight of this trainer in multi-task learning mode. Default is 1.0. reuse_head_with: reuse parameters of task head with another trainer. Default is None, not reuse with others. 
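
        Example (an illustrative sketch; the task name is arbitrary):

            trainer = Trainer('senti_cls', mix_ratio=1.0)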
""" self._name = name self._pred_reader = None self._task_head = None self._pred_head = None self._train_reader = None self._dist_train_init = False self._predict_reader = None self._train_iterator = None self._predict_iterator = None self._train_init = False self._predict_init = False self._train_init_prog = None self._pred_init_prog = None self._check_save = lambda: False self._task_reuse_scope = name if reuse_head_with is None else reuse_head_with self._feeded_var_names = None self._target_vars = None self._predict_vars = None self._num_examples = 0 self._multi_task = False self._as_auxilary = False self._task_id = None # training process management self._mix_ratio = mix_ratio self._expected_train_steps = None self._expected_train_epochs = None self._steps_pur_epoch = None self._pred_steps_pur_epoch = None self._cur_train_epoch = 0 self._cur_train_step = 0 self._train_finish = False self._inputname_to_varname = {} self._pred_input_name_list = [] self._pred_input_varname_list = [] self._pred_fetch_name_list = [] self._pred_fetch_var_list = [] # exe is built when random_init_params called. self._exe = None self._save_protocol = { 'input_names': 'self._pred_input_name_list', 'input_varnames': 'self._pred_input_varname_list', 'fetch_list': 'self._pred_fetch_name_list'} self._lock = False self._lock_prog = False self._build_forward = False def build_forward(self, backbone, task_head): """ Build forward computation graph for training, which usually built from input layer to loss node. Args: backbone: a Backbone object with phase == 'train', which is used to extract multi-level text features, e.g., contextual word embedding and sentence embedding. head: a Head object with phase == 'train', which is used to build task specific output layers. Return: loss_var: a Variable object. The computational graph variable(node) of loss. 
""" self._task_head = task_head self._backbone = backbone self._build_forward = True # create reader, task # then check i/o across reader, backbone and task_layer task_attrs = [] pred_task_attrs = [] task_attr_from_reader = helper.encode_inputs(self._task_head.inputs_attrs['reader'], self.name) # merge reader input attrs from backbone and task_instances input_names, shape_and_dtypes, name_to_position = reader_helper.merge_input_attrs(backbone.inputs_attr, task_attr_from_reader, insert_taskid=False) # shapes: [task_id, shapes_of_backbone, shapes_of_inst1, ..., shapes_of_instN] self._shape_and_dtypes = shape_and_dtypes self._name_to_position = name_to_position self._input_names = input_names if DEBUG: print('----- for debug -----') print('joint input names:') print(joint_input_names) print('joint input shape and dtypes:') print(joint_shape_and_dtypes) input_attrs = [[i, j, k] for i, (j,k) in zip(input_names, shape_and_dtypes)] train_prog = fluid.Program() train_init_prog = fluid.Program() if not self._lock_prog: self._train_prog = train_prog self._train_init_prog = train_init_prog if not self._lock_prog: with fluid.program_guard(train_prog, train_init_prog): net_inputs = reader_helper.create_net_inputs(input_attrs, is_async=False) bb_output_vars = backbone.build(net_inputs) else: net_inputs = reader_helper.create_net_inputs(input_attrs, is_async=False) bb_output_vars = backbone.build(net_inputs) self._net_inputs = net_inputs assert sorted(bb_output_vars.keys()) == sorted(backbone.outputs_attr.keys()) task_output_vars = {} task_inputs = {'backbone': bb_output_vars} task_inputs_from_reader = helper.decode_inputs(net_inputs, self.name) task_inputs['reader'] = task_inputs_from_reader scope = self.name+'.' if not self._lock_prog: with fluid.program_guard(train_prog, train_init_prog): with fluid.unique_name.guard(scope): output_vars = self._build_head(task_inputs, phase='train', scope=scope) else: with fluid.unique_name.guard(scope): output_vars = self._build_head(task_inputs, phase='train', scope=scope) output_vars = {self.name+'.'+key: val for key, val in output_vars.items()} old = len(task_output_vars) # for debug task_output_vars.update(output_vars) assert len(task_output_vars) - old == len(output_vars) # for debug bb_fetches = {k: v.name for k,v in bb_output_vars.items()} task_fetches = {k: v.name for k,v in task_output_vars.items()} self._fetches = task_fetches self._fetch_names, self._fetch_list = zip(*self._fetches.items()) if not self._lock_prog: with fluid.program_guard(train_prog, train_init_prog): loss_var = fluid.layers.reduce_sum(task_output_vars[self.name+'.loss']) else: loss_var = fluid.layers.reduce_sum(task_output_vars[self.name+'.loss']) self._loss_var = loss_var if not self._multi_task: self._init_exe_prog(for_train=True) return loss_var def build_predict_forward(self, pred_backbone, pred_head): """ Build computation graph for evaluation and prediction. Arguments: - pred_backbone: a Backbone object with phase == 'predict'. For evaluating model during training, the predict backbone should keep the same with train backbone. - pred_head: a Head object with phase == 'predict'. For evaluating model during training, the predict head should keep the same with train head. Return: - output_vars: dict type. Each value is a computational graph variable(node) argumented by pred_head outputs_attr. 
""" self._pred_head = pred_head self._pred_backbone = pred_backbone pred_task_attr_from_reader = helper.encode_inputs(self._pred_head.inputs_attrs['reader'], self.name) pred_input_names, pred_shape_and_dtypes, pred_name_to_position = reader_helper.merge_input_attrs(pred_backbone.inputs_attr, pred_task_attr_from_reader, insert_taskid=False) pred_input_attrs = [[i, j, k] for i, (j,k) in zip(pred_input_names, pred_shape_and_dtypes)] self._pred_shape_and_dtypes = pred_shape_and_dtypes self._pred_name_to_position = pred_name_to_position self._pred_input_names = pred_input_names if not self._lock_prog: pred_prog = fluid.Program() self._pred_prog = pred_prog pred_init_prog = fluid.Program() self._pred_init_prog = pred_init_prog with fluid.program_guard(pred_prog, pred_init_prog): pred_net_inputs = reader_helper.create_net_inputs(pred_input_attrs) pred_bb_output_vars = pred_backbone.build(pred_net_inputs) self._pred_net_inputs = pred_net_inputs else: pred_net_inputs = reader_helper.create_net_inputs(pred_input_attrs) pred_bb_output_vars = pred_backbone.build(pred_net_inputs) self._pred_net_inputs = pred_net_inputs # prepare predict vars for saving inference model if not self._lock_prog: with fluid.program_guard(pred_prog, pred_init_prog): cur_inputs = helper.decode_inputs(pred_net_inputs, self.name) self._pred_input_name_list, self._pred_input_varname_list = \ zip(*[[k, v.name] for k,v in cur_inputs.items()]) pred_task_inputs = {'backbone': pred_bb_output_vars, 'reader': cur_inputs} scope = self.name + '.' with fluid.unique_name.guard(scope): output_vars = self._build_head(pred_task_inputs, phase='predict', scope=scope) else: cur_inputs = helper.decode_inputs(pred_net_inputs, self.name) self._pred_input_name_list, self._pred_input_varname_list = \ zip(*[[k, v.name] for k,v in cur_inputs.items()]) pred_task_inputs = {'backbone': pred_bb_output_vars, 'reader': cur_inputs} scope = self.name + '.' with fluid.unique_name.guard(scope): output_vars = self._build_head(pred_task_inputs, phase='predict', scope=scope) if output_vars is not None: self._pred_fetch_name_list, self._pred_fetch_list = zip(*output_vars.items()) else: self._pred_fetch_name_list = [] self._pred_fetch_var_list = [] # if not self._multi_task: self._init_exe_prog(for_train=False) self._exe.run(self._pred_init_prog) self._predict_vars = output_vars return output_vars def build_backward(self, optimizer, weight_decay=None, use_ema=False, ema_decay=None): """ Build backward computation graph and training strategy. Arguments: - optimizer: - weight_decay: optional, default is None (disable weight decay). - use_ema: optional, default is False. The flag to control whether to apply Exponential Moving Average strategy on parameter updates. - ema_decay: optional, default is None. Only works with use_ema == True. Control decay rate of EMA strategy. """ # build optimizer assert self._loss_var is not None and self._train_init_prog is not None, "train graph not foung! You should build_forward first." 
        optimizer._set_prog(self._train_prog, self._train_init_prog)
        with fluid.program_guard(self._train_prog, self._train_init_prog):
            param_grads = optimizer._build()

            if weight_decay is not None:
                param_list = dict()
                for param in self._train_prog.global_block().all_parameters():
                    param_list[param.name] = param * 1.0
                    param_list[param.name].stop_gradient = True

                def exclude_from_weight_decay(name):
                    # LayerNorm and bias parameters are conventionally
                    # excluded from weight decay
                    if name.find("layer_norm") > -1:
                        return True
                    bias_suffix = ["_bias", "_b", ".b_0"]
                    for suffix in bias_suffix:
                        if name.endswith(suffix):
                            return True
                    return False

                for param, grad in param_grads:
                    if exclude_from_weight_decay(param.name):
                        continue
                    with param.block.program._optimized_guard(
                            [param, grad]), fluid.framework.name_scope("weight_decay"):
                        updated_param = param - param_list[
                            param.name] * weight_decay * optimizer.get_cur_learning_rate()
                        fluid.layers.assign(output=param, input=updated_param)

            if use_ema:
                ema = fluid.optimizer.ExponentialMovingAverage(ema_decay)
                ema.update()

        self._exe.run(self._train_init_prog)

    def set_as_aux(self):
        """Set the task in this trainer as an auxiliary task.
        \nCAUTION: This API only works in multi-task learning mode. Each task is set as a target task by default.
        """
        self._as_auxilary = True

    def fit_reader(self, reader, phase='train'):
        """
        Bind a reader with loaded train/predict data to the trainer.

        Args:
            reader: a Reader object. The running phase of the reader should be consistent with the `phase` argument of this method.
            phase: running phase. Currently supported: train, predict.
        """
        self._check_phase(phase)

        if phase == 'train':
            assert self._shape_and_dtypes is not None, "You need to build_forward or build_predict_head first to prepare input features."
        else:
            assert self._pred_shape_and_dtypes is not None, "You need to build_forward or build_predict_head first to prepare input features."
        batch_size = reader._batch_size
        self._num_epochs = reader.num_epochs
        if phase == 'train':
            self._train_reader = reader
            self._steps_pur_epoch = reader.num_examples // batch_size
            shape_and_dtypes = self._shape_and_dtypes
            name_to_position = self._name_to_position
            if self._task_id is not None:
                self._net_inputs['__task_id'] = self._task_id
            net_inputs = self._net_inputs
            self._train_batch_size = batch_size
            self._num_examples = reader.num_examples
            reader_helper.check_io(self._backbone.inputs_attr, reader.outputs_attr, in_name='backbone', out_name='reader(train)')
            reader_helper.check_io(self._task_head.inputs_attrs['reader'], reader.outputs_attr, in_name='task_head(reader)', out_name='reader(train)')
            reader_helper.check_io(self._task_head.inputs_attrs['backbone'], self._backbone.outputs_attr, in_name='task_head(backbone, train)', out_name='backbone')
        elif phase == 'predict':
            self._predict_reader = reader
            self._pred_steps_pur_epoch = reader.num_examples // batch_size
            shape_and_dtypes = self._pred_shape_and_dtypes
            name_to_position = self._pred_name_to_position
            net_inputs = self._pred_net_inputs
            self._predict_batch_size = batch_size
            self._pred_num_examples = reader.num_examples
            reader_helper.check_io(self._pred_backbone.inputs_attr, reader.outputs_attr, in_name='backbone', out_name='reader(predict)')
            reader_helper.check_io(self._pred_head.inputs_attrs['reader'], reader.outputs_attr, in_name='task_head(reader)', out_name='reader(predict)')
            reader_helper.check_io(self._pred_head.inputs_attrs['backbone'], self._pred_backbone.outputs_attr, in_name='task_head(backbone, predict)', out_name='backbone')
        else:
            raise NotImplementedError()
        print('ok!')

        # merge dataset iterators and create net input vars
        iterator = reader._iterator()
        prefix = self.name

        # run-time check and adaptation of the data yielded by the reader
        iterator_fn = reader_helper.create_iterator_fn(iterator, prefix, shape_and_dtypes, name_to_position, return_type='dict')
        self._raw_iterator_fn = iterator_fn

        feed_batch_process_fn = reader_helper.create_feed_batch_process_fn(net_inputs)
        if gpu_dev_count > 1:
            distribute_feeder_fn = data_feeder(iterator_fn, feed_batch_process_fn, phase=phase)
        else:
            distribute_feeder_fn = iterator_fn()

        if phase == 'train':
            self._train_iterator = distribute_feeder_fn
            self._feed_batch_process_fn = feed_batch_process_fn
        elif phase == 'predict':
            self._predict_iterator = distribute_feeder_fn
            self._pred_feed_batch_process_fn = feed_batch_process_fn
        return distribute_feeder_fn

    def load_ckpt(self, model_path):
        """
        Load a training checkpoint for further training or predicting.

        Args:
            model_path: the path of the saved checkpoint/parameters.
        """
        assert self._train_init_prog is not None or self._pred_init_prog is not None, "model graph not built. You should at least build_forward or build_predict_forward to load its checkpoint."
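        # Checkpoints written by the built-in saver (see set_saver,
        # save_type='ckpt') can be restored here, e.g. with an illustrative
        # path:
        #
        #   trainer.load_ckpt('outputs/ckpt.step1000')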
# if self._train_init_prog is not None: # saver.init_pretraining_params( # self._exe, # model_path, # convert=False, # main_program=self._train_init_prog, # strict=True) # elif self._pred_init_prog is not None: # saver.init_pretraining_params( # self._exe, # model_path, # convert=False, # main_program=self._pred_init_prog, # strict=True) if self._train_init_prog is not None: print('loading checkpoint into train program') saver.init_checkpoint( self._exe, model_path, main_program=self._train_init_prog) elif self._pred_init_prog is not None: saver.init_checkpoint( self._exe, model_path, main_program=self._pred_init_prog) else: raise Exception("model not found. You should at least build_forward or build_predict_forward to load its checkpoint.") def load_predict_model(self, model_path, convert=False): """ load pretrain models(backbone) for training. Args: model_path: the path of saved pretrained parameters. """ assert self._pred_prog is not None, "training graph not found. You should at least build_forward to load its pretrained parameters." saver.init_pretraining_params( self._exe, model_path, convert=convert, main_program=self._pred_prog) def load_pretrain(self, model_path, convert=False): """ load pretrain models(backbone) for training. Args: model_path: the path of saved pretrained parameters. """ assert self._train_init_prog is not None, "training graph not found. You should at least build_forward to load its pretrained parameters." saver.init_pretraining_params( self._exe, model_path, convert=convert, main_program=self._train_init_prog) def set_saver(self, save_path, save_steps, save_type='ckpt'): """ create a build-in saver into trainer. A saver will automatically save checkpoint or predict model every `save_steps` training steps. Args: save_path: a string. the path to save checkpoints or predict models. save_steps: an integer. the frequency to save models. save_type: a string. The type of saved model. Currently support checkpoint(ckpt) and predict model(predict), default is ckpt. If both two types are needed to save, you can set as "ckpt,predict". """ save_type = save_type.split(',') if 'predict' in save_type: assert self._pred_head is not None, "Predict head not found! You should build_predict_head first if you want to save predict model." assert save_path is not None and save_steps is not None, 'save_path and save_steps is required to save model.' self._save_predict = True if not os.path.exists(save_path): os.makedirs(save_path) else: self._save_predict = False if 'ckpt' in save_type: if save_path is not None and save_steps is not None: self._save_ckpt = True if not os.path.exists(save_path): os.makedirs(save_path) else: "WARNING: save_path or save_steps is not set, model will not be saved during training." self._save_ckpt = False else: self._save_ckpt = False def temp_func(): if (self._save_predict or self._save_ckpt) and self._cur_train_step % save_steps == 0: if self._save_predict: self._save(save_path, suffix='pred.step'+str(self._cur_train_step)) print('predict model has been saved at '+os.path.join(save_path, 'pred.step'+str(self._cur_train_step))) sys.stdout.flush() if self._save_ckpt: fluid.io.save_persistables(self._exe, os.path.join(save_path, 'ckpt.step'+str(self._cur_train_step)), self._train_prog) print('checkpoint has been saved at '+os.path.join(save_path, 'ckpt.step'+str(self._cur_train_step))) sys.stdout.flush() return True else: return False self._check_save = temp_func def train(self, print_steps=5): """ start training. Args: print_steps: int. 
Logging frequency of training message, e.g., current step, loss and speed. """ iterator = self._train_iterator self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name) time_begin = time.time() for feed in iterator: rt_outputs = self.train_one_step(feed) task_rt_outputs = {k[len(self.name+'.'):]: v for k,v in rt_outputs.items() if k.startswith(self.name+'.')} self._task_head.batch_postprocess(task_rt_outputs) if print_steps > 0 and self._cur_train_step % print_steps == 0: loss = rt_outputs[self.name+'.loss'] loss = np.mean(np.squeeze(loss)).tolist() time_end = time.time() time_cost = time_end - time_begin print("step {}/{} (epoch {}), loss: {:.3f}, speed: {:.2f} steps/s".format( (self._cur_train_step-1) % self._steps_pur_epoch + 1 , self._steps_pur_epoch, self._cur_train_epoch, loss, print_steps / time_cost)) sys.stdout.flush() time_begin = time.time() if self._num_epochs is None and not self._multi_task and self._cur_train_step == self._steps_pur_epoch: break def predict(self, output_dir=None, print_steps=1000): """ start predicting. Args: output_dir: str. The path to save prediction results, default is None. If set as None, the results would output to screen directly. print_steps: int. Logging frequency of predicting message, e.g., current progress and speed. """ iterator = self._predict_iterator self._distribute_pred_prog = fluid.CompiledProgram(self._pred_prog).with_data_parallel() if output_dir is not None and not os.path.exists(output_dir): os.makedirs(output_dir) time_begin = time.time() cur_predict_step = 0 for feed in iterator: rt_outputs = self.predict_one_batch(feed) self._pred_head.batch_postprocess(rt_outputs) cur_predict_step += 1 if print_steps > 0 and cur_predict_step % print_steps == 0: time_end = time.time() time_cost = time_end - time_begin print("batch {}/{}, speed: {:.2f} steps/s".format( cur_predict_step, self._pred_steps_pur_epoch, print_steps / time_cost)) sys.stdout.flush() time_begin = time.time() if self._pred_head.epoch_inputs_attrs: reader_outputs = self._predict_reader.get_epoch_outputs() else: reader_outputs = None results = self._pred_head.epoch_postprocess({'reader':reader_outputs}, output_dir=output_dir) return results def reset_buffer(self): self._pred_head.reset() def _check_phase(self, phase): assert phase in ['train', 'predict'], "Supported phase: train, predict," def _set_multitask(self): self._multi_task = True def _set_nomultitask(self): self._multi_task = False def _set_task_id(self, task_id): self._task_id = task_id def _init_exe_prog(self, for_train=True): if not self._train_init and not self._predict_init: on_gpu = gpu_dev_count > 0 self._exe = helper.build_executor(on_gpu) if for_train: assert self._train_prog is not None, "train graph not found! You should build_forward first before you random init parameters." self._train_init = True else: assert self._pred_prog is not None, "predict graph not found! You should build_predict_head first before you random init parameters." self._predict_init = True # def random_init_params(self): # """ # randomly initialize model parameters. 
# """ # # if not self._train_init: # self._init_exe_prog() # # print('random init params...') # self._exe.run(self._train_init_prog) def get_one_batch(self, phase='train'): self._check_phase(phase) if phase == 'train': return next(self._train_reader) elif phase == 'predict': return next(self._predict_reader) else: raise NotImplementedError() def _set_exe(self, exe): self._exe = exe def _set_dist_train(self, prog): self._distribute_train_prog = prog def _set_dist_pred(self, prog): self._distribute_pred_prog = prog def _set_fetch_list(self, fetch_list): self._fetch_list = fetch_list def train_one_step(self, batch): if not self._dist_train_init: self._distribute_train_prog = fluid.CompiledProgram(self._train_prog).with_data_parallel(loss_name=self._loss_var.name) self._dist_train_init = True exe = self._exe distribute_train_prog = self._distribute_train_prog fetch_list = self._fetch_list if gpu_dev_count > 1: feed, mask = batch rt_outputs = exe.run(distribute_train_prog, feed=feed, fetch_list=fetch_list) num_fakes = decode_fake(len(rt_outputs[0]), mask, self._train_batch_size) if num_fakes: rt_outputs = [i[:-num_fakes] for i in rt_outputs] else: feed = self._feed_batch_process_fn(batch) rt_outputs = exe.run(distribute_train_prog, feed=feed, fetch_list=fetch_list) rt_outputs = {k:v for k,v in zip(self._fetch_names, rt_outputs)} self._cur_train_step += 1 self._check_save() self._cur_train_epoch = (self._cur_train_step-1) // self._steps_pur_epoch return rt_outputs def predict_one_batch(self, batch): if gpu_dev_count > 1: feed, mask = batch rt_outputs = self._exe.run(self._distribute_pred_prog, feed=feed, fetch_list=self._pred_fetch_list, use_prune=True) num_fakes = decode_fake(len(rt_outputs[0]), mask, self._predict_batch_size) if num_fakes: rt_outputs = [i[:-num_fakes] for i in rt_outputs] else: feed = self._pred_feed_batch_process_fn(batch) rt_outputs = self._exe.run(self._distribute_pred_prog, feed=feed, fetch_list=self._pred_fetch_list, use_prune=True) rt_outputs = {k:v for k,v in zip(self._pred_fetch_name_list, rt_outputs)} return rt_outputs @property def name(self): return self._name @property def num_examples(self): return self._num_examples @property def mix_ratio(self): return self._mix_ratio @mix_ratio.setter def mix_ratio(self, value): self._mix_ratio = value @property def num_epochs(self): return self._num_epochs @property def cur_train_step(self): return self._cur_train_step @property def cur_train_epoch(self): return self._cur_train_epoch @property def steps_pur_epoch(self): return self._steps_pur_epoch def _build_head(self, net_inputs, phase, scope=""): self._check_phase(phase) if phase == 'train': output_vars = self._task_head.build(net_inputs, scope_name=scope) if phase == 'predict': output_vars = self._pred_head.build(net_inputs, scope_name=scope) return output_vars def _save(self, save_path, suffix=None): # dirpath = save_path.rstrip('/').rstrip('\\') + suffix if suffix is not None: dirpath = os.path.join(save_path, suffix) else: dirpath = save_path self._pred_input_varname_list = [str(i) for i in self._pred_input_varname_list] prog = self._pred_prog.clone() fluid.io.save_inference_model(dirpath, self._pred_input_varname_list, self._pred_fetch_var_list, self._exe, prog) conf = {} for k, strv in self._save_protocol.items(): d = None v = locals() exec('d={}'.format(strv), globals(), v) conf[k] = v['d'] with open(os.path.join(dirpath, '__conf__'), 'w') as writer: writer.write(json.dumps(conf, indent=1)) print(self._name + ': predict model saved at ' + dirpath) sys.stdout.flush() 
def _load(self, infer_model_path=None): if infer_model_path is None: infer_model_path = self._save_infermodel_path for k,v in json.load(open(os.path.join(infer_model_path, '__conf__'))).items(): strv = self._save_protocol[k] exec('{}=v'.format(strv)) pred_prog, self._pred_input_varname_list, self._pred_fetch_var_list = \ fluid.io.load_inference_model(infer_model_path, self._exe) print(self._name+': inference model loaded from ' + infer_model_path) sys.stdout.flush() return pred_prog ================================================ FILE: paddlepalm/utils/__init__.py ================================================ from . import basic_helper from . import config_helper ================================================ FILE: paddlepalm/utils/basic_helper.py ================================================ # coding=utf-8 import os import json import yaml from .config_helper import PDConfig import logging from paddle import fluid def get_basename(f): return os.path.splitext(f)[0] def get_suffix(f): return os.path.splitext(f)[-1] def parse_yaml(f, asdict=True, support_cmd_line=False): assert os.path.exists(f), "file {} not found.".format(f) if support_cmd_line: args = PDConfig(yaml_file=f, fuse_args=True) args.build() return args.asdict() if asdict else args else: if asdict: with open(f, "r") as fin: yaml_config = yaml.load(fin, Loader=yaml.SafeLoader) return yaml_config else: raise NotImplementedError() def parse_json(f, asdict=True, support_cmd_line=False): assert os.path.exists(f), "file {} not found.".format(f) if support_cmd_line: args = PDConfig(json_file=f, fuse_args=support_cmd_line) args.build() return args.asdict() if asdict else args else: if asdict: with open(f, "r") as fin: config = json.load(fin) return config else: raise NotImplementedError() def parse_list(string, astype=str): assert isinstance(string, str), "{} is not a string.".format(string) if ',' not in string: return [astype(string)] string = string.replace(',', ' ') return [astype(i) for i in string.split()] def try_float(s): try: float(s) return(float(s)) except: return s # TODO: 增加None机制,允许hidden size、batch size和seqlen设置为None def check_io(in_attr, out_attr, strict=False, in_name="left", out_name="right"): for name, attr in in_attr.items(): assert name in out_attr, in_name+': '+name+' not found in '+out_name if attr != out_attr[name]: if strict: raise ValueError(name+': shape or dtype not consistent!') else: logging.warning('{}: shape or dtype not consistent!\n{}:\n{}\n{}:\n{}'.format(name, in_name, attr, out_name, out_attr[name])) def encode_inputs(inputs, scope_name, sep='.', cand_set=None): outputs = {} for k, v in inputs.items(): if cand_set is not None: if k in cand_set: outputs[k] = v if scope_name+sep+k in cand_set: outputs[scope_name+sep+k] = v else: outputs[scope_name+sep+k] = v return outputs def decode_inputs(inputs, scope_name, sep='.', keep_unk_keys=True): outputs = {} for name, value in inputs.items(): # var for backbone are also available to tasks if keep_unk_keys and sep not in name: outputs[name] = value # var for this inst if name.startswith(scope_name+'.'): outputs[name[len(scope_name+'.'):]] = value return outputs def build_executor(on_gpu): if on_gpu: place = fluid.CUDAPlace(0) # dev_count = fluid.core.get_cuda_device_count() else: place = fluid.CPUPlace() # dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) # return fluid.Executor(place), dev_count return fluid.Executor(place) def fit_attr(conf, fit_attr, strict=False): for i, attr in fit_attr.items(): if i not in conf: if strict: 
================================================
FILE: paddlepalm/utils/config_helper.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys
import argparse
import json
import yaml
import six
import logging

logging_only_message = "%(message)s"
logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s"


class JsonConfig(object):
    """
    A high-level api for handling json configure file.
    """

    def __init__(self, config_path):
        self._config_dict = self._parse(config_path)

    def _parse(self, config_path):
        try:
            with open(config_path) as json_file:
                config_dict = json.load(json_file)
                assert isinstance(config_dict, dict), "Object in {} is NOT a dict.".format(config_path)
        except Exception:
            raise IOError("Error in parsing bert model config file '%s'" % config_path)
        else:
            return config_dict

    def __getitem__(self, key):
        return self._config_dict[key]

    def asdict(self):
        return self._config_dict

    def print_config(self):
        for arg, value in sorted(six.iteritems(self._config_dict)):
            print('%s: %s' % (arg, value))
        print('------------------------------------------------')


class ArgumentGroup(object):
    def __init__(self, parser, title, des):
        self._group = parser.add_argument_group(title=title, description=des)

    def add_arg(self, name, type, default, help, **kwargs):
        type = str2bool if type == bool else type
        self._group.add_argument(
            "--" + name,
            default=default,
            type=type,
            help=help + ' Default: %(default)s.',
            **kwargs)


class ArgConfig(object):
    """
    A high-level api for handling argument configs.
    """

    def __init__(self):
        parser = argparse.ArgumentParser()

        train_g = ArgumentGroup(parser, "training", "training options.")
        train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
        train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
        train_g.add_arg(
            "lr_scheduler", str, "linear_warmup_decay",
            "scheduler of learning rate.",
            choices=['linear_warmup_decay', 'noam_decay'])
        train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
        train_g.add_arg(
            "warmup_proportion", float, 0.1,
            "Proportion of training steps to perform linear learning rate warmup for.")
        train_g.add_arg("save_steps", int, 1000, "The steps interval to save checkpoints.")
        train_g.add_arg(
            "loss_scaling", float, 1.0,
            "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
        train_g.add_arg("pred_dir", str, None, "Path to save the prediction results")

        log_g = ArgumentGroup(parser, "logging", "logging related.")
        log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
        log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")

        run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
        run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
        run_type_g.add_arg(
            "use_fast_executor", bool, False,
            "If set, use fast parallel executor (in experiment).")
        run_type_g.add_arg(
            "num_iteration_per_drop_scope", int, 1,
            "The iteration intervals to clean up temporary variables.")
        run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
        run_type_g.add_arg("do_predict", bool, True, "Whether to perform prediction.")

        custom_g = ArgumentGroup(parser, "customize", "customized options.")
        self.custom_g = custom_g

        self.parser = parser

    def add_arg(self, name, dtype, default, descrip):
        self.custom_g.add_arg(name, dtype, default, descrip)

    def build_conf(self):
        return self.parser.parse_args()


def str2bool(v):
    # argparse cannot parse strings like "True"/"False" into Python booleans
    # directly, so map them by hand
    return v.lower() in ("true", "t", "1")


def print_arguments(args, log=None):
    if not log:
        print('----------- Configuration Arguments -----------')
        for arg, value in sorted(six.iteritems(vars(args))):
            print('%s: %s' % (arg, value))
        print('------------------------------------------------')
    else:
        log.info('----------- Configuration Arguments -----------')
        for arg, value in sorted(six.iteritems(vars(args))):
            log.info('%s: %s' % (arg, value))
        log.info('------------------------------------------------')


class PDConfig(object):
    """
    A high-level API for managing configuration files in PaddlePaddle.
    Can jointly work with command-line arguments, json files and yaml files.
    """

    def __init__(self, json_file=None, yaml_file=None, fuse_args=True):
        """
        Init function for PDConfig.
        json_file: the path to the json configure file.
        yaml_file: the path to the yaml configure file.
        fuse_args: if fuse the json/yaml configs with argparse.
        """
        if json_file is not None and yaml_file is not None:
            raise Warning(
                "json_file and yaml_file can not co-exist for now. please only use one configure file type."
            )

        self.args = None
        self.arg_config = {}
        self.json_config = {}
        self.yaml_config = {}

        parser = argparse.ArgumentParser()

        self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.")
        self.json_g = ArgumentGroup(parser, "json", "options from json.")
        self.com_g = ArgumentGroup(parser, "custom", "customized options.")

        self.parser = parser

        if json_file is not None:
            assert isinstance(json_file, str)
            self.load_json(json_file, fuse_args=fuse_args)

        if yaml_file is not None:
            assert isinstance(yaml_file, str) or isinstance(yaml_file, list)
            self.load_yaml(yaml_file, fuse_args=fuse_args)

    def load_json(self, file_path, fuse_args=True):
        if not os.path.exists(file_path):
            raise Warning("the json file %s does not exist." % file_path)

        with open(file_path, "r") as fin:
            self.json_config = json.loads(fin.read())

        if fuse_args:
            for name in self.json_config:
                if not isinstance(self.json_config[name], int) \
                        and not isinstance(self.json_config[name], float) \
                        and not isinstance(self.json_config[name], str) \
                        and not isinstance(self.json_config[name], bool):
                    continue
                self.json_g.add_arg(name,
                                    type(self.json_config[name]),
                                    self.json_config[name],
                                    "This is from %s" % file_path)

    def load_yaml(self, file_path_list, fuse_args=True):
        if isinstance(file_path_list, str):
            file_path_list = [file_path_list]
        for file_path in file_path_list:
            if not os.path.exists(file_path):
                raise Warning("the yaml file %s does not exist." % file_path)

            with open(file_path, "r") as fin:
                self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader)

            if fuse_args:
                for name in self.yaml_config:
                    if not isinstance(self.yaml_config[name], int) \
                            and not isinstance(self.yaml_config[name], float) \
                            and not isinstance(self.yaml_config[name], str) \
                            and not isinstance(self.yaml_config[name], bool):
                        continue
                    self.yaml_g.add_arg(name,
                                        type(self.yaml_config[name]),
                                        self.yaml_config[name],
                                        "This is from %s" % file_path)

    def build(self):
        self.args = self.parser.parse_args()
        self.arg_config = vars(self.args)

    def asdict(self):
        return self.arg_config

    def __add__(self, new_arg):
        assert isinstance(new_arg, list) or isinstance(new_arg, tuple)
        assert len(new_arg) >= 3
        assert self.args is None

        name = new_arg[0]
        dtype = new_arg[1]
        dvalue = new_arg[2]
        desc = new_arg[3] if len(new_arg) == 4 else "Description is not provided."

        self.com_g.add_arg(name, dtype, dvalue, desc)
        return self

    def __getattr__(self, name):
        if name in self.arg_config:
            return self.arg_config[name]

        if name in self.json_config:
            return self.json_config[name]

        if name in self.yaml_config:
            return self.yaml_config[name]

        raise Warning("The argument %s is not defined." % name)

    def Print(self):
        print("-" * 70)
        for name in self.arg_config:
            print("{: <25}\t{}".format(str(name), str(self.arg_config[name])))

        for name in self.json_config:
            if name not in self.arg_config:
                print("{: <25}\t{}".format(str(name), str(self.json_config[name])))

        for name in self.yaml_config:
            if name not in self.arg_config:
                print("{: <25}\t{}".format(str(name), str(self.yaml_config[name])))

        print("-" * 70)


if __name__ == "__main__":
    pd_config = PDConfig(yaml_file="./test/bert_config.yaml")
    pd_config += ("my_age", int, 18, "I am forever 18.")
    pd_config.build()

    print(pd_config.do_train)
    print(pd_config.hidden_size)
    print(pd_config.my_age)
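
# A minimal end-to-end sketch of PDConfig with a yaml file plus fused
# command-line overrides (the file name and keys here are hypothetical):
#
#     config = PDConfig(yaml_file="run_config.yaml", fuse_args=True)
#     config += ("custom_tag", str, "demo", "a user-defined option")
#     config.build()               # parses sys.argv against the fused options
#     print(config.learning_rate)  # yaml value, unless --learning_rate is
#                                  # passed on the command line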
================================================
FILE: paddlepalm/utils/plot_helper.py
================================================


================================================
FILE: paddlepalm/utils/print_helper.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

MAXLEN = 70

def print_dict(dic, title=""):
    if title:
        title = ' ' + title + ' '
        left_len = (MAXLEN - len(title)) // 2
        title = '-' * left_len + title
        right_len = MAXLEN - len(title)
        title = title + '-' * right_len
    else:
        title = '-' * MAXLEN
    print(title)
    for name in dic:
        print("{: <25}\t{}".format(str(name), str(dic[name])))
    print("")
    # print("-" * MAXLEN + '\n')

================================================
FILE: paddlepalm/utils/reader_helper.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import random
import logging
import numpy as np
import paddle
from paddle import fluid
from paddle.fluid import layers
from paddlepalm.distribute import gpu_dev_count, cpu_dev_count
import six

dev_count = 1 if gpu_dev_count <= 1 else gpu_dev_count


def create_feed_batch_process_fn(net_inputs):

    def feed_batch_process_fn(data, id=-1, phase='train', is_multi=False):
        temp = {}
        if dev_count > 1 and phase == 'train' and is_multi:
            inputs = net_inputs[id]
        else:
            inputs = net_inputs

        for q, var in inputs.items():
            if isinstance(var, str) or (six.PY3 and isinstance(var, bytes)) or (six.PY2 and isinstance(var, unicode)):
                temp[var] = data[q]
            else:
                temp[var.name] = data[q]

        return temp

    return feed_batch_process_fn


# def create_multihead_feed_batch_process_fn(net_inputs):
#
#     def feed_batch_process_fn(data, id=-1):
#         # temps = {}
#         # for i in range(len(net_inputs)):
#         temp = {}
#         inputs = net_inputs[id] if id != -1 else net_inputs
#
#         for q, var in inputs.items():
#             if isinstance(var, str) or isinstance(var, unicode):
#                 temp[var] = data[q]
#             else:
#                 temp[var.name] = data[q]
#
#         # temps[i] = temp
#
#         return temp
#
#     return feed_batch_process_fn


def check_io(in_attr, out_attr, strict=False, in_name="left", out_name="right"):
    for name, attr in in_attr.items():
        assert name in out_attr, in_name + ': ' + name + ' not found in ' + out_name
        if attr != out_attr[name]:
            if strict:
                raise ValueError(name + ': shape or dtype not consistent!')
            else:
                logging.warning('{}: shape or dtype not consistent!\n{}:\n{}\n{}:\n{}'.format(name, in_name, attr, out_name, out_attr[name]))


def _check_and_adapt_shape_dtype(rt_val, attr, message=""):
    if not isinstance(rt_val, np.ndarray):
        if rt_val is None:
            raise Exception(message + ": got None value. ")
        rt_val = np.array(rt_val)
        assert rt_val.dtype != np.dtype('O'), message + "yielded data is not a valid tensor (the number of elements on some dimension may be inconsistent): {}".format(rt_val)
        if rt_val.dtype == np.dtype('float64'):
            rt_val = rt_val.astype('float32')

    shape, dtype = attr
    assert rt_val.dtype == np.dtype(dtype), message + "yielded data type not consistent with attr settings. Expect: {}, receive: {}.".format(np.dtype(dtype), rt_val.dtype)
    assert len(shape) == rt_val.ndim, message + "yielded data rank (ndim) not consistent with attr settings. Expect: {}, receive: {}.".format(len(shape), rt_val.ndim)
    for rt, exp in zip(rt_val.shape, shape):
        if exp is None or exp < 0:
            continue
        assert rt == exp, "yielded data shape is not consistent with attr settings. Expected: {}, actual: {}.".format(exp, rt)
    return rt_val


def _zero_batch(attrs):
    pos_attrs = []
    for shape, dtype in attrs:
        pos_shape = [size if size and size > 0 else 1 for size in shape]
        pos_attrs.append([pos_shape, dtype])

    return [np.zeros(shape=shape, dtype=dtype) for shape, dtype in pos_attrs]


def _zero_batch_x(attrs, batch_size):
    pos_attrs = []
    for shape, dtype in attrs:
        pos_shape = [size for size in shape]
        if pos_shape[0] == -1:
            pos_shape[0] = batch_size
        if pos_shape[1] == -1:
            pos_shape[1] = 512  # max seq len
        pos_attrs.append([pos_shape, dtype])

    return [np.zeros(shape=shape, dtype=dtype) for shape, dtype in pos_attrs]


def create_net_inputs(input_attrs, is_async=False, iterator_fn=None, dev_count=1, n_prefetch=1):
    inputs = []
    ret = {}
    for name, shape, dtype in input_attrs:
        p = layers.data(name, shape=shape, dtype=dtype)
        ret[name] = p
        inputs.append(p)

    if is_async:
        assert iterator_fn is not None, "iterator_fn is needed for building async input layer."
        reader = fluid.io.PyReader(inputs, capacity=dev_count, iterable=False)
        reader.decorate_batch_generator(iterator_fn)
        reader.start()

    return ret


def create_iterator_fn(iterator, iterator_prefix, shape_and_dtypes, outname_to_pos, verbose=0, return_type='list'):
    pos_to_outname = {j: i for i, j in outname_to_pos.items()}

    def iterator_fn():
        v = verbose
        for outputs in iterator:
            results = [None] * len(outname_to_pos)
            prefix = iterator_prefix
            for outname, val in outputs.items():
                task_outname = prefix + '.' + outname

                if outname in outname_to_pos:
                    idx = outname_to_pos[outname]
                    val = _check_and_adapt_shape_dtype(val, shape_and_dtypes[idx])
                    results[idx] = val

                if task_outname in outname_to_pos:
                    idx = outname_to_pos[task_outname]
                    val = _check_and_adapt_shape_dtype(val, shape_and_dtypes[idx])
                    results[idx] = val

            if return_type == 'list':
                yield results
            elif return_type == 'dict':
                temp = {}
                for pos, i in enumerate(results):
                    temp[pos_to_outname[pos]] = i
                yield temp

    return iterator_fn
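
# Behavior of _check_and_adapt_shape_dtype above in brief (a sketch):
#
#     attr = [[-1, 3], 'int64']   # -1 (or None) marks a free dimension
#     _check_and_adapt_shape_dtype([[1, 2, 3]], attr)
#     # -> np.ndarray of shape (1, 3); rank, fixed dims and dtype all match
#
# List inputs are converted with np.array first, and float64 arrays are
# cast down to float32 before the dtype check.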
def create_multihead_inference_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, names, outname_to_pos, task_name2id, dev_count=1):

    def iterator(task_name):
        while True:
            id = task_name2id[task_name]
            # id = np.random.choice(task_ids, p=weights)
            task_id_tensor = np.array([id]).astype("int64")

            for i in range(dev_count):
                outputs = next(iterators[id])  # dict type
                prefix = iterator_prefixes[id]
                results = {}
                results['__task_id'] = task_id_tensor
                for outname, val in outputs.items():
                    task_outname = prefix + '.' + outname

                    if outname in names[id]:
                        idx = outname_to_pos[id][outname]
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=outname + ': ')
                        results[outname] = val

                    if task_outname in names[id]:
                        idx = outname_to_pos[id][task_outname]
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=task_outname + ': ')
                        results[task_outname] = val

                yield results

    return iterator


def create_multihead_iterator_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, mrs, names, outname_to_pos, dev_count=1, keep_one_task=True):
    task_ids = range(len(iterators))
    weights = [mr / float(sum(mrs)) for mr in mrs]
    if not keep_one_task:
        dev_count = 1

    def iterator():
        while True:
            id = np.random.choice(task_ids, p=weights)
            task_id_tensor = np.array([id]).astype("int64")

            for i in range(dev_count):
                outputs = next(iterators[id])  # dict type
                prefix = iterator_prefixes[id]
                results = {}
                results['__task_id'] = task_id_tensor
                for outname, val in outputs.items():
                    task_outname = prefix + '.' + outname

                    if outname in names[id]:
                        idx = outname_to_pos[id][outname]
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=outname + ': ')
                        results[outname] = val

                    if task_outname in names[id]:
                        idx = outname_to_pos[id][task_outname]
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[id][idx], message=task_outname + ': ')
                        results[task_outname] = val

                yield results

    return iterator


def create_joint_iterator_fn(iterators, iterator_prefixes, joint_shape_and_dtypes, mrs, outname_to_pos, dev_count=1, keep_one_task=True, verbose=0):
    """
    joint_shape_and_dtypes: essentially derived from the attr settings of the
        backbone and the task paradigm, with the -1 (variable) dimensions
        auto-filled from the reader's attrs; validating it against the
        iterator therefore gives a runtime correctness check of each batch.
    """
    task_ids = range(len(iterators))
    weights = [mr / float(sum(mrs)) for mr in mrs]
    if not keep_one_task:
        dev_count = 1

    results = _zero_batch(joint_shape_and_dtypes)
    outbuf = {}
    for id in task_ids:
        outputs = next(iterators[id])  # dict type
        outbuf[id] = outputs
        prefix = iterator_prefixes[id]
        for outname, val in outputs.items():
            task_outname = prefix + '.' + outname

            if outname in outname_to_pos:
                idx = outname_to_pos[outname]
                val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=outname + ': ')
                results[idx] = val

            if task_outname in outname_to_pos:
                idx = outname_to_pos[task_outname]
                val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=task_outname + ': ')
                results[idx] = val

    fake_batch = results
    dev_count_bak = dev_count

    def iterator():
        v = verbose
        has_show_warn = False
        while True:
            id = np.random.choice(task_ids, p=weights)
            results = fake_batch
            if v > 0:
                print('----- debug joint iterator -----')
                print('sampled task id: ' + str(id))
            task_id_tensor = np.array([[id]]).astype("int64")

            for i in range(dev_count):
                results[outname_to_pos['__task_id']] = task_id_tensor
                assert outname_to_pos['__task_id'] == 0

                if id in outbuf:
                    outputs = outbuf[id]
                    del outbuf[id]
                else:
                    outputs = next(iterators[id])  # dict type

                if 'token_ids' in outputs:
                    val1 = len(outputs['token_ids'])
                    val = _check_and_adapt_shape_dtype([val1], [[1], 'int64'])
                    results[outname_to_pos['batch_size']] = val

                    val2 = len(outputs['token_ids'][0])
                    val = _check_and_adapt_shape_dtype([val2], [[1], 'int64'])
                    results[outname_to_pos['seqlen']] = val

                    val = _check_and_adapt_shape_dtype([val1 * val2], [[1], 'int64'])
                    results[outname_to_pos['batchsize_x_seqlen']] = val
                else:
                    if not has_show_warn:
                        print('WARNING: token_ids not found in current batch, failed to yield batch_size, seqlen and batchsize_x_seqlen. (This message will be shown only once.)')
                        has_show_warn = True

                prefix = iterator_prefixes[id]
                for outname, val in outputs.items():
                    if v > 0:
                        print('reader generate: ' + outname)
                    task_outname = prefix + '.' + outname

                    if outname in outname_to_pos:
                        idx = outname_to_pos[outname]
                        if v > 0:
                            print(outname + ' is inserted at idx ' + str(idx))
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=outname + ': ')
                        results[idx] = val

                    if task_outname in outname_to_pos:
                        idx = outname_to_pos[task_outname]
                        if v > 0:
                            print(task_outname + ' is inserted at idx ' + str(idx))
                        val = _check_and_adapt_shape_dtype(val, joint_shape_and_dtypes[idx], message=task_outname + ': ')
                        results[idx] = val

                if v > 0:
                    print('yielded batch len and shapes:')
                    print(len(results))
                    for i in results:
                        print(np.shape(i))
                    print('')
                    v -= 1
                yield results

    return iterator


def merge_input_attrs(backbone_attr, task_attrs, insert_taskid=True, insert_batchsize=False, insert_seqlen=False, insert_batchsize_x_seqlen=False):
    """
    Args:
        task_attrs(list[dict]|dict): task input attributes, key=attr_name,
            val=[shape, dtype]; supports both a single task and nested tasks.
    """
    if isinstance(task_attrs, dict):
        task_attrs = [task_attrs]

    ret = []
    names = []
    start = 0
    if insert_taskid:
        ret.append(([1, 1], 'int64'))
        names.append('__task_id')
        start += 1

    if insert_batchsize:
        ret.append(([1], 'int64'))
        names.append('batch_size')
        start += 1

    if insert_seqlen:
        ret.append(([1], 'int64'))
        names.append('seqlen')
        start += 1

    if insert_batchsize_x_seqlen:
        ret.append(([1], 'int64'))
        names.append(u'batchsize_x_seqlen')
        start += 1

    names += sorted(backbone_attr.keys())
    ret.extend([backbone_attr[k] for k in names[start:]])
    name_to_position = {}
    # pos=0 is reserved for task_id, thus the backbone attrs start from 1
    for pos, k in enumerate(names):
        name_to_position[k] = pos
    for task_attr in task_attrs:
        task_names = sorted(task_attr.keys())
        names.extend(task_names)
        ret.extend([task_attr[k] for k in task_names])
        for pos, k in enumerate(task_names, start=len(name_to_position)):
            name_to_position[k] = pos
    return names, ret, name_to_position
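
# merge_input_attrs at a glance (a sketch with hypothetical attrs):
#
#     backbone_attr = {'token_ids': [[-1, -1], 'int64']}
#     task_attr     = {'label_ids': [[-1], 'int64']}
#     names, attrs, pos = merge_input_attrs(backbone_attr, task_attr)
#     # names -> ['__task_id', 'token_ids', 'label_ids']
#     # pos   -> {'__task_id': 0, 'token_ids': 1, 'label_ids': 2}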
================================================
FILE: paddlepalm/utils/saver.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function

import os
import six
import ast
import copy
import tarfile
import shutil
import numpy as np
import paddle.fluid as fluid


def init_checkpoint(exe, init_checkpoint_path, main_program, skip_list=[]):
    assert os.path.exists(init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path

    def existed_persistables(var):
        if not fluid.io.is_persistable(var):
            return False
        if var.name in skip_list:
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persistables)
    print("Load model from {}".format(init_checkpoint_path))


def init_pretraining_params(exe, pretraining_params_path, convert, main_program, strict=False):
    assert os.path.exists(pretraining_params_path), "[%s] cannot be found." % pretraining_params_path

    if convert:
        assert os.path.exists(os.path.join(pretraining_params_path, '__palmmodel__')), "__palmmodel__ not found."
        with tarfile.open(os.path.join(pretraining_params_path, '__palmmodel__'), 'r') as f:
            f.extractall(os.path.join(pretraining_params_path, '.temp'))
        log_path = os.path.join(pretraining_params_path, '__palmmodel__')
        pretraining_params_path = os.path.join(pretraining_params_path, '.temp')
    else:
        log_path = pretraining_params_path

    print("Loading pretraining parameters from {}...".format(pretraining_params_path))

    def existed_params(var):
        if not isinstance(var, fluid.framework.Parameter):
            return False
        if not os.path.exists(os.path.join(pretraining_params_path, var.name)):
            if strict:
                raise Exception('Error: {} not found in {}.'.format(var.name, log_path))
            else:
                print('Warning: {} not found in {}.'.format(var.name, log_path))
        return os.path.exists(os.path.join(pretraining_params_path, var.name))

    fluid.io.load_vars(
        exe,
        pretraining_params_path,
        main_program=main_program,
        predicate=existed_params)

    if convert:
        shutil.rmtree(pretraining_params_path)
    print('')
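
# Typical use of init_pretraining_params (a sketch; the executor and path
# are hypothetical):
#
#     exe = fluid.Executor(fluid.CPUPlace())
#     init_pretraining_params(exe, 'pretrain/ernie/params', convert=True,
#                             main_program=fluid.default_main_program())
#
# With convert=True, the packed __palmmodel__ tarball is first extracted
# into a .temp directory, parameters are loaded from there, and the
# directory is removed afterwards.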
================================================
FILE: paddlepalm/utils/textprocess_helper.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def is_whitespace(c):
    if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
        return True
    return False

================================================
FILE: setup.cfg
================================================
[metadata]
name = paddlepalm
author = zhangyiming
author_email = zhangyiming04@baidu.com
version = 2.1.0
description = PaddlePALM
long_description = file: README.md
long_description_content_type = text/markdown
home_page = https://github.com/PaddlePaddle/PALM
license = Apache 2.0
classifier =
    Private :: Do Not Upload
    Programming Language :: Python
    Programming Language :: Python :: 2
    Programming Language :: Python :: 2.7
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.5
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
keywords =
    paddlepaddle
    paddle
    nlp
    pretrain
    multi-task-learning

[options]
packages = find:
include_package_data = True
zip_safe = False

[sdist]
dist_dir = output/dist

[bdist_wheel]
dist_dir = output/dist

[easy_install]
index_url = http://pip.baidu.com/root/baidu/+simple/
================================================
FILE: setup.py
================================================
# -*- coding: UTF-8 -*-
################################################################################
#
# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
################################################################################
"""
Setup script.
Authors: zhouxiangyang(zhouxiangyang@baidu.com)
Date: 2020/2/4 00:00:01
"""
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="paddlepalm",
    version="2.1.0",
    author="PaddlePaddle",
    author_email="zhangyiming04@baidu.com",
    description="a flexible, general and easy-to-use NLP large-scale pretraining and multi-task learning framework.",
    # long_description=long_description,
    # long_description_content_type="text/markdown",
    url="https://github.com/PaddlePaddle/PALM",
    # packages=setuptools.find_packages(),
    packages=['paddlepalm',
              'paddlepalm.backbone',
              'paddlepalm.backbone.utils',
              'paddlepalm.optimizer',
              'paddlepalm.reader',
              'paddlepalm.reader.utils',
              'paddlepalm.head',
              'paddlepalm.distribute',
              'paddlepalm.lr_sched',
              'paddlepalm.tokenizer',
              'paddlepalm.utils'],
    package_dir={'paddlepalm': './paddlepalm',
                 'paddlepalm.backbone': './paddlepalm/backbone',
                 'paddlepalm.backbone.utils': './paddlepalm/backbone/utils',
                 'paddlepalm.optimizer': './paddlepalm/optimizer',
                 'paddlepalm.lr_sched': './paddlepalm/lr_sched',
                 'paddlepalm.distribute': './paddlepalm/distribute',
                 'paddlepalm.reader': './paddlepalm/reader',
                 'paddlepalm.reader.utils': './paddlepalm/reader/utils',
                 'paddlepalm.head': './paddlepalm/head',
                 'paddlepalm.tokenizer': './paddlepalm/tokenizer',
                 'paddlepalm.utils': './paddlepalm/utils'},
    platforms="any",
    license='Apache 2.0',
    classifiers=[
        'License :: OSI Approved :: Apache Software License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
    ],
    install_requires=[
        'paddlepaddle-gpu>=1.8.0'
    ]
)

================================================
FILE: test/test2/config.yaml
================================================
task_instance: "mrqa, mlm4mrqa, match4mrqa"
target_tag: 1, 0, 0
mix_ratio: 1.0, 0.5, 0.5

save_path: "output_model/secondrun"

backbone: "ernie"
backbone_config_path: "../../pretrain_model/ernie/ernie_config.json"

vocab_path: "../../pretrain_model/ernie/vocab.txt"
do_lower_case: True
max_seq_len: 512

batch_size: 4
num_epochs: 2
optimizer: "adam"
learning_rate: 3e-5
warmup_proportion: 0.1
weight_decay: 0.1

print_every_n_steps: 1
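
# (note) During joint training each batch is sampled from one task with
# probability proportional to its mix_ratio (here mrqa is drawn twice as
# often as each auxiliary task); target_tag is presumed to mark target
# tasks (1) versus auxiliary-only ones (0).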
================================================
FILE: test/test2/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':
    max_seqlen = 512
    batch_size = 4
    num_epochs = 2
    lr = 1e-3
    vocab_path = './pretrain/ernie/vocab.txt'
    train_file = './data/cls4mrqa/train.tsv'
    predict_file = './data/cls4mrqa/dev.tsv'
    config = json.load(open('./pretrain/ernie/ernie_config.json'))

    # ernie = palm.backbone.ERNIE(...)
    ernie = palm.backbone.ERNIE.from_config(config)

    # cls_reader2 = palm.reader.cls(train_file_topic, vocab_path, batch_size, max_seqlen)
    # cls_reader3 = palm.reader.cls(train_file_subj, vocab_path, batch_size, max_seqlen)
    # topic_trainer = palm.Trainer('topic_cls', cls_reader2, cls)
    # subj_trainer = palm.Trainer('subj_cls', cls_reader3, cls)

    # Create the readers for this classification task; their arguments control
    # the dataset format, number of files, preprocessing rules, etc.
    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen)
    cls_reader2 = palm.reader.ClassifyReader(vocab_path, max_seqlen)
    print(cls_reader.outputs_attr)

    # Different backbones require different input features from the task
    # reader. For classification, the basic features are token_ids and
    # label_ids, but BERT-style backbones additionally require position,
    # segment, input_mask and so on, so after register_with the reader
    # automatically supplements the fields required by the backbone.
    cls_reader.register_with(ernie)
    cls_reader2.register_with(ernie)
    print(cls_reader.outputs_attr)

    print("preparing data...")
    print(cls_reader.num_examples)
    cls_reader.load_data(train_file, batch_size)
    cls_reader2.load_data(train_file, batch_size)
    print(cls_reader.num_examples)
    print('done!')

    # Create the task heads (e.g. classification, matching, MRC). Each head
    # has required/optional parameters specific to its task. Note that heads
    # are decoupled from readers: any pairing is valid as long as the reader
    # can provide the dataset-side fields the head depends on.
    cls_head = palm.head.Classify(4, 1024, 0.1)
    cls_head2 = palm.head.Classify(4, 1024, 0.1)

    # Create a trainer from each reader and task head. A trainer represents
    # one training task: it maintains the training progress and key task
    # information, performs validity checks, and controls the rules for
    # saving and loading the task's model.
    trainer = palm.Trainer('cls')
    trainer2 = palm.Trainer('senti_cls')
    mh_trainer = palm.MultiHeadTrainer([trainer, trainer2])

    # match4mrqa.reuse_head_with(mrc4mrqa)

    # data_vars = cls_reader.build()
    # output_vars = ernie.build(data_vars)
    # cls_head.build({'backbone': output_vars, 'reader': data_vars})

    loss_var = mh_trainer.build_forward(ernie, [cls_head, cls_head2])

    n_steps = cls_reader.num_examples * num_epochs // batch_size
    warmup_steps = int(0.1 * n_steps)
    print(warmup_steps)
    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)

    adam = palm.optimizer.Adam(loss_var, lr, sched)

    mh_trainer.build_backward(optimizer=adam, weight_decay=0.001)

    # mh_trainer.random_init_params()
    mh_trainer.load_pretrain('pretrain/ernie/params')

    # trainer.train(iterator_fn, print_steps=1, save_steps=5, save_path='outputs', save_type='ckpt,predict')
    mh_trainer.fit_readers_with_mixratio([cls_reader, cls_reader2], 'cls', 2)
    mh_trainer.train(print_steps=1)
    # trainer.save()

================================================
FILE: test/test2/run.sh
================================================
export CUDA_VISIBLE_DEVICES=3
python run.py

================================================
FILE: test/test3/config.yaml
================================================
task_instance: "cls1, cls2, cls3, cls4, cls5, cls6"
task_reuse_tag: 0,0,1,1,0,2

save_path: "output_model/thirdrun"

backbone: "ernie"
backbone_config_path: "../../pretrain_model/ernie/ernie_config.json"

vocab_path: "../../pretrain_model/ernie/vocab.txt"
do_lower_case: True
max_seq_len: 512

batch_size: 4
num_epochs: 2
optimizer: "adam"
learning_rate: 3e-5
warmup_proportion: 0.1
weight_decay: 0.1

print_every_n_steps: 1

================================================
FILE: test/test3/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':
    max_seqlen = 512
    batch_size = 4
    num_epochs = 2
    lr = 1e-3
    vocab_path = './pretrain/ernie/vocab.txt'
    train_file = './data/cls4mrqa/train.tsv'
    predict_file = './data/cls4mrqa/dev.tsv'
    config = json.load(open('./pretrain/ernie/ernie_config.json'))

    # ernie = palm.backbone.ERNIE(...)
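    # NOTE: the backbone built here is for the training phase; since a
    # backbone/head can only be built once, prediction further below uses
    # fresh instances created with phase='pred' (pred_ernie, cls_pred_head).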
    ernie = palm.backbone.ERNIE.from_config(config)

    # cls_reader2 = palm.reader.cls(train_file_topic, vocab_path, batch_size, max_seqlen)
    # cls_reader3 = palm.reader.cls(train_file_subj, vocab_path, batch_size, max_seqlen)
    # topic_trainer = palm.Trainer('topic_cls', cls_reader2, cls)
    # subj_trainer = palm.Trainer('subj_cls', cls_reader3, cls)

    # Create the readers for this classification task; their arguments control
    # the dataset format, number of files, preprocessing rules, etc.
    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen)
    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, phase='predict')
    print(cls_reader.outputs_attr)
    print(predict_cls_reader.outputs_attr)

    # Different backbones require different input features from the task
    # reader. For classification, the basic features are token_ids and
    # label_ids, but BERT-style backbones additionally require position,
    # segment, input_mask and so on, so after register_with the reader
    # automatically supplements the fields required by the backbone.
    cls_reader.register_with(ernie)
    print(cls_reader.outputs_attr)
    print(predict_cls_reader.outputs_attr)

    print("preparing data...")
    print(cls_reader.num_examples)
    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)
    print(cls_reader.num_examples)
    print('done!')

    # Create the task head (e.g. classification, matching, MRC). Each head
    # has required/optional parameters specific to its task. Note that heads
    # are decoupled from readers: any pairing is valid as long as the reader
    # can provide the dataset-side fields the head depends on.
    cls_head = palm.head.Classify(4, 1024, 0.1)

    # Create a trainer from the reader and task head. A trainer represents
    # one training task: it maintains the training progress and key task
    # information, performs validity checks, and controls the rules for
    # saving and loading the task's model.
    trainer = palm.Trainer('senti_cls')

    # match4mrqa.reuse_head_with(mrc4mrqa)

    # data_vars = cls_reader.build()
    # output_vars = ernie.build(data_vars)
    # cls_head.build({'backbone': output_vars, 'reader': data_vars})

    loss_var = trainer.build_forward(ernie, cls_head)
    # controller.build_forward()
    # Error! a head/backbone can only be built once! Try NOT to call
    # build_forward more than once for any Trainer!
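    # Worked numbers for the commented-out warmup schedule below, assuming
    # (hypothetically) that cls_reader.num_examples == 20000:
    #     n_steps      = 20000 * 2 // 4   = 10000
    #     warmup_steps = int(0.1 * 10000) = 1000   (10% linear warmup)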
    # n_steps = cls_reader.num_examples * num_epochs // batch_size
    # warmup_steps = int(0.1 * n_steps)
    # print(warmup_steps)
    # sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    sched = None

    adam = palm.optimizer.Adam(loss_var, lr, sched)

    trainer.build_backward(optimizer=adam, weight_decay=0.001)

    # trainer.random_init_params()
    trainer.load_pretrain('pretrain/ernie/params')

    # trainer.train(iterator_fn, print_steps=1, save_steps=5, save_path='outputs', save_type='ckpt,predict')
    trainer.fit_reader(cls_reader)
    trainer.train(print_steps=1)
    # trainer.save()

    print('prepare to predict...')
    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='pred')
    cls_pred_head = palm.head.Classify(4, 1024, phase='pred')
    trainer.build_predict_forward(pred_ernie, cls_pred_head)
    predict_cls_reader.load_data(predict_file, 8)
    print(predict_cls_reader.num_examples)
    predict_cls_reader.register_with(pred_ernie)
    trainer.fit_reader(predict_cls_reader, phase='predict')
    print('predicting..')
    trainer.predict(print_steps=20)

    # controller = palm.Controller([mrqa, match4mrqa, mlm4mrqa])
    # loss = controller.build_forward(bb, mask_task=[])
    # n_steps = controller.estimate_train_steps(basetask=mrqa, num_epochs=2, batch_size=8, dev_count=4)
    # adam = palm.optimizer.Adam(loss)
    # sched = palm.schedualer.LinearWarmup(learning_rate, max_train_steps=n_steps, warmup_steps=0.1*n_steps)
    #
    # controller.build_backward(optimizer=adam, schedualer=sched, weight_decay=0.001, use_ema=True, ema_decay=0.999)
    # controller.random_init_params()
    # controller.load_pretrain('../../pretrain_model/ernie/params')
    # controller.train()

    # controller = palm.Controller(config='config.yaml', task_dir='tasks', for_train=False)
    # controller.pred('mrqa', inference_model_dir='output_model/secondrun/mrqa/infer_model')

================================================
FILE: test/test3/run.sh
================================================
export CUDA_VISIBLE_DEVICES=3
python run.py