Repository: PaddlePaddle/PALM
Branch: master
Commit: 2555c0e2a5fa
Files: 98
Total size: 443.8 KB
Directory structure:
gitextract_o6rx2q6_/
├── .gitignore
├── README.md
├── README_zh.md
├── customization_cn.md
├── examples/
│ ├── classification/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate.py
│ │ └── run.py
│ ├── matching/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate.py
│ │ ├── process.py
│ │ └── run.py
│ ├── mrc/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate.py
│ │ └── run.py
│ ├── multi-task/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate_intent.py
│ │ ├── evaluate_slot.py
│ │ ├── joint_predict.py
│ │ ├── predict_intent.py
│ │ ├── predict_slot.py
│ │ ├── process.py
│ │ └── run.py
│ ├── predict/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate.py
│ │ └── run.py
│ ├── tagging/
│ │ ├── README.md
│ │ ├── download.py
│ │ ├── evaluate.py
│ │ └── run.py
│ └── train_with_eval/
│ ├── README.md
│ ├── download.py
│ ├── evaluate.py
│ └── run.py
├── paddlepalm/
│ ├── __init__.py
│ ├── _downloader.py
│ ├── backbone/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── base_backbone.py
│ │ ├── bert.py
│ │ ├── ernie.py
│ │ └── utils/
│ │ ├── __init__.py
│ │ └── transformer.py
│ ├── distribute/
│ │ ├── __init__.py
│ │ └── reader.py
│ ├── downloader.py
│ ├── head/
│ │ ├── __init__.py
│ │ ├── base_head.py
│ │ ├── cls.py
│ │ ├── match.py
│ │ ├── mlm.py
│ │ ├── mrc.py
│ │ └── ner.py
│ ├── lr_sched/
│ │ ├── __init__.py
│ │ ├── base_schedualer.py
│ │ ├── slanted_triangular_schedualer.py
│ │ └── warmup_schedualer.py
│ ├── multihead_trainer.py
│ ├── optimizer/
│ │ ├── __init__.py
│ │ ├── adam.py
│ │ └── base_optimizer.py
│ ├── reader/
│ │ ├── __init__.py
│ │ ├── base_reader.py
│ │ ├── cls.py
│ │ ├── match.py
│ │ ├── mlm.py
│ │ ├── mrc.py
│ │ ├── seq_label.py
│ │ └── utils/
│ │ ├── __init__.py
│ │ ├── batching4bert.py
│ │ ├── batching4ernie.py
│ │ ├── mlm_batching.py
│ │ ├── mrqa_helper.py
│ │ └── reader4ernie.py
│ ├── tokenizer/
│ │ ├── __init__.py
│ │ ├── bert_tokenizer.py
│ │ └── ernie_tokenizer.py
│ ├── trainer.py
│ └── utils/
│ ├── __init__.py
│ ├── basic_helper.py
│ ├── config_helper.py
│ ├── plot_helper.py
│ ├── print_helper.py
│ ├── reader_helper.py
│ ├── saver.py
│ └── textprocess_helper.py
├── setup.cfg
├── setup.py
└── test/
├── test2/
│ ├── config.yaml
│ ├── run.py
│ └── run.sh
└── test3/
├── config.yaml
├── run.py
└── run.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.pyc
paddlepalm.egg-info
data
__pycache__
*egg-info
pretrain_model
pretrain
output*
output_model
build
dist
paddle_palm.egg-info
mrqa_output
*.log
================================================
FILE: README.md
================================================
# PaddlePALM
English | [简体中文](./README_zh.md)
PaddlePALM (PArallel Learning from Multi-tasks) is a fast, flexible, extensible and easy-to-use NLP large-scale pretraining and multi-task learning framework. PaddlePALM is a high-level framework **aiming at rapidly developing high-performance NLP models**.
With PaddlePALM, it is easy to efficiently explore robust learning of NLP models with multiple auxiliary tasks. For example, based on PaddlePALM, the resulting robust MRC model, [D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET), achieved **1st place** in the [EMNLP2019 MRQA](https://mrqa.github.io) track.
<p align="center">
<img src="https://tva1.sinaimg.cn/large/006tNbRwly1gbjkuuwrmlj30hs0hzdh2.jpg" alt="Sample" width="300" height="333">
<p align="center">
<em>MRQA2019 Leaderboard</em>
</p>
</p>
Beyond the research scope, PaddlePALM has been applied in **Baidu Search Engine** to achieve more accurate user-query understanding and answer mining, which demonstrates the high reliability and performance of PaddlePALM.
#### Features:
- **Easy-to-use:** with PALM, *8 steps* are enough to complete a typical NLP task. Moreover, all basic components (e.g., the model backbone, dataset reader, task output head, optimizer...) are decoupled, so any component can be replaced by another candidate with only minor changes to your code.
- **Built-in popular NLP backbones and pre-trained models:** multiple state-of-the-art general-purpose model architectures and pretrained models (e.g., BERT, ERNIE, RoBERTa, ...) are built in.
- **Easy multi-task learning:** only one API is needed to jointly train several tasks with parameter reuse.
- **Multi-GPU train/eval support:** automatically recognizes and adapts to multi-GPU mode to accelerate training and inference.
- **Pre-training friendly:** self-supervised tasks (e.g., masked language model) are built in to facilitate pre-training, making it easy to train from scratch.
- **Easy to customize:** supports customized development of any component (e.g., backbone, task head, reader and optimizer) while reusing pre-defined ones, which gives developers high flexibility and efficiency to adapt to diverse NLP scenarios.
You can easily reproduce the following competitive results with little code, covering most NLP tasks such as classification, matching, sequence labeling, reading comprehension, dialogue understanding and so on. More details can be found in `examples`.
<table>
<tbody>
<tr>
<th><strong>Dataset</strong>
<br></th>
<th colspan="2"><center><strong>chnsenticorp</strong></center></th>
<th colspan="2"><center><strong>Quora Question Pairs matching</strong><center></th>
<th colspan="1"><strong>MSRA-NER<br>(SIGHAN2006)</strong></th>
<th colspan="2"><strong>CMRC2018</strong></th>
</tr>
<tr>
<td rowspan="2">
<p>
<strong>Metric</strong>
<br></p>
</td>
<td colspan="1">
<center><strong>accuracy</strong></center>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<center><strong>accuracy</strong></center>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<strong>em</strong>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<br></td>
</tr>
<tr>
<td colspan="2" width="">
<strong>test</strong>
<br></td>
<td colspan="2" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="2" width="">
<strong>dev</strong>
<br></td>
</tr>
<tr>
<td><strong>ERNIE Base</strong></td>
<td>95.8</td>
<td>95.8</td>
<td>86.2</td>
<td>82.2</td>
<td>99.2</td>
<td>64.3</td>
<td>85.2</td>
</tr>
</tbody>
</table>
## Overview
<p align="center">
<img src="https://tva1.sinaimg.cn/large/0082zybply1gbyo8d4ltoj31ag0n3tby.jpg" alt="Sample" width="600px" height="auto">
<p align="center">
<em>Architecture Diagram</em>
</p>
</p>
PaddlePALM is a well-designed high-level NLP framework. You can efficiently achieve **supervised learning, unsupervised/self-supervised learning, multi-task learning and transfer learning** with little code based on PaddlePALM. The PaddlePALM architecture has three layers, from bottom to top: the component layer, the trainer layer and the high-level trainer layer.
In the component layer, PaddlePALM supplies six **decoupled** components for building an NLP task. Each component contains rich pre-defined classes and a `Base` class. The pre-defined classes target typical NLP tasks, while the base class helps users develop a new class (based on pre-defined ones or from scratch).
The trainer layer establishes a computation graph with the selected components and performs training and prediction. Training strategies, model saving and loading, and evaluation and prediction procedures are described in this layer. Note that a trainer can only process one task.
The high-level trainer layer is for complicated learning and inference strategies, e.g., multi-task learning. You can add auxiliary tasks to train robust NLP models (improving a model's test-set and out-of-domain performance), or jointly train multiple related tasks to gain more performance for each task.
| module | illustration |
| - | - |
| **paddlepalm** | an open-source NLP pretraining and multi-task learning framework, built on PaddlePaddle. |
| **paddlepalm.reader** | a collection of elastic task-specific dataset readers. |
| **paddlepalm.backbone** | a collection of classic NLP representation models, e.g., BERT, ERNIE, RoBERTa. |
| **paddlepalm.head** | a collection of task-specific output layers. |
| **paddlepalm.lr_sched** | a collection of learning-rate schedulers. |
| **paddlepalm.optimizer** | a collection of optimizers. |
| **paddlepalm.downloader** | a download module for pretrained models with config and vocab files. |
| **paddlepalm.Trainer** | the core unit for starting a single-task training/predicting session. A trainer builds the computation graph, manages the training and evaluation process, and handles model/checkpoint saving and pretrain-model/checkpoint loading. |
| **paddlepalm.MultiHeadTrainer** | the core unit for starting a multi-task training/predicting session. A MultiHeadTrainer is built on top of several Trainers. Beyond inheriting from Trainer, it additionally achieves model backbone reuse across tasks, trainer sampling for multi-task learning, and multi-head inference for effective evaluation and prediction. |
## Installation
PaddlePALM supports both python2 and python3, Linux and Windows, CPU and GPU. The preferred way to install PaddlePALM is via `pip`. Just run the following command in your shell.
```bash
pip install paddlepalm
```
### Installing via source
```shell
git clone https://github.com/PaddlePaddle/PALM.git
cd PALM && python setup.py install
```
### Library Dependencies
- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.7.0 (Please refer to [this](http://www.paddlepaddle.org/#quick-start) to install)
### Downloading pretrain models
We incorporate many pretrained models for initializing model backbone parameters. Training a big NLP model, e.g., a 12-layer Transformer, from a pretrained model is in practice much more effective than training from randomly initialized parameters. To see all available pretrained models and download one, run the following code in a python interpreter (enter the command `python` in your shell):
```python
>>> from paddlepalm import downloader
>>> downloader.ls('pretrain')
Available pretrain items:
=> RoBERTa-zh-base
=> RoBERTa-zh-large
=> ERNIE-v2-en-base
=> ERNIE-v2-en-large
=> XLNet-cased-base
=> XLNet-cased-large
=> ERNIE-v1-zh-base
=> ERNIE-v1-zh-base-max-len-512
=> BERT-en-uncased-large-whole-word-masking
=> BERT-en-cased-large-whole-word-masking
=> BERT-en-uncased-base
=> BERT-en-uncased-large
=> BERT-en-cased-base
=> BERT-en-cased-large
=> BERT-multilingual-uncased-base
=> BERT-multilingual-cased-base
=> BERT-zh-base
>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')
...
```
## Usage
#### Quick Start
8 steps to start a typical NLP training task.
1. use `paddlepalm.reader` to create a *reader* for dataset loading and input-feature generation, then call the `reader.load_data` method to load your training data.
2. use `paddlepalm.backbone` to create a model *backbone* to extract text features (e.g., contextual word embeddings, sentence embeddings).
3. register your *reader* with your *backbone* through the `reader.register_with` method. After this step, your reader is able to yield the input features used by the backbone.
4. use `paddlepalm.head` to create a task output *head*. This head provides the task loss for training and the prediction results for model inference.
5. create a task *trainer* with `paddlepalm.Trainer`, then build the forward graph with the backbone and task head (created in steps 2 and 4) through `trainer.build_forward`.
6. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `trainer.build_backward`.
7. fit the prepared reader and data (from step 1) to the trainer with the `trainer.fit_reader` method.
8. load a pretrained model with `trainer.load_pretrain`, or load a checkpoint with `trainer.load_ckpt`, or do neither to train from scratch, then start training with `trainer.train`.
For more implementation details, see the following demos:
- [Sentiment Classification](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)
- [Question Pairs matching](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)
- [Named Entity Recognition](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)
- [SQuAD-like Machine Reading Comprehension](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc).
#### Multi-task Learning
To run in multi-task learning mode:
1. repeatedly create the components (i.e., reader, backbone and head) for each task, following steps 1~5 above.
2. create empty trainers (each trainer corresponds to one task) and pass them to create a `MultiHeadTrainer`.
3. build the multi-task forward graph with the `multi_head_trainer.build_forward` method.
4. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `multi_head_trainer.build_backward`.
5. fit all prepared readers and data to the multi_head_trainer with the `multi_head_trainer.fit_readers` method.
6. load a pretrained model with `multi_head_trainer.load_pretrain`, or load a checkpoint with `multi_head_trainer.load_ckpt`, or do neither to train from scratch, then start training with `multi_head_trainer.train`.
The save/load and predict operations of a multi_head_trainer are the same as those of a trainer.
For more implementation details with `multi_head_trainer`, see
- [ATIS: joint training of dialogue intent recognition and slot filling](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)
#### Save models
To save models/checkpoints and logs during training, just call the `trainer.set_saver` method. For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples).
#### Evaluation/Inference
To predict/evaluate after a training stage, just create another reader, backbone and head instance with `phase='predict'` (repeating steps 1~4 above). Then predict with the `predict` method of the trainer (no need to create another trainer). For more implementation details, see [this](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict).
If you want to evaluate during the training process, use `trainer.train_one_step()` instead of `trainer.train()`. `trainer.train_one_step(batch)` trains only one step, so you can insert evaluation code at any point of the training process. The argument `batch` can be fetched from `trainer.get_one_batch`.
PaddlePALM also supports multi-head inference; please refer to `examples/multi-task/joint_predict.py`.
#### Play with Multiple GPUs
If there are multiple GPUs in your environment, you can control the number and indices of the GPUs used through the environment variable [CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/). For example, if there are 4 GPUs in your environment, indexed 0,1,2,3, you can run on GPU2 only with the following command:
```shell
CUDA_VISIBLE_DEVICES=2 python run.py
```
Multiple GPU indices should be separated with `,`. For example, to run with GPU2 and GPU3, use the following command:
```shell
CUDA_VISIBLE_DEVICES=2,3 python run.py
```
In multi-GPU mode, PaddlePALM automatically splits each batch across the available cards. For example, if `batch_size` is set to 64 and 4 cards are visible to PaddlePALM, then the batch size on each card is actually 64/4=16. Therefore, when running with multiple cards, **you need to ensure that the configured batch_size is divisible by the number of cards.**
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).
================================================
FILE: README_zh.md
================================================
# PaddlePALM
[English](./README.md) | 简体中文
PaddlePALM (PArallel Learning from Multi-tasks) is a flexible, general and easy-to-use NLP large-scale pretraining and multi-task learning framework. PALM is a high-level framework aiming at **rapidly developing high-performance NLP models**.
With PaddlePALM, it is easy and flexible to explore highly robust reading comprehension models trained with various auxiliary tasks. [D-Net](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET), a model trained with PALM, won first place in the [EMNLP2019 MRQA](https://mrqa.github.io/) international reading comprehension evaluation.
<p align="center">
<img src="https://tva1.sinaimg.cn/large/006tNbRwly1gbjkuuwrmlj30hs0hzdh2.jpg" alt="Sample" width="300" height="333">
<p align="center">
<em>MRQA2019 Leaderboard</em>
</p>
</p>
Beyond lowering the cost of NLP research, PaddlePALM has been applied in **Baidu Search Engine**, effectively improving the accuracy of user-query understanding and the quality of mined answers, with high reliability and high training/inference performance.
#### Features:
- **Easy-to-use:** with PALM, *8 steps* are enough to implement a typical NLP task. Moreover, the model backbone, dataset reader and task output layers are decoupled, so any component can be replaced by another candidate with fairly small changes to your code.
- **Multi-task learning support:** *6 steps* to implement a multi-task learning job.
- **Support for large-scale tasks and pre-training:** can automatically exploit multiple GPUs to accelerate training and inference; distributed training on clusters requires little code.
- **Popular NLP backbones and pre-trained models:** multiple state-of-the-art general-purpose model architectures and pretrained models (e.g., BERT, ERNIE, RoBERTa) are built in.
- **Easy to customize:** supports customized development of any component (e.g., backbone, task head, reader and optimizer) while reusing pre-defined components, which gives developers high flexibility and efficiency to adapt to diverse NLP scenarios.
You can easily reproduce strong performance with little code, covering most NLP tasks such as classification, matching, sequence labeling, reading comprehension, dialogue understanding and so on. More details can be found in `examples`.
<table>
<tbody>
<tr>
<th><strong>Dataset</strong>
<br></th>
<th colspan="2"><center><strong>chnsenticorp</strong></center></th>
<th colspan="2"><center><strong>Quora Question Pairs matching</strong><center></th>
<th colspan="1"><strong>MSRA-NER<br>(SIGHAN2006)</strong></th>
<th colspan="2"><strong>CMRC2018</strong></th>
</tr>
<tr>
<td rowspan="2">
<p>
<strong>Metric</strong>
<br></p>
</td>
<td colspan="1">
<center><strong>accuracy</strong></center>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<center><strong>accuracy</strong></center>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<strong></strong>
<br></td>
<td colspan="1">
<strong>em</strong>
<br></td>
<td colspan="1">
<strong>f1-score</strong>
<br></td>
</tr>
<tr>
<td colspan="2" width="">
<strong>test</strong>
<br></td>
<td colspan="2" width="">
<strong>test</strong>
<br></td>
<td colspan="1" width="">
<strong>test</strong>
<br></td>
<td colspan="2" width="">
<strong>dev</strong>
<br></td>
</tr>
<tr>
<td><strong>ERNIE Base</strong></td>
<td>95.8</td>
<td>95.8</td>
<td>86.2</td>
<td>82.2</td>
<td>99.2</td>
<td>64.3</td>
<td>85.2</td>
</tr>
</tbody>
</table>
## Package Overview
<p align="center">
<img src="https://tva1.sinaimg.cn/large/0082zybply1gbyo8d4ltoj31ag0n3tby.jpg" alt="Sample" width="600px" height="auto">
<p align="center">
<em>PALM Architecture Diagram</em>
</p>
</p>
PaddlePALM is a well-designed high-level NLP framework. Lightweight code based on PaddlePALM can efficiently implement **supervised learning, unsupervised/self-supervised learning, multi-task learning and transfer learning**. The PaddlePALM architecture has three layers, from bottom to top: the component layer, the trainer layer and the high-level trainer layer.
In the component layer, PaddlePALM provides six **decoupled** components for implementing NLP tasks. Each component contains rich pre-defined classes and one base class. The pre-defined classes target typical NLP tasks, while the base class helps users develop new classes (based on the pre-defined classes or the base class).
The trainer layer builds the computation graph with the selected components and performs training and prediction. This layer describes the training strategy, model saving and loading, and the evaluation and prediction procedures. One trainer can only handle one task.
The high-level trainer layer is for complicated learning and inference strategies, such as multi-task learning. You can add auxiliary tasks to train robust NLP models (improving a model's test-set and out-of-domain performance), or jointly train multiple related tasks to achieve higher performance for each task.
| Module | Description |
| - | - |
| **paddlepalm** | a high-level NLP pretraining and multi-task learning framework built on PaddlePaddle. |
| **paddlepalm.reader** | pre-defined dataset reading and preprocessing tools. |
| **paddlepalm.backbone** | pre-defined backbone networks, e.g., BERT, ERNIE, RoBERTa. |
| **paddlepalm.head** | pre-defined task output layers. |
| **paddlepalm.lr_sched** | pre-defined learning-rate scheduling strategies. |
| **paddlepalm.optimizer** | pre-defined optimizers. |
| **paddlepalm.downloader** | the management and download module for pretrained models. |
| **paddlepalm.Trainer** | the single-task training/prediction unit. A trainer builds the computation graph, manages the training and evaluation process, and handles model/checkpoint saving and pretrain-model/checkpoint loading. |
| **paddlepalm.MultiHeadTrainer** | the module for multi-task training/prediction. A MultiHeadTrainer is built on top of several Trainers and implements backbone reuse across tasks, multi-task learning, multi-task inference, etc. |
## Installation
PaddlePALM supports python2 and python3, Linux and Windows, CPU and GPU. The preferred way to install PaddlePALM is via `pip`. Just run the following command:
```bash
pip install paddlepalm
```
### Installing from source
```shell
git clone https://github.com/PaddlePaddle/PALM.git
cd PALM && python setup.py install
```
### Library Dependencies
- Python >= 2.7
- cuda >= 9.0
- cudnn >= 7.0
- PaddlePaddle >= 1.7.0 (please refer to the [installation guide](http://www.paddlepaddle.org/#quick-start))
### Downloading pretrained models
We provide many pretrained models for initializing the backbone parameters. Training a big NLP model, such as a 12-layer Transformer, from a pretrained model is in practice much more effective than from randomly initialized parameters. To see all available pretrained models and download one, run the following code in a python interpreter (enter the command `python` in your shell):
```python
>>> from paddlepalm import downloader
>>> downloader.ls('pretrain')
Available pretrain items:
=> RoBERTa-zh-base
=> RoBERTa-zh-large
=> ERNIE-v2-en-base
=> ERNIE-v2-en-large
=> XLNet-cased-base
=> XLNet-cased-large
=> ERNIE-v1-zh-base
=> ERNIE-v1-zh-base-max-len-512
=> BERT-en-uncased-large-whole-word-masking
=> BERT-en-cased-large-whole-word-masking
=> BERT-en-uncased-base
=> BERT-en-uncased-large
=> BERT-en-cased-base
=> BERT-en-cased-large
=> BERT-multilingual-uncased-base
=> BERT-multilingual-cased-base
=> BERT-zh-base
>>> downloader.download('pretrain', 'BERT-en-uncased-base', './pretrain_models')
...
```
## Usage
#### Quick Start
8 steps to start a typical NLP training task.
1. use `paddlepalm.reader` to create a *reader* for dataset loading and input-feature generation, then call the `reader.load_data` method to load the training data.
2. use `paddlepalm.backbone` to create a model *backbone* to extract text features (e.g., contextual word embeddings, sentence embeddings).
3. register the *reader* with the backbone through the `reader.register_with` method. After this step, the reader is able to yield the input features used by the backbone.
4. use `paddlepalm.head` to create a task *head*, which provides the task loss for training and the prediction results for model inference.
5. use `paddlepalm.Trainer` to create a task *trainer*, then build the forward graph containing the backbone and the task head (created in steps 2 and 4) through `trainer.build_forward`.
6. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `trainer.build_backward`.
7. use `trainer.fit_reader` to feed the prepared reader and data (from step 1) to the trainer.
8. load a pretrained model with `trainer.load_pretrain`, or load a checkpoint with `trainer.load_ckpt`, or load no trained parameters at all, then start training with `trainer.train`.
For more implementation details, see the demos:
- [Sentiment Analysis](https://github.com/PaddlePaddle/PALM/tree/master/examples/classification)
- [Quora Question Pairs matching](https://github.com/PaddlePaddle/PALM/tree/master/examples/matching)
- [Named Entity Recognition](https://github.com/PaddlePaddle/PALM/tree/master/examples/tagging)
- [SQuAD-like Machine Reading Comprehension](https://github.com/PaddlePaddle/PALM/tree/master/examples/mrc)
#### Multi-task Learning
To run in multi-task learning mode:
1. repeatedly create the components (following steps 1~5 above for each task).
2. create empty `Trainer`s (each `Trainer` corresponds to one task) and use them to create a `MultiHeadTrainer`.
3. build the multi-task forward graph with `multi_head_trainer.build_forward`.
4. use `paddlepalm.optimizer` (and `paddlepalm.lr_sched` if necessary) to create an *optimizer*, then build the backward graph through `multi_head_trainer.build_backward`.
5. use `multi_head_trainer.fit_readers` to feed all prepared readers and data to the `multi_head_trainer`.
6. load a pretrained model with `multi_head_trainer.load_pretrain`, or load a checkpoint with `multi_head_trainer.load_ckpt`, or load no trained parameters at all, then start training with `multi_head_trainer.train`.
The save/load and predict operations of a multi_head_trainer are the same as those of a trainer.
For more implementation details of `multi_head_trainer`, see
- [ATIS: joint training of dialogue intent recognition and slot filling](https://github.com/PaddlePaddle/PALM/tree/master/examples/multi-task)
#### Setting up a saver
To save models/checkpoints and logs during training, call the `trainer.set_saver` method. For more implementation details, see [here](https://github.com/PaddlePaddle/PALM/tree/master/examples).
#### Evaluation/Prediction
To predict and evaluate after training, just create additional reader, backbone and head instances with `phase='predict'` (repeating steps 1~4 above). Then predict with the trainer's `predict` method (no additional trainer is needed). For more implementation details, see [here](https://github.com/PaddlePaddle/PALM/tree/master/examples/predict).
#### Using multiple GPUs
If there are multiple GPUs in your environment, you can control the number and indices of the GPUs used through the environment variable [CUDA_VISIBLE_DEVICES](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/). For example, if there are 4 GPUs in your environment, indexed 0, 1, 2, 3, you can run the following command to use GPU2 only:
```shell
CUDA_VISIBLE_DEVICES=2 python run.py
```
Multiple GPU indices should be separated with `,`. For example, to use GPU2 and GPU3, run the following command:
```shell
CUDA_VISIBLE_DEVICES=2,3 python run.py
```
In multi-GPU mode, PaddlePALM automatically distributes each batch of data across the available GPUs. For example, if `batch_size` is set to 64 and 4 GPUs are available to PaddlePALM, the batch size on each GPU is actually 64/4=16. Therefore, **when using multiple GPUs, you need to ensure that batch_size is divisible by the number of GPUs exposed to PALM**.
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](https://github.com/PaddlePaddle/models/blob/develop/LICENSE).
================================================
FILE: customization_cn.md
================================================
# PALM Component Customization Tutorial
PALM supports customization of the following components:
- **head**
Defines a new task output head, which receives inputs from the backbone and the reader, and outputs the loss in the training phase and the prediction results in the prediction phase. Examples: a classification head, a sequence labeling head, a machine reading comprehension head.
- **backbone**
Defines a new backbone network, which receives text-related sequence features (such as token ids) from the reader and outputs vector representations of the text (such as word embeddings, contextualized word representations, sentence embeddings). Examples: a BERT encoder, a CNN encoder.
- **reader**
Defines a new dataset loading and preprocessing module, which receives raw dataset files (plain text, raw labels, etc.) as input and outputs text-related sequence features (such as token ids, position ids). Examples: a text classification dataset module; a text matching dataset module.
- **optimizer**
Defines a new optimizer.
- **lr_sched**
Defines a new learning-rate scheduling strategy.
Each component in PALM is described with a class, so internal state (member variables) is allowed.
To add a component of some type, you only need to implement the methods described in the interface class located in that component type's directory. If the new component is similar in functionality to one of the framework's built-in components, it may inherit from that built-in component and override only the methods that need to change.
### Customizing a head
The interface class of head is located at `paddlepalm/head/base_head.py`.
The interface class is defined as follows:
```python
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import json
import copy
class Head(object):

    def __init__(self, phase='train'):
        """Constructs a task head; the constructor must accept at least a `phase` argument.

        Note: an implementation of this constructor must call the base-class constructor
        to create the necessary framework-built-in member variables.

        Args:
            phase: str. The running phase in which this head is used; currently the
                training phase `train` and the prediction phase `predict` are supported.
        """
        self._stop_gradient = {}
        self._phase = phase
        self._prog = None
        self._results_buffer = []

    @property
    def inputs_attrs(self):
        """Declaration of the step-level task inputs.

        Describes the outputs of the reader, the backbone, and other task heads that this
        head depends on (fetched once per step). Described with a dict whose keys are the
        components producing the outputs (e.g., 'reader', 'backbone') and whose values
        are the sets of outputs this head needs from that component. Each output set is
        itself a dict mapping an output name (which must exist in the component's output
        set) to its shape and dtype. When a dimension of an output has variable length,
        set that dimension of the shape to -1.

        Return:
            dict. The step-level inputs this head depends on, i.e., the outputs of the
            other components."""
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Declaration of the step-level task outputs.

        Describes the outputs of this head (produced once per step), including each
        output's name, shape and dtype. The outputs are added to the fetch_list, so
        their values are computed at every training/inference step and can be passed to
        the batch_postprocess method for per-step post-processing. When an output is a
        scalar type (e.g., str, int, float), set its shape to an empty list []; when a
        dimension has variable length, set that dimension of the shape to -1.

        Return:
            dict. The outputs produced by this head. Note that in the training phase an
            output named `loss` must be included.
        """
        raise NotImplementedError()

    @property
    def epoch_inputs_attrs(self):
        """Declaration of the epoch-level task inputs.

        Describes the outputs of the reader, the backbone, and other task heads that
        this head depends on (produced once at the end of each epoch), such as the
        complete sample set or the number of valid samples. The dict format is the same
        as in inputs_attrs. When a dimension of an output has variable length, set that
        dimension of the shape to -1.

        Return:
            dict. The epoch-level inputs this head depends on.
        """
        return {}

    def build(self, inputs, scope_name=""):
        """Builds the computation graph of this head.

        Maps the static-graph Variables of the input sets described by inputs_attrs into
        static-graph Variable outputs conforming to outputs_attr.

        Args:
            inputs: dict. Maps the object names in inputs_attrs to graph Variables;
                inputs contains at least the objects defined in inputs_attrs.
        Return:
            the graph Variables to output. They are added to the fetch_list, so their
            runtime values are computed at each training/inference step and passed to
            the postprocess methods for user handling.
        """
        raise NotImplementedError()

    def batch_postprocess(self, rt_outputs):
        """Batch/step-level post-processing.

        After each training or inference step, post-processes the runtime results of
        this head's outputs on the current batch. By default, the results are stored in
        the buffer self._results_buffer."""
        if isinstance(rt_outputs, dict):
            keys = rt_outputs.keys()
            vals = [rt_outputs[k] for k in keys]
            lens = [len(v) for v in vals]
            if len(set(lens)) == 1:
                results = [dict(zip(keys, i)) for i in zip(*vals)]
                self._results_buffer.extend(results)
                return results
            else:
                print('WARNING: irregular output results. visualize failed.')
                self._results_buffer.append(rt_outputs)
        return None

    def reset(self):
        """Clears this head's buffer (the results accumulated during training or inference)."""
        self._results_buffer = []

    def get_results(self):
        """Returns the results accumulated by this head so far."""
        return copy.deepcopy(self._results_buffer)

    def epoch_postprocess(self, post_inputs=None, output_dir=None):
        """Epoch-level post-processing.

        At the end of each training or inference epoch, post-processes the accumulated
        per-sample results. By default, when output_dir is None the results are printed
        to the screen; when output_dir is given, the results are written into that
        directory, with the head's phase as the file name.

        Args:
            post_inputs: carries the contents of the corresponding inputs when the
                declared epoch_inputs_attrs is not empty.
            output_dir: the directory for saving the accumulated results.
        """
        if output_dir is None:
            for i in self._results_buffer:
                print(i)
        else:
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            with open(os.path.join(output_dir, self._phase), 'w') as writer:
                for i in self._results_buffer:
                    writer.write(json.dumps(i)+'\n')
```
Based on the base class, a brand-new Head must implement at least the following methods:
- \_\_init\_\_
- inputs_attrs
- outputs_attr
- build
The methods that may be overridden are:
- epoch_inputs_attrs
- batch_postprocess
- epoch_postprocess
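As a minimal illustration of the default batch post-processing described above, the snippet below replays the result-buffering logic of `batch_postprocess` on a fake `rt_outputs` dict. Note that `TinyHead` is a stripped-down stand-in written for this sketch, not the real `paddlepalm` class:

```python
import copy

class TinyHead(object):
    """Stand-in reproducing only the result-buffering logic of base_head.Head."""

    def __init__(self, phase='train'):
        self._phase = phase
        self._results_buffer = []

    def batch_postprocess(self, rt_outputs):
        # Columnar runtime outputs (name -> batch of values) are re-grouped
        # into one dict per example and appended to the buffer.
        keys = list(rt_outputs.keys())
        vals = [rt_outputs[k] for k in keys]
        if len(set(len(v) for v in vals)) == 1:
            results = [dict(zip(keys, row)) for row in zip(*vals)]
            self._results_buffer.extend(results)
            return results

    def get_results(self):
        return copy.deepcopy(self._results_buffer)

head = TinyHead(phase='predict')
# One fake batch of 3 examples with two fetched outputs
head.batch_postprocess({'label': [0, 1, 1], 'prob': [0.2, 0.9, 0.7]})
print(head.get_results())
# [{'label': 0, 'prob': 0.2}, {'label': 1, 'prob': 0.9}, {'label': 1, 'prob': 0.7}]
```

This shows why every step-level output declared in `outputs_attr` must have the same leading (batch) dimension: the default post-processing zips the fetched columns together sample by sample.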
### Customizing a backbone
The interface class of backbone is located at `paddlepalm/backbone/base_backbone.py`.
The interface class is defined as follows:
```python
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class Backbone(object):
    """interface of backbone model."""

    def __init__(self, phase):
        """Constructs a backbone network; the constructor must accept at least a `phase` argument.

        Note: an implementation of this constructor must call the base-class constructor
        to create the necessary framework-built-in member variables.

        Args:
            phase: str. The running phase in which the backbone is used; currently the
                training phase `train` and the prediction phase `predict` are supported.
        """
        self._phase = phase

    @property
    def inputs_attr(self):
        """Describes the attributes of the inputs that the backbone needs from the
        reader, including each object's name, shape and dtype. When an object is a
        scalar type (e.g., str, int, float), set its shape to an empty list []; when a
        dimension has variable length, set that dimension of the shape to -1.

        Return:
            dict. Attribute descriptions of the inputs. For example, for text
            classification and matching tasks, the reader objects that the BERT backbone
            depends on mainly include:
            {"token_ids": ([-1, max_len], 'int64'),
             "input_ids": ([-1, max_len], 'int64'),
             "segment_ids": ([-1, max_len], 'int64'),
             "input_mask": ([-1, max_len], 'float32')}"""
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Describes the attributes of the backbone's outputs, including each object's
        name, shape and dtype. When an object is a scalar type (e.g., str, int, float),
        set its shape to an empty list []; when a dimension has variable length, set
        that dimension of the shape to -1.

        Return:
            dict. Attribute descriptions of the outputs. For example, for text
            classification and matching tasks, the outputs of the BERT backbone may
            include:
            {"word_emb": ([-1, max_seqlen, word_emb_size], 'float32'),
             "sentence_emb": ([-1, hidden_size], 'float32'),
             "sim_vec": ([-1, hidden_size], 'float32')}"""
        raise NotImplementedError()

    def build(self, inputs):
        """Builds the computation graph of the backbone. Maps the static-graph Variable
        inputs described by inputs_attr into static-graph Variable outputs conforming to
        outputs_attr.

        Args:
            inputs: dict. Maps the object names in inputs_attr to graph Variables;
                inputs contains at least the objects defined in inputs_attr.
        Return:
            the graph Variables to output. They are added to the fetch_list, so their
            runtime values are computed at each training/inference step and passed to
            the postprocess methods for user handling.
        """
        raise NotImplementedError()
```
Based on the base class, a brand-new Backbone must implement at least the following methods:
- \_\_init\_\_
- inputs_attr
- outputs_attr
- build
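For orientation, a skeleton of a custom backbone might look like the following. This is a pure-Python sketch with illustrative names (`BoWBackbone` is not part of `paddlepalm`); a real `build` would create PaddlePaddle static-graph Variables, which is elided here:

```python
class BoWBackbone(object):
    """Sketch of a minimal bag-of-words style backbone implementing the interface."""

    def __init__(self, max_len=128, emb_size=64, phase='train'):
        self._phase = phase
        self._max_len = max_len
        self._emb_size = emb_size

    @property
    def inputs_attr(self):
        # Objects required from the reader: name -> (shape, dtype)
        return {'token_ids': ([-1, self._max_len], 'int64')}

    @property
    def outputs_attr(self):
        # Objects exposed to downstream task heads
        return {'word_emb': ([-1, self._max_len, self._emb_size], 'float32'),
                'sentence_emb': ([-1, self._emb_size], 'float32')}

    def build(self, inputs):
        # A real implementation would map inputs['token_ids'] through an
        # embedding lookup and a pooling op here and return the resulting
        # graph Variables, keyed exactly as declared in outputs_attr.
        raise NotImplementedError()

bb = BoWBackbone(max_len=16, emb_size=8)
print(bb.inputs_attr)    # {'token_ids': ([-1, 16], 'int64')}
```

The `inputs_attr`/`outputs_attr` declarations are the contract the framework checks: the reader registers against `inputs_attr`, and task heads consume the names declared in `outputs_attr`.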
### Customizing a reader
The interface class of reader is located at `paddlepalm/reader/base_reader.py`.
The interface class is defined as follows:
```python
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import copy

class Reader(object):
    """interface of data reader."""

    def __init__(self, phase='train'):
        """Constructs a Reader; the constructor must accept at least a `phase` argument.

        Note: an implementation of this constructor must call the base-class constructor
        to create the necessary framework-built-in member variables.

        Args:
            phase: str. The running phase in which the reader is used; currently the
                training phase `train` and the prediction phase `predict` are supported.
        """
        self._phase = phase
        self._batch_size = None
        self._num_epochs = 1
        self._register = set()
        self._registered_backbone = None

    @classmethod
    def create_register(self):
        return set()

    def clone(self, phase='train'):
        """Makes a copy of this reader object."""
        if phase == self._phase:
            return copy(self)
        else:
            ret = copy(self)
            ret._phase = phase
            return ret

    def require_attr(self, attr_name):
        """Adds an object to produce to the register.

        Args:
            attr_name: the name of the object to produce, e.g., 'segment_ids'.
        """
        self._register.add(attr_name)

    def register_with(self, backbone):
        """Registers, according to the backbone's input dependencies, every input
        object the backbone depends on.

        Args:
            backbone: the backbone network to connect with.
        """
        for attr in backbone.inputs_attr:
            self.require_attr(attr)
        self._registered_backbone = backbone

    def get_registered_backbone(self):
        """Returns the backbone registered with this reader."""
        return self._registered_backbone

    def _get_registed_attrs(self, attrs):
        ret = {}
        for i in self._register:
            if i not in attrs:
                raise NotImplementedError('output attr {} is not found in this reader.'.format(i))
            ret[i] = attrs[i]
        return ret

    def load_data(self, input_file, batch_size, num_epochs=None,
                  file_format='tsv', shuffle_train=True):
        """Loads the data on disk into the reader.

        Note: an implementation of this method must also set self._batch_size and
        self._num_epochs.

        Args:
            input_file: path of the dataset file; the file format must satisfy the
                `file_format` argument.
            batch_size: the number of examples yielded by the iterator at a time. Note:
                when multiple GPUs exist in the environment, batch_size must be
                divisible by the number of GPUs.
            num_epochs: the number of passes over the dataset. Defaults to None, which
                means one pass in single-task mode; in multi-task mode this argument is
                set automatically by the upper-level Trainer. Only effective in the
                training phase.
            file_format: file format of the input. Currently supported: tsv. Defaults
                to tsv.
            shuffle_train: whether to shuffle the training examples. Defaults to True.
                Only effective in the training phase.
        """
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        """Describes the attributes of the reader's outputs (the objects it yields),
        including each object's name, shape and dtype. When an object is a scalar type
        (e.g., str, int, float), set its shape to an empty list []; when a dimension has
        variable length, set that dimension of the shape to -1.

        Note: when using a mini-batch gradient descent strategy, a batch_size dimension
        (usually -1) should be set for regular outputs.

        Return:
            dict. Attribute descriptions of the outputs. For example, for text
            classification and matching tasks, the yielded outputs may include the
            following objects (downstream backbones and tasks access them on demand):
            {"token_ids": ([-1, max_len], 'int64'),
             "input_ids": ([-1, max_len], 'int64'),
             "segment_ids": ([-1, max_len], 'int64'),
             "input_mask": ([-1, max_len], 'float32'),
             "label": ([-1], 'int')}
        """
        raise NotImplementedError()

    def _iterator(self):
        """Dataset iteration interface. Note: when iteration reaches the end of the
        dataset, this interface should reset the pointer automatically, i.e., start a
        new pass from the beginning of the dataset.

        Yield:
            dict. The outputs of the current step, conforming to outputs_attr.
        """
        raise NotImplementedError()

    def get_epoch_outputs(self):
        """Returns the outputs produced after each full pass over the dataset."""
        raise NotImplementedError()

    @property
    def num_examples(self):
        """The number of examples in the dataset, i.e., the number of examples generated
        by the iterator in each epoch. Note: when strategies such as sliding windows may
        change the number of examples, this interface should return the actual number at
        runtime."""
        raise NotImplementedError()

    @property
    def num_epochs(self):
        """The number of passes over the dataset."""
        return self._num_epochs
```
On top of the base class, a new Reader must implement at least the following methods:
- \_\_init\_\_
- outputs_attr
- load_data
- _iterator
- num_examples
Methods that may be overridden:
- get_epoch_outputs
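As a hedged sketch of this minimal method set, the contract can be exercised with an in-memory toy reader; the list of labels below stands in for the tsv file that `load_data` would normally take, and is purely illustrative.

```python
# Illustrative sketch: a minimal reader satisfying the contract above,
# yielding batches from an in-memory list instead of a tsv file on disk.
class ToyReader(object):
    def __init__(self, phase='train'):
        self._phase = phase
        self._batch_size = None
        self._num_epochs = 1
        self._examples = []

    def load_data(self, examples, batch_size, num_epochs=None):
        # the real interface takes a file path; a plain list stands in here
        self._examples = list(examples)
        self._batch_size = batch_size
        self._num_epochs = num_epochs if num_epochs is not None else 1

    @property
    def outputs_attr(self):
        # scalar labels batched along a variable-length first dimension
        return {"label": ([-1], 'int64')}

    @property
    def num_examples(self):
        return len(self._examples)

    def _iterator(self):
        for epoch in range(self._num_epochs):
            for i in range(0, len(self._examples), self._batch_size):
                yield {"label": self._examples[i:i + self._batch_size]}

reader = ToyReader()
reader.load_data([0, 1, 1, 0], batch_size=2, num_epochs=2)
batches = list(reader._iterator())
```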
================================================
FILE: examples/classification/README.md
================================================
## Example 1: Classification
This example is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task.
### Step 1: Prepare Pre-trained Model & Dataset
#### Pre-trained Model
The pre-trained model used for this task is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).
Make sure the required pre-trained model has been downloaded into the current folder.
#### Dataset
This example demonstrates with [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.
Download dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.
The dataset file (for training) should have 2 fields, `text_a` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
label text_a
0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
0 XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件!
1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```
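Assuming the header layout shown above (`label` first, then `text_a`), the file can be read with the standard `csv` module. The shortened sample rows below are for illustration only:

```python
# Hedged sketch: parsing the 2-field training tsv format shown above.
import csv
import io

sample = (u"label\ttext_a\n"
          u"1\t<荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书\n"
          u"0\tXP的驱动不好找!\n")
rows = list(csv.DictReader(io.StringIO(sample), delimiter='\t'))
labels = [int(r['label']) for r in rows]
texts = [r['text_a'] for r in rows]
```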
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
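The per-card split described in this note amounts to a divisibility check; the helper name below is ours, not a PALM API.

```python
# Hedged sketch of the per-card batch split: batch_size 64 on 4 visible
# cards gives 16 examples per card each step; an indivisible batch_size
# is rejected, mirroring the requirement stated above.
def per_card_batch_size(batch_size, num_cards):
    if batch_size % num_cards != 0:
        raise ValueError('batch_size must be divisible by the number of cards')
    return batch_size // num_cards
```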
Some logs will be shown below:
```
step 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s
step 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s
step 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s
```
After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```
{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1}
{"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0}
{"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0}
```
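In each prediction line, `probs` is the softmax of `logits` and `label` is the arg-max; this relation can be checked on the first line above:

```python
# Verifying the relation between the fields of a prediction line:
# probs = softmax(logits) and label = argmax(probs).
import json
import math

line = json.loads('{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], '
                  '"probs": [0.29290086030960083, 0.7070990800857544], "label": 1}')
exps = [math.exp(x) for x in line["logits"]]
probs = [e / sum(exps) for e in exps]
label = probs.index(max(probs))
```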
### Step 3: Evaluate
Once you have the prediction, you can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data num: 1200
accuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578
```
================================================
FILE: examples/classification/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB=urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
def download(src, url):
def _reporthook(count, chunk_size, total_size):
bytes_so_far = count * chunk_size
percent = float(bytes_so_far) / float(total_size)
if percent > 1:
percent = 1
print('\r>> Downloading... {:.1%}'.format(percent), end="")
URLLIB.urlretrieve(url, src, reporthook=_reporthook)
abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)
tar = tarfile.open(download_path)
tar.extractall(target_dir)
tar.close()
os.remove(download_path)
abs_path = os.path.abspath(__file__)
dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):
os.makedirs(dst_dir)
for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):
shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)
shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")
================================================
FILE: examples/classification/evaluate.py
================================================
# -*- coding: utf-8 -*-
import json
import numpy as np
def accuracy(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
return (preds == labels).mean()
def pre_recall_f1(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
# recall=TP/(TP+FN)
tp = np.sum((labels == '1') & (preds == '1'))
fp = np.sum((labels == '0') & (preds == '1'))
fn = np.sum((labels == '1') & (preds == '0'))
r = tp * 1.0 / (tp + fn)
# Precision=TP/(TP+FP)
p = tp * 1.0 / (tp + fp)
epsilon = 1e-31
f1 = 2 * p * r / (p+r+epsilon)
return p, r, f1
def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
if eval_phase == 'test':
data_dir="./data/test.tsv"
elif eval_phase == 'dev':
data_dir="./data/dev.tsv"
else:
assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'
labels = []
with open(data_dir, "r") as file:
for line in file:
line = line.split("\t")
label = line[0]
if label == 'label':
continue
labels.append(str(label))
preds = []
with open(res_dir, "r") as file:
for line in file:
line = json.loads(line)
pred = line['label']
preds.append(str(pred))
assert len(labels) == len(preds), "number of predictions doesn't match number of labels"
print('data num: {}'.format(len(labels)))
p, r, f1 = pre_recall_f1(preds, labels)
print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1))
res_evaluate()
================================================
FILE: examples/classification/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
if __name__ == '__main__':
# configs
max_seqlen = 256
batch_size = 8
num_epochs = 10
lr = 5e-5
weight_decay = 0.01
vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'
train_file = './data/train.tsv'
predict_file = './data/test.tsv'
config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
input_dim = config['hidden_size']
num_classes = 2
dropout_prob = 0.1
random_seed = 1
task_name = 'chnsenticorp'
save_path = './outputs/'
pred_output = './outputs/predict/'
save_type = 'ckpt'
print_steps = 20
pre_params = './pretrain/ERNIE-v1-zh-base/params'
# ----------------------- for training -----------------------
# step 1-1: create readers for training
cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)
# step 1-2: load the training data
cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)
# step 2: create a backbone of the model to extract text features
ernie = palm.backbone.ERNIE.from_config(config)
# step 3: register the backbone in reader
cls_reader.register_with(ernie)
# step 4: create the task output head
cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob)
# step 5-1: create a task trainer
trainer = palm.Trainer(task_name)
# step 5-2: build forward graph with backbone and task head
loss_var = trainer.build_forward(ernie, cls_head)
# step 6-1*: use warmup
n_steps = cls_reader.num_examples * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
# step 6-2: create an optimizer
adam = palm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit prepared reader and data
trainer.fit_reader(cls_reader)
# step 8-1*: load pretrained parameters
trainer.load_pretrain(pre_params)
# step 8-2*: set saver to save model
# save_steps = n_steps
save_steps = 2396
trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type)
# step 8-3: start training
trainer.train(print_steps=print_steps)
# ----------------------- for prediction -----------------------
# step 1-1: create readers for prediction
print('prepare to predict...')
predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
# step 1-2: load the data for prediction
predict_cls_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_cls_reader.register_with(pred_ernie)
# step 4: create the task output head
cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')
# step 5: build forward graph with backbone and task head
trainer.build_predict_forward(pred_ernie, cls_pred_head)
# step 6: load checkpoint
# model_path = './outputs/ckpt.step'+str(save_steps)
model_path = './outputs/ckpt.step'+str(11980)
trainer.load_ckpt(model_path)
# step 7: fit prepared reader and data
trainer.fit_reader(predict_cls_reader, phase='predict')
# step 8: predict
print('predicting..')
trainer.predict(print_steps=print_steps, output_dir=pred_output)
================================================
FILE: examples/matching/README.md
================================================
## Example 2: Matching
This task is a sentence pair matching task. The following sections detail model preparation, dataset preparation, and how to run the task with PaddlePALM.
### Step 1: Prepare Pre-trained Models & Datasets
#### Download Pre-trained Model
The pre-trained model used for this task is [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).
Make sure the required pre-trained model has been downloaded into the current folder.
#### Dataset
This example takes the [Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset as the testbed for matching.
Download dataset:
```shell
python download.py
```
After the dataset is downloaded, you should convert the data format for training:
```shell
python process.py data/quora_duplicate_questions.tsv data/train.tsv data/test.tsv
```
If everything goes well, there will be a folder named `data/` created with all the converted data files in it.
The dataset file (for training) should have 3 fields, `text_a`, `text_b` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
text_a text_b label
How can the arrangement of corynebacterium xerosis be described? How would you describe waves? 0
How do you fix a Google Play Store account that isn't working? What can cause the Google Play store to not open? How are such probelms fixed? 1
Which is the best earphone under 1000? What are the best earphones under 1k? 1
What are the differences between the Dell Inspiron 3000, 5000, and 7000 series laptops? "Should I buy an Apple MacBook Pro 15"" or a Dell Inspiron 17 5000 series?" 0
```
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
Some logs will be shown below:
```
step 20/49087 (epoch 0), loss: 1.079, speed: 3.48 steps/s
step 40/49087 (epoch 0), loss: 1.251, speed: 5.18 steps/s
step 60/49087 (epoch 0), loss: 1.193, speed: 5.04 steps/s
```
After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```
{"index": 0, "logits": [-0.32688724994659424, -0.8568955063819885], "probs": [0.629485011100769, 0.3705149292945862], "label": 0}
{"index": 1, "logits": [-0.2735646963119507, -0.7983021140098572], "probs": [0.6282548904418945, 0.37174513936042786], "label": 0}
{"index": 2, "logits": [-0.3381381630897522, -0.8614270091056824], "probs": [0.6279165148735046, 0.37208351492881775], "label": 0}
```
### Step 3: Evaluate
Once you have the prediction, you can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data num: 4300
accuracy: 0.8619, precision: 0.8061, recall: 0.8377, f1: 0.8216
```
================================================
FILE: examples/matching/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import sys
import urllib
URLLIB=urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
def download(src, url):
def _reporthook(count, chunk_size, total_size):
bytes_so_far = count * chunk_size
percent = float(bytes_so_far) / float(total_size)
if percent > 1:
percent = 1
print('\r>> Downloading... {:.1%}'.format(percent), end="")
URLLIB.urlretrieve(url, src, reporthook=_reporthook)
abs_path = os.path.abspath(__file__)
data_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
os.makedirs(data_dir)
download_url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
download_path = os.path.join(data_dir, "quora_duplicate_questions.tsv")
download(download_path, download_url)
print(" done!")
================================================
FILE: examples/matching/evaluate.py
================================================
# -*- coding: utf-8 -*-
import json
import numpy as np
def accuracy(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
return (preds == labels).mean()
def pre_recall_f1(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
# recall=TP/(TP+FN)
tp = np.sum((labels == '1') & (preds == '1'))
fp = np.sum((labels == '0') & (preds == '1'))
fn = np.sum((labels == '1') & (preds == '0'))
r = tp * 1.0 / (tp + fn)
# Precision=TP/(TP+FP)
p = tp * 1.0 / (tp + fp)
epsilon = 1e-31
f1 = 2 * p * r / (p+r+epsilon)
return p, r, f1
def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
if eval_phase == 'test':
data_dir="./data/test.tsv"
elif eval_phase == 'dev':
data_dir="./data/dev.tsv"
else:
assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'
labels = []
with open(data_dir, "r") as file:
for line in file:
line = line.split("\t")
label = line[2][:-1]
if label == 'label':
continue
labels.append(str(label))
preds = []
with open(res_dir, "r") as file:
for line in file:
line = json.loads(line)
pred = line['label']
preds.append(str(pred))
assert len(labels) == len(preds), "number of predictions ({}) doesn't match number of labels ({})".format(len(preds), len(labels))
print('data num: {}'.format(len(labels)))
p, r, f1 = pre_recall_f1(preds, labels)
print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1))
res_evaluate()
================================================
FILE: examples/matching/process.py
================================================
# -*- coding: utf-8 -*-
import sys
import os
if len(sys.argv) != 4:
print("usage: python process.py <raw_tsv> <train_out> <test_out>")
exit(1)
data_dir = sys.argv[1]
if not os.path.exists(data_dir):
print("%s not exists" % data_dir)
exit(0)
train_dir = sys.argv[2]
train_file = open(train_dir, "w")
train_file.write("text_a\ttext_b\tlabel\n")
test_dir = sys.argv[3]
test_file = open(test_dir, "w")
test_file.write("text_a\ttext_b\tlabel\n")
with open(data_dir, "r") as file:
before = ""
cnt = 0
for line in file:
line = line.strip("\n")
line_t = line.split("\t")
flag = 0
if len(line_t) < 6:
if flag:
flag = 0
out_line = "{}{}\n".format(out_line, line)
else:
flag = 1
outline = "{}".format(line)
continue
else:
out_line = "{}\t{}\t{}\n".format(line_t[3], line_t[4], line_t[5])
cnt += 1
if 2 <= cnt <= 4301:
test_file.write(out_line)
if 4301 <= cnt <= 104301:
train_file.write(out_line)
train_file.close()
test_file.close()
================================================
FILE: examples/matching/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
if __name__ == '__main__':
# configs
max_seqlen = 128
batch_size = 16
num_epochs = 3
lr = 3e-5
weight_decay = 0.0
num_classes = 2
random_seed = 1
dropout_prob = 0.1
save_path = './outputs/'
save_type = 'ckpt'
pred_model_path = './outputs/ckpt.step'+str(18732)
print_steps = 50
pred_output = './outputs/predict/'
pre_params = './pretrain/ERNIE-v2-en-base/params'
task_name = 'Quora Question Pairs matching'
vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
train_file = './data/train.tsv'
predict_file = './data/test.tsv'
config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
# ----------------------- for training -----------------------
# step 1-1: create readers for training
match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed)
# step 1-2: load the training data
match_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)
# step 2: create a backbone of the model to extract text features
ernie = palm.backbone.ERNIE.from_config(config)
# step 3: register the backbone in reader
match_reader.register_with(ernie)
# step 4: create the task output head
match_head = palm.head.Match(num_classes, input_dim, dropout_prob)
# step 5-1: create a task trainer
trainer = palm.Trainer(task_name)
# step 5-2: build forward graph with backbone and task head
loss_var = trainer.build_forward(ernie, match_head)
# step 6-1*: use warmup
n_steps = match_reader.num_examples * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
print('total_steps: {}'.format(n_steps))
print('warmup_steps: {}'.format(warmup_steps))
sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
# step 6-2: create a optimizer
adam = palm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit prepared reader and data
trainer.fit_reader(match_reader)
# step 8-1*: load pretrained parameters
trainer.load_pretrain(pre_params, False)
# step 8-2*: set saver to save model
# save_steps = n_steps-16
save_steps = 6244
trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
# step 8-3: start training
trainer.train(print_steps=print_steps)
# ----------------------- for prediction -----------------------
# step 1-1: create readers for prediction
print('prepare to predict...')
predict_match_reader = palm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
# step 1-2: load the data for prediction
predict_match_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_match_reader.register_with(pred_ernie)
# step 4: create the task output head
match_pred_head = palm.head.Match(num_classes, input_dim, phase='predict')
# step 5: build forward graph with backbone and task head
trainer.build_predict_forward(pred_ernie, match_pred_head)
# step 6: load checkpoint
trainer.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
trainer.fit_reader(predict_match_reader, phase='predict')
# step 8: predict
print('predicting..')
trainer.predict(print_steps=print_steps, output_dir=pred_output)
================================================
FILE: examples/mrc/README.md
================================================
## Example 4: Machine Reading Comprehension
This example is a machine reading comprehension task. The following sections detail model preparation, dataset preparation, and how to run the task.
### Step 1: Prepare Pre-trained Models & Datasets
#### Pre-trained Model
The pre-trained model used for this task is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).
Make sure the required pre-trained model has been downloaded into the current folder.
#### Dataset
This task uses the `CMRC2018` dataset, a span-extraction machine reading comprehension evaluation organized by the Chinese Information Processing Society of China.
Download dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.
Here is an example record:
```json
"paragraphs": [
{
"id": "TRAIN_36",
"context": "NGC 6231是一个位于天蝎座的疏散星团,天球座标为赤经16时54分,赤纬-41度48分,视觉观测大小约45角分,亮度约2.6视星等,距地球5900光年。NGC 6231年龄约为三百二十万年,是一个非常年轻的星团,星团内的最亮星是5等的天蝎座 ζ1星。用双筒望远镜或小型望远镜就能看到个别的行星。NGC 6231在1654年被意大利天文学家乔瓦尼·巴蒂斯特·霍迪尔纳(Giovanni Battista Hodierna)以Luminosae的名字首次纪录在星表中,但是未见记载于夏尔·梅西耶的天体列表和威廉·赫歇尔的深空天体目录。这个天体在1678年被爱德蒙·哈雷(I.7)、1745年被夏西亚科斯(Jean-Phillippe Loys de Cheseaux)(9)、1751年被尼可拉·路易·拉卡伊(II.13)分别再次独立发现。",
"qas": [
{
"question": "NGC 6231的经纬度是多少?",
"id": "TRAIN_36_QUERY_0",
"answers": [
{
"text": "赤经16时54分,赤纬-41度48分",
"answer_start": 27
}
]
}
}
```
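Each record nests paragraphs, qas and answers, with `answer_start` giving the answer's character offset into `context`. A hedged sketch of walking such a record follows; the tiny inline record is made up for illustration only.

```python
# Illustrative sketch: iterating a CMRC2018-style record and verifying
# that each answer_start offset really points at the answer text.
record = {
    "paragraphs": [{
        "id": "TRAIN_36",
        "context": "NGC 6231是一个位于天蝎座的疏散星团",
        "qas": [{
            "question": "这个疏散星团的名字是什么?",
            "id": "TRAIN_36_QUERY_0",
            "answers": [{"text": "NGC 6231", "answer_start": 0}]
        }]
    }]
}
pairs = []
for para in record["paragraphs"]:
    for qa in para["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            span = para["context"][start:start + len(ans["text"])]
            assert span == ans["text"]  # the offset is character-based
            pairs.append((qa["id"], ans["text"]))
```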
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
Some logs will be shown below:
```
step 1/1515 (epoch 0), loss: 6.251, speed: 0.31 steps/s
step 2/1515 (epoch 0), loss: 6.206, speed: 0.80 steps/s
step 3/1515 (epoch 0), loss: 6.172, speed: 0.86 steps/s
```
After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```json
{
"DEV_0_QUERY_0": "光 荣 和 ω-force 开 发",
"DEV_0_QUERY_1": "任 天 堂 游 戏 谜 之 村 雨 城",
"DEV_0_QUERY_2": "战 史 演 武 」&「 争 霸 演 武 」。",
"DEV_1_QUERY_0": "大 陆 传 统 器 乐 及 戏 曲 里 面 常 用 的 打 击 乐 记 谱 方 法 , 以 中 文 字 的 声 音 模 拟 敲 击 乐 的 声 音 , 纪 录 打 击 乐 的 各 种 不 同 的 演 奏 方 法 。",
"DEV_1_QUERY_1": "「 锣 鼓 点",
"DEV_1_QUERY_2": "锣 鼓 的 运 用 有 约 定 俗 成 的 程 式 , 依 照 角 色 行 当 的 身 份 、 性 格 、 情 绪 以 及 环 境 , 配 合 相 应 的 锣 鼓 点",
"DEV_1_QUERY_3": "鼓 、 锣 、 钹 和 板 四 类 型",
"DEV_2_QUERY_0": "364.6 公 里",
}
```
### Step 3: Evaluate
#### Library Dependencies
Before the evaluation, you need to install `nltk` and download the `punkt` tokenizer for nltk:
```shell
pip install nltk
python -m nltk.downloader punkt
```
#### Evaluate
You can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data_num: 3219
em_score: 0.6434, f1: 0.8518
```
================================================
FILE: examples/mrc/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB=urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
def download(src, url):
def _reporthook(count, chunk_size, total_size):
bytes_so_far = count * chunk_size
percent = float(bytes_so_far) / float(total_size)
if percent > 1:
percent = 1
print('\r>> Downloading... {:.1%}'.format(percent), end="")
URLLIB.urlretrieve(url, src, reporthook=_reporthook)
abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)
tar = tarfile.open(download_path)
tar.extractall(target_dir)
tar.close()
os.remove(download_path)
abs_path = os.path.abspath(__file__)
dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.exists(dst_dir) or not os.path.isdir(dst_dir):
os.makedirs(dst_dir)
for file in os.listdir(os.path.join(target_dir, 'task_data', 'cmrc2018')):
shutil.move(os.path.join(target_dir, 'task_data', 'cmrc2018', file), dst_dir)
shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")
================================================
FILE: examples/mrc/evaluate.py
================================================
# -*- coding: utf-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Evaluation script for CMRC 2018
version: v5
Note:
v5 formatted output, add usage description
v4 fixed segmentation issues
'''
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from collections import Counter, OrderedDict
import string
import re
import argparse
import json
import sys
import nltk
# segment text that mixes Chinese and English
def mixed_segmentation(in_str, rm_punc=False):
in_str = in_str.lower().strip()
segs_out = []
temp_str = ""
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』',' '
]
for char in in_str:
if rm_punc and char in sp_char:
continue
if re.search(r'[\u4e00-\u9fa5]', char) or char in sp_char:
if temp_str != "":
ss = nltk.word_tokenize(temp_str)
segs_out.extend(ss)
temp_str = ""
segs_out.append(char)
else:
temp_str += char
#handling last part
if temp_str != "":
ss = nltk.word_tokenize(temp_str)
segs_out.extend(ss)
return segs_out
# remove punctuation
def remove_punctuation(in_str):
in_str = in_str.lower().strip()
sp_char = [
'-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', ',', '。', ':',
'?', '!', '“', '”', ';', '’', '《', '》', '……', '·', '、', '「', '」', '(',
')', '-', '~', '『', '』', ' '
]
out_segs = []
for char in in_str:
if char in sp_char:
continue
else:
out_segs.append(char)
return ''.join(out_segs)
# find longest common string
def find_lcs(s1, s2):
m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]
mmax = 0
p = 0
for i in range(len(s1)):
for j in range(len(s2)):
if s1[i] == s2[j]:
m[i + 1][j + 1] = m[i][j] + 1
if m[i + 1][j + 1] > mmax:
mmax = m[i + 1][j + 1]
p = i + 1
return s1[p - mmax:p], mmax
def evaluate(ground_truth_file, prediction_file):
f1 = 0
em = 0
total_count = 0
skip_count = 0
for instances in ground_truth_file["data"]:
for instance in instances["paragraphs"]:
context_text = instance['context'].strip()
for qas in instance['qas']:
total_count += 1
query_id = qas['id'].strip()
query_text = qas['question'].strip()
answers = [ans["text"] for ans in qas["answers"]]
if query_id not in prediction_file:
print('Unanswered question: {}\n'.format(
query_id))
skip_count += 1
continue
prediction = prediction_file[query_id]
f1 += calc_f1_score(answers, prediction)
em += calc_em_score(answers, prediction)
f1_score = f1 / total_count
em_score = em / total_count
return f1_score, em_score, total_count, skip_count
def calc_f1_score(answers, prediction):
f1_scores = []
for ans in answers:
ans_segs = mixed_segmentation(ans, rm_punc=True)
prediction_segs = mixed_segmentation(prediction, rm_punc=True)
lcs, lcs_len = find_lcs(ans_segs, prediction_segs)
if lcs_len == 0:
f1_scores.append(0)
continue
precision = 1.0 * lcs_len / len(prediction_segs)
recall = 1.0 * lcs_len / len(ans_segs)
f1 = (2 * precision * recall) / (precision + recall)
f1_scores.append(f1)
return max(f1_scores)
def calc_em_score(answers, prediction):
em = 0
for ans in answers:
ans_ = remove_punctuation(ans)
prediction_ = remove_punctuation(prediction)
if ans_ == prediction_:
em = 1
break
return em
def eval_file(dataset_file, prediction_file):
ground_truth_file = json.load(open(dataset_file, 'r'))
prediction_file = json.load(open(prediction_file, 'r'))
F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
AVG = (EM + F1) * 0.5
return EM, F1, AVG, TOTAL
if __name__ == '__main__':
EM, F1, AVG, TOTAL = eval_file("data/dev.json", "outputs/predict/predictions.json")
print('data_num: {}'.format(TOTAL))
print('em_score: {:.4f}, f1: {:.4f}'.format(EM, F1))
================================================
FILE: examples/mrc/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
if __name__ == '__main__':
# configs
max_seqlen = 512
batch_size = 8
num_epochs = 2
lr = 3e-5
doc_stride = 128
max_query_len = 64
max_ans_len = 128
weight_decay = 0.01
print_steps = 20
vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'
do_lower_case = True
train_file = './data/train.json'
predict_file = './data/dev.json'
save_path = './outputs/'
pred_output = './outputs/predict/'
save_type = 'ckpt'
task_name = 'cmrc2018'
pre_params = './pretrain/ERNIE-v1-zh-base/params'
config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
# ----------------------- for training -----------------------
# step 1-1: create readers for training
mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case)
# step 1-2: load the training data
mrc_reader.load_data(train_file, file_format='json', num_epochs=num_epochs, batch_size=batch_size)
# step 2: create a backbone of the model to extract text features
ernie = palm.backbone.ERNIE.from_config(config)
# step 3: register the backbone in reader
mrc_reader.register_with(ernie)
# step 4: create the task output head
mrc_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len)
# step 5-1: create a task trainer
trainer = palm.Trainer(task_name)
# step 5-2: build forward graph with backbone and task head
loss_var = trainer.build_forward(ernie, mrc_head)
# step 6-1*: use warmup
n_steps = mrc_reader.num_examples * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: create an optimizer
adam = palm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit prepared reader and data
trainer.fit_reader(mrc_reader)
# step 8-1*: load pretrained parameters
trainer.load_pretrain(pre_params)
# step 8-2*: set saver to save model
save_steps = 3040
trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
# step 8-3: start training
trainer.train(print_steps=print_steps)
# ----------------------- for prediction -----------------------
# step 1-1: create readers for prediction
predict_mrc_reader = palm.reader.MRCReader(vocab_path, max_seqlen, max_query_len, doc_stride, do_lower_case=do_lower_case, phase='predict')
    # step 1-2: load the prediction data
predict_mrc_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_mrc_reader.register_with(pred_ernie)
# step 4: create the task output head
mrc_pred_head = palm.head.MRC(max_query_len, config['hidden_size'], do_lower_case=do_lower_case, max_ans_len=max_ans_len, phase='predict')
# step 5: build forward graph with backbone and task head
trainer.build_predict_forward(pred_ernie, mrc_pred_head)
# step 6: load checkpoint
    pred_model_path = './outputs/ckpt.step' + str(save_steps)
trainer.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
trainer.fit_reader(predict_mrc_reader, phase='predict')
# step 8: predict
print('predicting..')
trainer.predict(print_steps=print_steps, output_dir="outputs/predict")
================================================
FILE: examples/multi-task/README.md
================================================
## Example 6: Joint Training of Dialogue Intent Recognition and Slot Filling
This example demonstrates the joint training of dialogue intent recognition and slot filling. Intent recognition can be regarded as a text classification task, and slot filling as a sequence labeling task. Both classification and sequence labeling are built into PaddlePALM.
### Step 1: Prepare Pre-trained Models & Datasets
#### Pre-trained Model
We prepare [ERNIE-v2-en-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api) as our pre-trained model for this example.
Make sure you have downloaded `ERNIE` to the current folder.
#### Dataset
Here we use the `Airline Travel Information System` (ATIS) dataset as our testbed.
Download dataset:
```shell
python download.py
```
After the dataset is downloaded, you should convert the data format for training:
```shell
python process.py
```
If everything goes well, there will be a folder named `data/atis/` created with all the data in it.
Here are some examples:
`data/atis/atis_slot/train.tsv` :
```
text_a label
i want to fly from boston at 838 am and arrive in denver at 1110 in the morning O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day
what flights are available from pittsburgh to baltimore on thursday morning O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day
what is the arrival time in san francisco for the 755 am flight leaving washington O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name
cheapest airfare from tacoma to orlando B-cost_relative O O B-fromloc.city_name O B-toloc.city_name
```
`data/atis/atis_intent/train.tsv` :
```
label text_a
0 i want to fly from boston at 838 am and arrive in denver at 1110 in the morning
0 what flights are available from pittsburgh to baltimore on thursday morning
1 what is the arrival time in san francisco for the 755 am flight leaving washington
2 cheapest airfare from tacoma to orlando
```
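The two-column layout above can be parsed with a few lines of Python. This is a minimal sketch for the human-readable, space-separated layout shown in this README; note that the files actually written by `process.py` join in-column items with `'\x02'` rather than spaces, and `parse_slot_line` is a name of our own, not part of PaddlePALM:

```python
def parse_slot_line(line):
    """Split one slot-filling TSV line into (tokens, tags).

    Assumes the space-separated layout shown above, with a single tab
    between the text column and the label column.
    """
    text_a, label = line.rstrip("\n").split("\t")
    tokens = text_a.split(" ")
    tags = label.split(" ")
    # sequence labeling requires exactly one tag per token
    assert len(tokens) == len(tags), "one tag per token"
    return tokens, tags
```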
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-trained model and the dataset required for the task, run:
```shell
python run.py
```
If you want to specify a gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
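The splitting rule in the note above can be sketched as a small helper (illustrative only; `per_card_batch_size` is not part of the PaddlePALM API):

```python
def per_card_batch_size(batch_size, num_gpus):
    """Per-card batch size under the multi-gpu splitting rule described
    above; raises if batch_size is not divisible by the card count."""
    if batch_size % num_gpus != 0:
        raise ValueError("batch_size must be divisible by the number of cards")
    return batch_size // num_gpus
```

For example, with `batch_size = 64` and `CUDA_VISIBLE_DEVICES=0,1,2,3`, each card receives 16 examples per step.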
Some logs will be shown below:
```
global step: 5, slot: step 3/309 (epoch 0), loss: 68.965, speed: 0.58 steps/s
global step: 10, intent: step 3/311 (epoch 0), loss: 3.407, speed: 8.76 steps/s
global step: 15, slot: step 12/309 (epoch 0), loss: 54.611, speed: 1.21 steps/s
global step: 20, intent: step 7/311 (epoch 0), loss: 3.487, speed: 10.28 steps/s
```
After the run, you can view the saved models in the `outputs/` folder.
If you want to use the trained model to predict the `atis_slot & atis_intent` data, run:
```shell
python predict_slot.py
python predict_intent.py
```
If you want to specify a gpu or use multiple gpus for prediction, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python predict_slot.py
CUDA_VISIBLE_DEVICES=0,1 python predict_intent.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
After the run, you can view the predictions in the `outputs/predict-slot` folder and `outputs/predict-intent` folder. Here are some examples of predictions:
`atis_slot`:
```
[129, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 1, 1, 1, 1, 21, 21, 68, 129]
[129, 1, 39, 37, 1, 1, 1, 1, 1, 2, 1, 5, 19, 1, 23, 3, 4, 129, 129, 129, 129, 129]
[129, 1, 39, 37, 1, 1, 1, 1, 1, 1, 2, 1, 5, 19, 129, 129, 129, 129, 129, 129, 129, 129]
[129, 1, 1, 1, 1, 1, 1, 14, 15, 1, 2, 1, 5, 19, 1, 39, 37, 129, 24, 129, 129, 129]
```
`atis_intent`:
```
{"index": 0, "logits": [9.938603401184082, -0.3914794623851776, -0.050973162055015564, -1.0229418277740479, 0.04799401015043259, -0.9632213115692139, -0.6427211761474609, -1.337939739227295, -0.7969412803649902, -1.4441455602645874, -0.6339573264122009, -1.0393054485321045, -0.9242327213287354, -1.9637483358383179, 0.16733427345752716, -0.5280354619026184, -1.7195699214935303, -2.199411630630493, -1.2833174467086792, -1.3081035614013672, -1.6036226749420166, -1.8527079820632935, -2.289180040359497, -2.267214775085449, -2.2578916549682617, -2.2010505199432373], "probs": [0.999531626701355, 3.26210938510485e-05, 4.585415081237443e-05, 1.7348344044876285e-05, 5.06243304698728e-05, 1.8415948943584226e-05, 2.5373808966833167e-05, 1.266065828531282e-05, 2.174747896788176e-05, 1.1384962817828637e-05, 2.5597169951652177e-05, 1.7066764485207386e-05, 1.914815220516175e-05, 6.771284006390488e-06, 5.70411684748251e-05, 2.8457265216275118e-05, 8.644025911053177e-06, 5.349628736439627e-06, 1.3371440218179487e-05, 1.3044088518654462e-05, 9.706698619993404e-06, 7.5665011536329985e-06, 4.890325726591982e-06, 4.99892985317274e-06, 5.045753368904116e-06, 5.340866664482746e-06], "label": 0}
{"index": 1, "logits": [0.8863624930381775, -2.232290506362915, 8.191509246826172, -0.03161466494202614, -0.9149583578109741, -2.172696352005005, -0.3937145471572876, -0.3954394459724426, 1.5333592891693115, 0.8630291223526001, -0.9684226512908936, -2.722721815109253, -0.0060247331857681274, -0.9865402579307556, 1.6328885555267334, 0.3972966969013214, 0.27919167280197144, -1.4911551475524902, -0.9552251696586609, -0.9169244170188904, -0.810670793056488, -1.5118697881698608, -2.0140435695648193, -1.6299077272415161, -1.8589974641799927, -2.07601261138916], "probs": [0.0006675600307062268, 2.9517297662096098e-05, 0.9932880997657776, 0.0002665741485543549, 0.0001102013120544143, 3.132982965325937e-05, 0.00018559220188762993, 0.00018527248175814748, 0.0012749042361974716, 0.0006521637551486492, 0.00010446414671605453, 1.8075270418194123e-05, 0.0002734838053584099, 0.00010258861584588885, 0.0014083238784223795, 0.00040934717981144786, 0.00036374686169438064, 6.193659646669403e-05, 0.00010585198469925672, 0.00010998480865964666, 0.0001223145518451929, 6.0666847275570035e-05, 3.671637750812806e-05, 5.391232480178587e-05, 4.287416595616378e-05, 3.4510172554291785e-05], "label": 0}
{"index": 2, "logits": [9.789957046508789, -0.1730862706899643, -0.7198237776756287, -1.0460278987884521, 0.23521068692207336, -0.5075851678848267, -0.44724929332733154, -1.2945927381515503, -0.6984466314315796, -1.8749892711639404, -0.4631594121456146, -0.6256799697875977, -1.0252169370651245, -1.951456069946289, -0.17572557926177979, -0.6771697402000427, -1.7992591857910156, -2.1457295417785645, -1.4203097820281982, -1.4963451623916626, -1.692310094833374, -1.9219486713409424, -2.2533645629882812, -2.430952310562134, -2.3094685077667236, -2.2399914264678955], "probs": [0.9994625449180603, 4.708383130491711e-05, 2.725377635215409e-05, 1.9667899323394522e-05, 7.082601223373786e-05, 3.3697724575176835e-05, 3.579350595828146e-05, 1.5339375750045292e-05, 2.784266871458385e-05, 8.58508519741008e-06, 3.522853512549773e-05, 2.9944207199150696e-05, 2.0081495677004568e-05, 7.953084605105687e-06, 4.695970710599795e-05, 2.8441407266655006e-05, 9.26048778637778e-06, 6.548832516273251e-06, 1.3527245755540207e-05, 1.2536826943687629e-05, 1.030578732752474e-05, 8.19125762063777e-06, 5.880556273041293e-06, 4.923717369820224e-06, 5.559719284065068e-06, 5.9597273320832755e-06], "label": 0}
{"index": 3, "logits": [9.787659645080566, -0.6223222017288208, -0.03971472755074501, -1.038114070892334, 0.24018540978431702, -0.8904737830162048, -0.7114139795303345, -1.2315020561218262, -0.5120854377746582, -1.4273980855941772, -0.44618460536003113, -1.0241562128067017, -0.9727545380592346, -1.8587366342544556, 0.020689941942691803, -0.6228570342063904, -1.6020199060440063, -2.130260467529297, -1.370570421218872, -1.40530526638031, -1.6782578229904175, -1.94076669216156, -2.2038567066192627, -2.336832284927368, -2.268157720565796, -2.140028953552246], "probs": [0.9994485974311829, 3.0113611501292326e-05, 5.392447565100156e-05, 1.986949791898951e-05, 7.134198676794767e-05, 2.303065048181452e-05, 2.7546762794372626e-05, 1.6375688574044034e-05, 3.362310235388577e-05, 1.3462414244713727e-05, 3.591357381083071e-05, 2.0148761905147694e-05, 2.12115264730528e-05, 8.74570196174318e-06, 5.728216274292208e-05, 3.0097504350123927e-05, 1.1305383850412909e-05, 6.666126409982098e-06, 1.4249604646465741e-05, 1.3763145034317859e-05, 1.0475521776243113e-05, 8.056933438638225e-06, 6.193143690325087e-06, 5.422014055511681e-06, 5.807448815176031e-06, 6.601325367228128e-06], "label": 0}
```
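The `atis_slot` rows above are lists of predicted label ids. A minimal sketch for mapping them back to tag strings, given the tag-to-id dictionary that `process.py` stores in `data/atis/atis_slot/label_map.json` (the helper name `ids_to_tags` is ours):

```python
def ids_to_tags(pred_ids, tag2id):
    """Map a row of predicted label ids back to tag strings.

    `tag2id` is the tag -> id mapping stored in label_map.json.
    """
    # invert the mapping: id -> tag
    id2tag = {i: tag for tag, i in tag2id.items()}
    return [id2tag[i] for i in pred_ids]
```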
### Step 3: Evaluate
Once you have the predictions, you can run the evaluation script to evaluate the model:
```shell
python evaluate_slot.py
python evaluate_intent.py
```
The evaluation results are as follows:
`atis_slot`:
```
data num: 891
f1: 0.8934
```
`atis_intent`:
```
data num: 893
accuracy: 0.7088, precision: 1.0000, recall: 1.0000, f1: 1.0000
```
================================================
FILE: examples/multi-task/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB=urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
def download(src, url):
def _reporthook(count, chunk_size, total_size):
bytes_so_far = count * chunk_size
percent = float(bytes_so_far) / float(total_size)
if percent > 1:
percent = 1
print('\r>> Downloading... {:.1%}'.format(percent), end="")
URLLIB.urlretrieve(url, src, reporthook=_reporthook)
abs_path = os.path.abspath(__file__)
download_url = "https://baidu-nlp.bj.bcebos.com/dmtk_data_1.0.0.tar.gz"
download_path = os.path.join(os.path.dirname(abs_path), "dmtk_data_1.0.0.tar.gz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)
tar = tarfile.open(download_path)
tar.extractall(target_dir)
tar.close()
os.remove(download_path)
shutil.rmtree(os.path.join(target_dir, 'data/dstc2/'))
shutil.rmtree(os.path.join(target_dir, 'data/mrda/'))
shutil.rmtree(os.path.join(target_dir, 'data/multi-woz/'))
shutil.rmtree(os.path.join(target_dir, 'data/swda/'))
shutil.rmtree(os.path.join(target_dir, 'data/udc/'))
print(" done!")
================================================
FILE: examples/multi-task/evaluate_intent.py
================================================
# -*- coding: utf-8 -*-
import json
import numpy as np
def accuracy(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
return (preds == labels).mean()
def pre_recall_f1(preds, labels):
    # NOTE: only label '1' is treated as the positive class (binary
    # precision/recall); for the 26-class intent task these metrics
    # only reflect performance on class '1'.
    preds = np.array(preds)
    labels = np.array(labels)
    # recall = TP / (TP + FN)
    tp = np.sum((labels == '1') & (preds == '1'))
    fp = np.sum((labels == '0') & (preds == '1'))
    fn = np.sum((labels == '1') & (preds == '0'))
r = tp * 1.0 / (tp + fn)
# Precision=TP/(TP+FP)
p = tp * 1.0 / (tp + fp)
epsilon = 1e-31
f1 = 2 * p * r / (p+r+epsilon)
return p, r, f1
def res_evaluate(res_dir="./outputs/predict-intent/predictions.json", eval_phase='test'):
if eval_phase == 'test':
data_dir="./data/atis/atis_intent/test.tsv"
elif eval_phase == 'dev':
        data_dir = "./data/atis/atis_intent/dev.tsv"
else:
assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'
labels = []
with open(data_dir, "r") as file:
first_flag = True
for line in file:
line = line.split("\t")
label = line[0]
if label=='label':
continue
labels.append(str(label))
file.close()
preds = []
with open(res_dir, "r") as file:
for line in file.readlines():
line = json.loads(line)
pred = line['label']
preds.append(str(pred))
file.close()
assert len(labels) == len(preds), "prediction result doesn't match to labels"
print('data num: {}'.format(len(labels)))
p, r, f1 = pre_recall_f1(preds, labels)
print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(accuracy(preds, labels), p, r, f1))
res_evaluate()
================================================
FILE: examples/multi-task/evaluate_slot.py
================================================
# -*- coding: utf-8 -*-
import json
def load_label_map(map_dir="./data/atis/atis_slot/label_map.json"):
"""
:param map_dir: dict indictuing chunk type
:return:
"""
return json.load(open(map_dir, "r"))
def cal_chunk(pred_label, refer_label):
    # token-level micro-F1: a position counts as TP when the predicted
    # tag equals the reference tag (no span/chunk matching despite the name)
tp = dict()
fn = dict()
fp = dict()
for i in range(len(refer_label)):
if refer_label[i] == pred_label[i]:
if refer_label[i] not in tp:
tp[refer_label[i]] = 0
tp[refer_label[i]] += 1
else:
if pred_label[i] not in fp:
fp[pred_label[i]] = 0
fp[pred_label[i]] += 1
if refer_label[i] not in fn:
fn[refer_label[i]] = 0
fn[refer_label[i]] += 1
tp_total = sum(tp.values())
fn_total = sum(fn.values())
fp_total = sum(fp.values())
p_total = float(tp_total) / (tp_total + fp_total)
r_total = float(tp_total) / (tp_total + fn_total)
f_micro = 2 * p_total * r_total / (p_total + r_total)
return f_micro
def res_evaluate(res_dir="./outputs/predict-slot/predictions.json", data_dir="./data/atis/atis_slot/test.tsv"):
label_map = load_label_map()
total_label = []
with open(data_dir, "r") as file:
first_flag = True
for line in file:
if first_flag:
first_flag = False
continue
line = line.strip("\n")
if len(line) == 0:
continue
line = line.split("\t")
if len(line) < 2:
continue
labels = line[1][:-1].split("\x02")
total_label.append(labels)
total_label = [[label_map[j] for j in i] for i in total_label]
total_res = []
with open(res_dir, "r") as file:
cnt = 0
for line in file:
line = line.strip("\n")
if len(line) == 0:
continue
try:
res_arr = json.loads(line)
if len(total_label[cnt]) < len(res_arr):
total_res.append(res_arr[1: 1 + len(total_label[cnt])])
elif len(total_label[cnt]) == len(res_arr):
total_res.append(res_arr)
else:
total_res.append(res_arr)
total_label[cnt] = total_label[cnt][: len(res_arr)]
except:
print("json format error: {}".format(cnt))
print(line)
cnt += 1
total_res_equal = []
total_label_equal = []
assert len(total_label) == len(total_res), "prediction result doesn't match to labels"
for i in range(len(total_label)):
num = len(total_label[i])
total_label_equal.extend(total_label[i])
total_res[i] = total_res[i][:num]
total_res_equal.extend(total_res[i])
f1 = cal_chunk(total_res_equal, total_label_equal)
print('data num: {}'.format(len(total_label)))
print("f1: {:.4f}".format(f1))
res_evaluate()
================================================
FILE: examples/multi-task/joint_predict.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
import numpy as np
if __name__ == '__main__':
# configs
max_seqlen = 128
batch_size = 128
num_epochs = 20
print_steps = 5
lr = 2e-5
num_classes = 130
weight_decay = 0.01
num_classes_intent = 26
dropout_prob = 0.1
random_seed = 0
label_map = './data/atis/atis_slot/label_map.json'
vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
train_slot = './data/atis/atis_slot/train.tsv'
train_intent = './data/atis/atis_intent/train.tsv'
config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
    # ----------------------- for prediction -----------------------
# step 1-1: create readers
slot_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')
intent_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
    # step 1-2: load the data to run prediction on (here, the training files)
slot_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size)
intent_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None)
# step 2: create a backbone of the model to extract text features
ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register readers with ernie backbone
slot_reader.register_with(ernie)
intent_reader.register_with(ernie)
# step 4: create task output heads
slot_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob, phase='predict')
intent_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob, phase='predict')
# step 5-1: create task trainers and multiHeadTrainer
trainer_slot = palm.Trainer("slot", mix_ratio=1.0)
trainer_intent = palm.Trainer("intent", mix_ratio=1.0)
trainer = palm.MultiHeadTrainer([trainer_slot, trainer_intent])
    # step 5-2: build forward graph with backbone and task head
vars = trainer_intent.build_predict_forward(ernie, intent_head)
vars = trainer_slot.build_predict_forward(ernie, slot_head)
    pred_vars = trainer.build_predict_forward()
# load checkpoint
trainer.load_ckpt('outputs/ckpt.step300')
# merge inference readers
joint_iterator = trainer.merge_inference_readers([slot_reader, intent_reader])
# for test
# batch = next(joint_iterator('slot'))
# results = trainer.predict_one_batch('slot', batch)
# batch = next(joint_iterator('intent'))
# results = trainer.predict_one_batch('intent', batch)
# predict slot filling
print('processing slot filling examples...')
print('num examples: '+str(slot_reader.num_examples))
cnt = 0
for batch in joint_iterator('slot'):
cnt += len(trainer.predict_one_batch('slot', batch)['logits'])
if cnt % 1000 <= 128:
print(str(cnt)+'th example processed.')
print(str(cnt)+'th example processed.')
# predict intent recognition
print('processing intent recognition examples...')
print('num examples: '+str(intent_reader.num_examples))
cnt = 0
for batch in joint_iterator('intent'):
cnt += len(trainer.predict_one_batch('intent', batch)['logits'])
if cnt % 1000 <= 128:
print(str(cnt)+'th example processed.')
print(str(cnt)+'th example processed.')
================================================
FILE: examples/multi-task/predict_intent.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
from paddlepalm.distribute import gpu_dev_count
if __name__ == '__main__':
# configs
max_seqlen = 256
batch_size = 16
num_epochs = 6
print_steps = 5
num_classes = 26
vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
predict_file = './data/atis/atis_intent/test.tsv'
save_path = './outputs/'
pred_output = './outputs/predict-intent/'
save_type = 'ckpt'
random_seed = 0
config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
# ----------------------- for prediction -----------------------
# step 1-1: create readers for prediction
print('prepare to predict...')
predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
predict_cls_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_cls_reader.register_with(pred_ernie)
# step 4: create the task output head
cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')
# step 5-1: create a task trainer
trainer = palm.Trainer("intent")
# step 5-2: build forward graph with backbone and task head
trainer.build_predict_forward(pred_ernie, cls_pred_head)
# step 6: load checkpoint
pred_model_path = './outputs/ckpt.step4641'
trainer.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
trainer.fit_reader(predict_cls_reader, phase='predict')
# step 8: predict
print('predicting..')
trainer.predict(print_steps=print_steps, output_dir=pred_output)
================================================
FILE: examples/multi-task/predict_slot.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
from paddlepalm.distribute import gpu_dev_count
if __name__ == '__main__':
# configs
max_seqlen = 256
batch_size = 16
num_epochs = 6
print_steps = 5
num_classes = 130
label_map = './data/atis/atis_slot/label_map.json'
vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
predict_file = './data/atis/atis_slot/test.tsv'
save_path = './outputs/'
pred_output = './outputs/predict-slot/'
save_type = 'ckpt'
random_seed = 0
config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
# ----------------------- for prediction -----------------------
# step 1-1: create readers for prediction
print('prepare to predict...')
predict_seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
predict_seq_label_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_seq_label_reader.register_with(pred_ernie)
# step 4: create the task output head
seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict')
# step 5-1: create a task trainer
trainer_seq_label = palm.Trainer("slot")
# step 5-2: build forward graph with backbone and task head
trainer_seq_label.build_predict_forward(pred_ernie, seq_label_pred_head)
# step 6: load checkpoint
pred_model_path = './outputs/ckpt.step4641'
trainer_seq_label.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
trainer_seq_label.fit_reader(predict_seq_label_reader, phase='predict')
# step 8: predict
print('predicting..')
trainer_seq_label.predict(print_steps=print_steps, output_dir=pred_output)
================================================
FILE: examples/multi-task/process.py
================================================
import os
import json
label_new = "data/atis/atis_slot/label_map.json"
label_old = "data/atis/atis_slot/map_tag_slot_id.txt"
train_old = "data/atis/atis_slot/train.txt"
train_new = "data/atis/atis_slot/train.tsv"
dev_old = "data/atis/atis_slot/dev.txt"
dev_new = "data/atis/atis_slot/dev.tsv"
test_old = "data/atis/atis_slot/test.txt"
test_new = "data/atis/atis_slot/test.tsv"
intent_test = "data/atis/atis_intent/test.tsv"
os.rename("data/atis/atis_intent/test.txt", intent_test)
intent_train = "data/atis/atis_intent/train.tsv"
os.rename("data/atis/atis_intent/train.txt", intent_train)
intent_dev = "data/atis/atis_intent/dev.tsv"
os.rename("data/atis/atis_intent/dev.txt", intent_dev)
with open(intent_dev, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write("label\ttext_a\n"+content)
f.close()
with open(intent_test, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write("label\ttext_a\n"+content)
f.close()
with open(intent_train, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write("label\ttext_a\n"+content)
f.close()
# create the output files (portable alternative to os.mknod, which
# fails on re-runs and on platforms where it needs privileges)
for path in (label_new, train_new, dev_new, test_new):
    open(path, "a").close()
tag = []
id = []
map = {}
with open(label_old, "r") as f:
with open(label_new, "w") as f2:
for line in f.readlines():
line = line.split('\t')
tag.append(line[0])
id.append(int(line[1][:-1]))
map[line[1][:-1]] = line[0]
re = {tag[i]:id[i] for i in range(len(tag))}
re = json.dumps(re)
f2.write(re)
f2.close()
f.close()
def convert(src, dst):
    # rewrite "<space-separated tokens>\t<space-separated label ids>" lines
    # as "text_a\tlabel" rows whose in-column items are joined by '\2'
    with open(src, "r") as f:
        with open(dst, "w") as f2:
            f2.write("text_a\tlabel\n")
            for line in f.readlines():
                line = line.split('\t')
                text = line[0].split(' ')
                label = line[1].split(' ')
                for t in text:
                    f2.write(t)
                    f2.write('\2')
                f2.write('\t')
                for t in label:
                    if t.endswith('\n'):
                        t = t[:-1]
                    f2.write(map[t])
                    f2.write('\2')
                f2.write('\n')
convert(train_old, train_new)
convert(test_old, test_new)
convert(dev_old, dev_new)
os.remove(label_old)
os.remove(train_old)
os.remove(test_old)
os.remove(dev_old)
================================================
FILE: examples/multi-task/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json
if __name__ == '__main__':
# configs
max_seqlen = 128
batch_size = 16
num_epochs = 20
print_steps = 5
lr = 2e-5
num_classes = 130
weight_decay = 0.01
num_classes_intent = 26
dropout_prob = 0.1
random_seed = 0
label_map = './data/atis/atis_slot/label_map.json'
vocab_path = './pretrain/ERNIE-v2-en-base/vocab.txt'
train_slot = './data/atis/atis_slot/train.tsv'
train_intent = './data/atis/atis_intent/train.tsv'
config = json.load(open('./pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
# ----------------------- for training -----------------------
# step 1-1: create readers
seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed)
cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)
# step 1-2: load train data
seq_label_reader.load_data(train_slot, file_format='tsv', num_epochs=None, batch_size=batch_size)
cls_reader.load_data(train_intent, batch_size=batch_size, num_epochs=None)
# step 2: create a backbone of the model to extract text features
ernie = palm.backbone.ERNIE.from_config(config)
# step 3: register readers with ernie backbone
seq_label_reader.register_with(ernie)
cls_reader.register_with(ernie)
# step 4: create task output heads
seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob)
cls_head = palm.head.Classify(num_classes_intent, input_dim, dropout_prob)
# step 5-1: create task trainers and multiHeadTrainer
trainer_seq_label = palm.Trainer("slot", mix_ratio=1.0)
trainer_cls = palm.Trainer("intent", mix_ratio=1.0)
trainer = palm.MultiHeadTrainer([trainer_seq_label, trainer_cls])
    # step 5-2: build forward graph with backbone and task heads
loss1 = trainer_cls.build_forward(ernie, cls_head)
loss2 = trainer_seq_label.build_forward(ernie, seq_label_head)
loss_var = trainer.build_forward()
# step 6-1*: enable warmup for better fine-tuning
n_steps = seq_label_reader.num_examples * 1.5 * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: build an optimizer
adam = palm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward graph
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit readers to trainer
trainer.fit_readers_with_mixratio([seq_label_reader, cls_reader], "slot", num_epochs)
# step 8-1*: load pretrained model
trainer.load_pretrain('./pretrain/ERNIE-v2-en-base')
# step 8-2*: set saver to save models during training
trainer.set_saver(save_path='./outputs/', save_steps=300)
# step 8-3: start training
trainer.train(print_steps=10)
================================================
FILE: examples/predict/README.md
================================================
## Example 5: Prediction
This example demonstrates how to do prediction directly with PaddlePALM. You can initialize the model from a checkpoint, from a pretrained model, or with random initialization. Here we reuse the task and data of example 1, so repeat step 1 of example 1 to prepare the pre-trained model and data.
After you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a gpu or use multiple gpus for prediction, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: On multi-gpu mode, PaddlePALM will automatically split each batch onto the available cards. For example, if the `batch_size` is set 64, and there are 4 cards visible for PaddlePALM, then the batch_size in each card is actually 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **you need to ensure that the set batch_size can be divided by the number of cards.**
Some logs will be shown below:
```
step 1/154, speed: 0.51 steps/s
step 2/154, speed: 3.36 steps/s
step 3/154, speed: 3.48 steps/s
```
After the run, you can view the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```
{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1}
{"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0}
{"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0}
```
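Each line in `predictions.json` is a standalone JSON object, so the file can be consumed line by line. A minimal parsing sketch, using abridged copies of the sample lines above in place of the real file:

```python
import json

# Abridged sample lines, standing in for outputs/predict/predictions.json
lines = [
    '{"index": 0, "logits": [-0.2014, 0.6799], "probs": [0.2929, 0.7071], "label": 1}',
    '{"index": 1, "logits": [0.8594, -0.2974], "probs": [0.7608, 0.2392], "label": 0}',
]

records = [json.loads(line) for line in lines]
labels = [r["label"] for r in records]
print(labels)  # [1, 0]
```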
### Step 3: Evaluate
Once you have the predictions, you can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data num: 1200
accuracy: 0.4758, precision: 0.4730, recall: 0.3026, f1: 0.3691
```
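The metrics above follow the standard binary definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 their harmonic mean). A self-contained sketch with toy predictions, not the real ChnSentiCorp outputs:

```python
def binary_metrics(preds, labels):
    """Accuracy, precision, recall and F1 for 0/1 predictions."""
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    acc = sum(1 for p, l in zip(preds, labels) if p == l) / float(len(labels))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1

acc, p, r, f1 = binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])
print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(acc, p, r, f1))
```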
================================================
FILE: examples/predict/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request

def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")
    URLLIB.urlretrieve(url, src, reporthook=_reporthook)

abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)

tar = tarfile.open(download_path)
tar.extractall(target_dir)
os.remove(download_path)

dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)
for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):
    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)
shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")
================================================
FILE: examples/predict/evaluate.py
================================================
# -*- coding: utf-8 -*-
import json
import numpy as np

def accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()

def pre_recall_f1(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    tp = np.sum((labels == '1') & (preds == '1'))
    fp = np.sum((labels == '0') & (preds == '1'))
    fn = np.sum((labels == '1') & (preds == '0'))
    epsilon = 1e-31
    # recall = TP / (TP + FN)
    r = tp * 1.0 / (tp + fn + epsilon)
    # precision = TP / (TP + FP)
    p = tp * 1.0 / (tp + fp + epsilon)
    f1 = 2 * p * r / (p + r + epsilon)
    return p, r, f1

def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
    assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'
    if eval_phase == 'test':
        data_dir = "./data/test.tsv"
    else:
        data_dir = "./data/dev.tsv"

    labels = []
    with open(data_dir, "r") as file:
        for line in file:
            label = line.split("\t")[0]
            if label == 'label':  # skip the header row
                continue
            labels.append(str(label))

    preds = []
    with open(res_dir, "r") as file:
        for line in file:
            preds.append(str(json.loads(line)['label']))

    assert len(labels) == len(preds), "prediction results don't match the labels"
    print('data num: {}'.format(len(labels)))
    p, r, f1 = pre_recall_f1(preds, labels)
    print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(
        accuracy(preds, labels), p, r, f1))

res_evaluate()
================================================
FILE: examples/predict/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':
    # configs
    max_seqlen = 256
    batch_size = 8
    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'
    predict_file = './data/test.tsv'
    random_seed = 1
    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
    input_dim = config['hidden_size']
    num_classes = 2
    task_name = 'chnsenticorp'
    pred_output = './outputs/predict/'
    print_steps = 20
    pre_params = './pretrain/ERNIE-v1-zh-base/params'

    # ----------------------- for prediction -----------------------
    # step 1-1: create a reader for prediction
    print('prepare to predict...')
    predict_cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
    predict_cls_reader.load_data(predict_file, batch_size)
    # step 2: create a backbone of the model to extract text features
    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
    # step 3: register the backbone in the reader
    predict_cls_reader.register_with(pred_ernie)
    # step 4: create the task output head
    cls_pred_head = palm.head.Classify(num_classes, input_dim, phase='predict')
    # step 5-1: create a task trainer
    trainer = palm.Trainer(task_name)
    # step 5-2: build the forward graph with backbone and task head
    trainer.build_predict_forward(pred_ernie, cls_pred_head)
    # step 6: load pretrained parameters
    trainer.load_predict_model(pre_params)
    # step 7: fit the prepared reader and data
    trainer.fit_reader(predict_cls_reader, phase='predict')
    # step 8: predict
    print('predicting..')
    trainer.predict(print_steps=print_steps, output_dir=pred_output)
================================================
FILE: examples/tagging/README.md
================================================
## Example 3: Tagging
This task is a named entity recognition task. The following sections detail model preparation, dataset preparation, and how to run the task.
### Step 1: Prepare Pre-trained Models & Datasets
#### Pre-trained Model
The pre-trained model used in this example is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).
Make sure you have downloaded the required pre-trained model to the current folder.
#### Dataset
This task uses the `MSRA-NER (SIGHAN 2006)` dataset.
Download the dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.
The data has two fields, `text_a` and `label`, in tsv format. Here are some examples:
```
text_a label
在 这 里 恕 弟 不 恭 之 罪 , 敢 在 尊 前 一 诤 : 前 人 论 书 , 每 曰 “ 字 字 有 来 历 , 笔 笔 有 出 处 ” , 细 读 公 字 , 何 尝 跳 出 前 人 藩 篱 , 自 隶 变 而 后 , 直 至 明 季 , 兄 有 何 新 出 ? O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
相 比 之 下 , 青 岛 海 牛 队 和 广 州 松 日 队 的 雨 中 之 战 虽 然 也 是 0 ∶ 0 , 但 乏 善 可 陈 。 O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O O O O O O O O O O O
理 由 多 多 , 最 无 奈 的 却 是 : 5 月 恰 逢 双 重 考 试 , 她 攻 读 的 博 士 学 位 论 文 要 通 考 ; 她 任 教 的 两 所 学 校 , 也 要 在 这 段 时 日 大 考 。 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
```
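A line can be split into aligned (token, tag) pairs as sketched below. Note that the excerpt above shows the in-column separator as a space for readability; in the downloaded files the tokens and tags inside each column are joined with the `\x02` control character (which is why `evaluate.py` splits on `"\x02"`), so the separator is kept configurable here:

```python
def parse_line(line, sep=" "):
    """Split one tagging example into aligned (token, tag) pairs."""
    text_a, label = line.rstrip("\n").split("\t")
    tokens, tags = text_a.split(sep), label.split(sep)
    assert len(tokens) == len(tags), "every token needs exactly one tag"
    return list(zip(tokens, tags))

pairs = parse_line("青 岛 海 牛 队\tB-ORG I-ORG I-ORG I-ORG I-ORG")
print(pairs[0])  # ('青', 'B-ORG')
```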
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: in multi-gpu mode, PaddlePALM automatically splits each batch across the visible cards. For example, if `batch_size` is set to 64 and 4 cards are visible to PaddlePALM, then the batch size on each card is effectively 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **make sure the batch_size is divisible by the number of cards.**
During the run you will see logs like the following:
```
step 1/652 (epoch 0), loss: 216.002, speed: 0.32 steps/s
step 2/652 (epoch 0), loss: 202.567, speed: 1.28 steps/s
step 3/652 (epoch 0), loss: 170.677, speed: 1.05 steps/s
```
After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```
[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
```
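Each prediction line above is a list of label ids, one per (padded) sequence position. To make them readable, invert the tag-to-id mapping from `data/label_map.json` and truncate to the true sequence length. The mapping below is illustrative only, not the real file contents:

```python
# Hypothetical contents of data/label_map.json: tag -> id
label_map = {"B-PER": 0, "I-PER": 1, "B-ORG": 2, "I-ORG": 3,
             "B-LOC": 4, "I-LOC": 5, "O": 6}
id_to_tag = {v: k for k, v in label_map.items()}

pred_ids = [6, 6, 4, 5, 6]  # one predicted id per token position
tags = [id_to_tag[i] for i in pred_ids]
print(tags)  # ['O', 'O', 'B-LOC', 'I-LOC', 'O']
```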
### Step 3: Evaluate
Once you have the predictions, you can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data num: 4636
f1: 0.9918
```
================================================
FILE: examples/tagging/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request

def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")
    URLLIB.urlretrieve(url, src, reporthook=_reporthook)

abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)

tar = tarfile.open(download_path)
tar.extractall(target_dir)
os.remove(download_path)

dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)
for file in os.listdir(os.path.join(target_dir, 'task_data', 'msra_ner')):
    shutil.move(os.path.join(target_dir, 'task_data', 'msra_ner', file), dst_dir)
shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")
================================================
FILE: examples/tagging/evaluate.py
================================================
# -*- coding: utf-8 -*-
import json

def load_label_map(map_dir="./data/label_map.json"):
    """
    :param map_dir: path of the json dict mapping tags to ids
    :return: the loaded label map
    """
    return json.load(open(map_dir, "r"))

def cal_chunk(pred_label, refer_label):
    """Token-level micro F1 over all labels."""
    tp = dict()
    fn = dict()
    fp = dict()
    for i in range(len(refer_label)):
        if refer_label[i] == pred_label[i]:
            if refer_label[i] not in tp:
                tp[refer_label[i]] = 0
            tp[refer_label[i]] += 1
        else:
            if pred_label[i] not in fp:
                fp[pred_label[i]] = 0
            fp[pred_label[i]] += 1
            if refer_label[i] not in fn:
                fn[refer_label[i]] = 0
            fn[refer_label[i]] += 1
    tp_total = sum(tp.values())
    fn_total = sum(fn.values())
    fp_total = sum(fp.values())
    p_total = float(tp_total) / (tp_total + fp_total)
    r_total = float(tp_total) / (tp_total + fn_total)
    f_micro = 2 * p_total * r_total / (p_total + r_total)
    return f_micro

def res_evaluate(res_dir="./outputs/predict/predictions.json", data_dir="./data/test.tsv"):
    label_map = load_label_map()

    total_label = []
    with open(data_dir, "r") as file:
        first_flag = True
        for line in file:
            if first_flag:  # skip the header row
                first_flag = False
                continue
            line = line.strip("\n")
            if len(line) == 0:
                continue
            line = line.split("\t")
            if len(line) < 2:
                continue
            labels = line[1].split("\x02")
            total_label.append(labels)
    total_label = [[label_map[j] for j in i] for i in total_label]

    total_res = []
    with open(res_dir, "r") as file:
        cnt = 0
        for line in file:
            line = line.strip("\n")
            if len(line) == 0:
                continue
            try:
                res_arr = json.loads(line)
                if len(total_label[cnt]) < len(res_arr):
                    # skip the leading special-token position and the padding tail
                    total_res.append(res_arr[1: 1 + len(total_label[cnt])])
                else:
                    total_res.append(res_arr)
                    total_label[cnt] = total_label[cnt][: len(res_arr)]
            except ValueError:
                print("json format error: {}".format(cnt))
                print(line)
            cnt += 1

    total_res_equal = []
    total_label_equal = []
    assert len(total_label) == len(total_res), "prediction results don't match the labels"
    for i in range(len(total_label)):
        num = len(total_label[i])
        total_label_equal.extend(total_label[i])
        total_res[i] = total_res[i][:num]
        total_res_equal.extend(total_res[i])

    f1 = cal_chunk(total_res_equal, total_label_equal)
    print('data num: {}'.format(len(total_label)))
    print("f1: {:.4f}".format(f1))

res_evaluate()
================================================
FILE: examples/tagging/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':
    # configs
    max_seqlen = 256
    batch_size = 16
    num_epochs = 6
    lr = 5e-5
    num_classes = 7
    weight_decay = 0.01
    dropout_prob = 0.1
    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'
    label_map = './data/label_map.json'
    random_seed = 1
    train_file = './data/train.tsv'
    predict_file = './data/test.tsv'
    save_path = './outputs/'
    save_type = 'ckpt'
    pre_params = './pretrain/ERNIE-v1-zh-base/params'
    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
    input_dim = config['hidden_size']
    task_name = 'msra_ner'
    pred_output = './outputs/predict/'
    train_print_steps = 10
    pred_print_steps = 20

    # ----------------------- for training -----------------------
    # step 1-1: create a reader for training
    seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed)
    # step 1-2: load the training data
    seq_label_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)
    # step 2: create a backbone of the model to extract text features
    ernie = palm.backbone.ERNIE.from_config(config)
    # step 3: register the backbone in the reader
    seq_label_reader.register_with(ernie)
    # step 4: create the task output head
    seq_label_head = palm.head.SequenceLabel(num_classes, input_dim, dropout_prob)
    # step 5-1: create a task trainer
    trainer = palm.Trainer(task_name)
    # step 5-2: build the forward graph with backbone and task head
    loss_var = trainer.build_forward(ernie, seq_label_head)
    # step 6-1*: enable warmup
    n_steps = seq_label_reader.num_examples * num_epochs // batch_size
    warmup_steps = int(0.1 * n_steps)
    print('total_steps: {}'.format(n_steps))
    print('warmup_steps: {}'.format(warmup_steps))
    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: create an optimizer
    adam = palm.optimizer.Adam(loss_var, lr, sched)
    # step 6-3: build the backward graph
    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
    # step 7: fit the prepared reader and data
    trainer.fit_reader(seq_label_reader)
    # step 8-1*: load pretrained parameters
    trainer.load_pretrain(pre_params)
    # step 8-2*: set saver to save the model
    save_steps = 1951
    trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
    # step 8-3: start training
    trainer.train(print_steps=train_print_steps)

    # ----------------------- for prediction -----------------------
    # step 1-1: create a reader for prediction
    print('prepare to predict...')
    predict_seq_label_reader = palm.reader.SequenceLabelReader(vocab_path, max_seqlen, label_map, seed=random_seed, phase='predict')
    # step 1-2: load the prediction data
    predict_seq_label_reader.load_data(predict_file, batch_size)
    # step 2: create a backbone of the model to extract text features
    pred_ernie = palm.backbone.ERNIE.from_config(config, phase='predict')
    # step 3: register the backbone in the reader
    predict_seq_label_reader.register_with(pred_ernie)
    # step 4: create the task output head
    seq_label_pred_head = palm.head.SequenceLabel(num_classes, input_dim, phase='predict')
    # step 5: build the forward graph with backbone and task head
    trainer.build_predict_forward(pred_ernie, seq_label_pred_head)
    # step 6: load the checkpoint saved during training
    pred_model_path = './outputs/ckpt.step' + str(save_steps)
    trainer.load_ckpt(pred_model_path)
    # step 7: fit the prepared reader and data
    trainer.fit_reader(predict_seq_label_reader, phase='predict')
    # step 8: predict
    print('predicting..')
    trainer.predict(print_steps=pred_print_steps, output_dir=pred_output)
================================================
FILE: examples/train_with_eval/README.md
================================================
## Train-with-Evaluation Version of Example 1: Classification
This task is a sentiment analysis task. The following sections detail model preparation, dataset preparation, and how to run the task, with a focus on how to run evaluation during training in PaddlePALM.
### Step 1: Prepare Pre-trained Model & Dataset
#### Pre-trained Model
The pre-trained model used in this example is [ERNIE-v1-zh-base](https://github.com/PaddlePaddle/PALM/tree/r0.3-api).
Make sure you have downloaded the required pre-trained model to the current folder.
#### Dataset
This example demonstrates with [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all), a Chinese sentiment analysis dataset.
Download dataset:
```shell
python download.py
```
If everything goes well, there will be a folder named `data/` created with all the data files in it.
The dataset file (for training) should have 2 fields, `text_a` and `label`, stored in [tsv](https://en.wikipedia.org/wiki/Tab-separated_values) format. Here is an example:
```
label text_a
0 当当网名不符实,订货多日不见送货,询问客服只会推托,只会要求用户再下订单。如此服务留不住顾客的。去别的网站买书服务更好。
0 XP的驱动不好找!我的17号提的货,现在就降价了100元,而且还送杀毒软件!
1 <荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!
```
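The training file can be read with plain tab splitting, skipping the header row. A minimal sketch using an in-memory stand-in for `data/train.tsv`:

```python
import io

# In-memory stand-in for data/train.tsv: header row, then label<TAB>text_a
tsv = io.StringIO("label\ttext_a\n0\t服务一般。\n1\t非常满意!\n")
rows = [line.rstrip("\n").split("\t") for line in tsv]
header, records = rows[0], rows[1:]
labels = [int(label) for label, _ in records]
print(header, labels)  # ['label', 'text_a'] [0, 1]
```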
### Step 2: Train & Predict
The code used to perform this task is in `run.py`. If you have prepared the pre-training model and the data set required for the task, run:
```shell
python run.py
```
If you want to specify a specific gpu or use multiple gpus for training, please use **`CUDA_VISIBLE_DEVICES`**, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py
```
Note: in multi-gpu mode, PaddlePALM automatically splits each batch across the visible cards. For example, if `batch_size` is set to 64 and 4 cards are visible to PaddlePALM, then the batch size on each card is effectively 64/4=16. If you want to change the `batch_size` or the number of gpus used in the example, **make sure the batch_size is divisible by the number of cards.**
During the run you will see logs like the following:
```
step 1/154 (epoch 0), loss: 5.512, speed: 0.51 steps/s
step 2/154 (epoch 0), loss: 2.595, speed: 3.36 steps/s
step 3/154 (epoch 0), loss: 1.798, speed: 3.48 steps/s
```
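Unlike example 1, `run.py` in this example drives training manually: `trainer.fit_reader` returns an iterator of batches and `trainer.train_one_step` is called once per batch, which leaves a natural hook for periodic evaluation. The control flow, with toy stand-ins for the PaddlePALM calls, looks like this:

```python
def train_with_eval(batches, train_one_step, evaluate, eval_every=100):
    """Drive training step by step and evaluate every `eval_every` steps."""
    for step, batch in enumerate(batches, start=1):
        train_one_step(batch)
        if step % eval_every == 0:
            evaluate(step)

# Toy stand-ins for trainer.fit_reader(...) and trainer.train_one_step(...)
seen, eval_steps = [], []
train_with_eval(range(250), seen.append, eval_steps.append, eval_every=100)
print(len(seen), eval_steps)  # 250 [100, 200]
```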
After the run, you can view the saved models in the `outputs/` folder and the predictions in the `outputs/predict` folder. Here are some examples of predictions:
```
{"index": 0, "logits": [-0.2014336884021759, 0.6799028515815735], "probs": [0.29290086030960083, 0.7070990800857544], "label": 1}
{"index": 1, "logits": [0.8593899011611938, -0.29743513464927673], "probs": [0.7607553601264954, 0.23924466967582703], "label": 0}
{"index": 2, "logits": [0.7462944388389587, -0.7083730101585388], "probs": [0.8107157349586487, 0.18928426504135132], "label": 0}
```
### Step 3: Evaluate
Once you have the predictions, you can run the evaluation script to evaluate the model:
```shell
python evaluate.py
```
The evaluation results are as follows:
```
data num: 1200
accuracy: 0.9575, precision: 0.9634, recall: 0.9523, f1: 0.9578
```
================================================
FILE: examples/train_with_eval/download.py
================================================
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import tarfile
import shutil
import sys
import urllib
URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request

def download(src, url):
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        print('\r>> Downloading... {:.1%}'.format(percent), end="")
    URLLIB.urlretrieve(url, src, reporthook=_reporthook)

abs_path = os.path.abspath(__file__)
download_url = "https://ernie.bj.bcebos.com/task_data_zh.tgz"
download_path = os.path.join(os.path.dirname(abs_path), "task_data_zh.tgz")
target_dir = os.path.dirname(abs_path)
download(download_path, download_url)

tar = tarfile.open(download_path)
tar.extractall(target_dir)
os.remove(download_path)

dst_dir = os.path.join(os.path.dirname(abs_path), "data")
if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)
for file in os.listdir(os.path.join(target_dir, 'task_data', 'chnsenticorp')):
    shutil.move(os.path.join(target_dir, 'task_data', 'chnsenticorp', file), dst_dir)
shutil.rmtree(os.path.join(target_dir, 'task_data'))
print(" done!")
================================================
FILE: examples/train_with_eval/evaluate.py
================================================
# -*- coding: utf-8 -*-
import json
import numpy as np

def accuracy(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    return (preds == labels).mean()

def pre_recall_f1(preds, labels):
    preds = np.array(preds)
    labels = np.array(labels)
    tp = np.sum((labels == '1') & (preds == '1'))
    fp = np.sum((labels == '0') & (preds == '1'))
    fn = np.sum((labels == '1') & (preds == '0'))
    epsilon = 1e-31
    # recall = TP / (TP + FN)
    r = tp * 1.0 / (tp + fn + epsilon)
    # precision = TP / (TP + FP)
    p = tp * 1.0 / (tp + fp + epsilon)
    f1 = 2 * p * r / (p + r + epsilon)
    return p, r, f1

def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phase='test'):
    assert eval_phase in ['dev', 'test'], 'eval_phase should be dev or test'
    if eval_phase == 'test':
        data_dir = "./data/test.tsv"
    else:
        data_dir = "./data/dev.tsv"

    labels = []
    with open(data_dir, "r") as file:
        for line in file:
            label = line.split("\t")[0]
            if label == 'label':  # skip the header row
                continue
            labels.append(str(label))

    preds = []
    with open(res_dir, "r") as file:
        for line in file:
            preds.append(str(json.loads(line)['label']))

    assert len(labels) == len(preds), "prediction results don't match the labels"
    print('data num: {}'.format(len(labels)))
    p, r, f1 = pre_recall_f1(preds, labels)
    print("accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}, f1: {:.4f}".format(
        accuracy(preds, labels), p, r, f1))

res_evaluate()
================================================
FILE: examples/train_with_eval/run.py
================================================
# coding=utf-8
import paddlepalm as palm
import json

if __name__ == '__main__':
    # configs
    max_seqlen = 256
    batch_size = 8
    num_epochs = 10
    lr = 5e-5
    weight_decay = 0.01
    vocab_path = './pretrain/ERNIE-v1-zh-base/vocab.txt'
    train_file = './data/train.tsv'
    predict_file = './data/test.tsv'
    config = json.load(open('./pretrain/ERNIE-v1-zh-base/ernie_config.json'))
    input_dim = config['hidden_size']
    num_classes = 2
    dropout_prob = 0.1
    random_seed = 1
    task_name = 'chnsenticorp'
    save_path = './outputs/'
    pred_output = './outputs/predict/'
    save_type = 'ckpt'
    print_steps = 20
    pre_params = './pretrain/ERNIE-v1-zh-base/params'

    # ----------------------- for training -----------------------
    # step 1-1: create a reader for training
    cls_reader = palm.reader.ClassifyReader(vocab_path, max_seqlen, seed=random_seed)
    # step 1-2: load the training data
    cls_reader.load_data(train_file, batch_size, num_epochs=num_epochs)
    # step 2: create a backbone of the model to extract text features
    ernie = palm.backbone.ERNIE.from_config(config)
    # step 3: register the backbone in the reader
    cls_reader.register_with(ernie)
    # step 4: create the task output head
    cls_head = palm.head.Classify(num_classes, input_dim, dropout_prob)
    # step 5-1: create a task trainer
    trainer = palm.Trainer(task_name)
    # step 5-2: build the forward graph with backbone and task head
    loss_var = trainer.build_forward(ernie, cls_head)
    # step 6-1*: enable warmup
    n_steps = cls_reader.num_examples * num_epochs // batch_size
    warmup_steps = int(0.1 * n_steps)
    sched = palm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
    # step 6-2: create an optimizer
    adam = palm.optimizer.Adam(loss_var, lr, sched)
    # step 6-3: build the backward graph
    trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
    # step 7: fit the prepared reader and data
    iterator = trainer.fit_reader(cls_reader)
    # step 8-1*: load pretrained parameters
    trainer.load_pretrain(pre_params)
    # step 8-2*: set saver to save the model
    # save_steps = n_steps
    save_steps = 2396
    trainer.set_saver(save_steps=save_steps, save_path=save_path, save_type=save_type)
    # step 8-3: start training
    # you can also repeatedly fetch one training batch with trainer.get_one_batch()
    # batch = trainer.get_one_batch()
    for step, batch in enumerate(iterator, start=1):
        trainer.train_one_step(batch)
        if step % 100 == 0:
            print('do evaluation.')
            # insert evaluation code here
================================================
FILE: paddlepalm/__init__.py
================================================
from . import downloader
# from mtl_controller import Controller
#import controller
from . import optimizer
from . import lr_sched
from . import backbone
from . import reader
from . import head
from .trainer import Trainer
from .multihead_trainer import MultiHeadTrainer
#del interface
#del task_instance
#del default_settings
#del utils
================================================
FILE: paddlepalm/_downloader.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import tarfile
import shutil
from collections import OrderedDict
import sys
import urllib
URLLIB = urllib
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request

__all__ = ["download", "ls"]

_pretrain = (('RoBERTa-zh-base', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_ext_L-12_H-768_A-12.tar.gz'),
             ('RoBERTa-zh-large', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.tar.gz'),
             ('ERNIE-v2-en-base', 'https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz'),
             ('ERNIE-v2-en-large', 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz'),
             ('XLNet-cased-base', 'https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz'),
             ('XLNet-cased-large', 'https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz'),
             ('ERNIE-v1-zh-base', 'https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz'),
             ('ERNIE-v1-zh-base-max-len-512', 'https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz'),
             ('BERT-en-uncased-large-whole-word-masking', 'https://bert-models.bj.bcebos.com/wwm_uncased_L-24_H-1024_A-16.tar.gz'),
             ('BERT-en-cased-large-whole-word-masking', 'https://bert-models.bj.bcebos.com/wwm_cased_L-24_H-1024_A-16.tar.gz'),
             ('BERT-en-uncased-base', 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz'),
             ('BERT-en-uncased-large', 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz'),
             ('BERT-en-cased-base', 'https://bert-models.bj.bcebos.com/cased_L-12_H-768_A-12.tar.gz'),
             ('BERT-en-cased-large', 'https://bert-models.bj.bcebos.com/cased_L-24_H-1024_A-16.tar.gz'),
             ('BERT-multilingual-uncased-base', 'https://bert-models.bj.bcebos.com/multilingual_L-12_H-768_A-12.tar.gz'),
             ('BERT-multilingual-cased-base', 'https://bert-models.bj.bcebos.com/multi_cased_L-12_H-768_A-12.tar.gz'),
             ('BERT-zh-base', 'https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz'),
             ('utils', None))
_vocab = (('utils', None), ('utils', None))
_backbone = (('utils', None), ('utils', None))
_head = (('utils', None), ('utils', None))
_reader = (('utils', None), ('utils', None))

_items = (('pretrain', OrderedDict(_pretrain)),
          ('vocab', OrderedDict(_vocab)),
          ('backbone', OrderedDict(_backbone)),
          ('head', OrderedDict(_head)),
          ('reader', OrderedDict(_reader)))
_items = OrderedDict(_items)

def _download(item, scope, path, silent=False, convert=False):
    data_url = _items[item][scope]
    if data_url is None:
        return
    if not silent:
        print('Downloading {}: {} from {}...'.format(item, scope, data_url))
    data_dir = path + '/' + item + '/' + scope
    if not os.path.exists(data_dir):
        os.makedirs(os.path.join(data_dir))
    data_name = data_url.split('/')[-1]
    filename = data_dir + '/' + data_name

    # print download progress
    def _reporthook(count, chunk_size, total_size):
        bytes_so_far = count * chunk_size
        percent = float(bytes_so_far) / float(total_size)
        if percent > 1:
            percent = 1
        if not silent:
            print('\r>> Downloading... {:.1%}'.format(percent), end="")

    URLLIB.urlretrieve(data_url, filename, reporthook=_reporthook)
    if not silent:
        print(' done!')

    if item == 'pretrain':
        if not silent:
            print('Extracting {}...'.format(data_name), end=" ")
        if os.path.exists(filename):
            tar = tarfile.open(filename, 'r')
            tar.extractall(path=data_dir)
            tar.close()
            os.remove(filename)
        if len(os.listdir(data_dir)) == 1:
            source_path = data_dir + '/' + data_name.split('.')[0]
            fileList = os.listdir(source_path)
            for file in fileList:
                filePath = os.path.join(source_path, file)
                shutil.move(filePath, data_dir)
            os.removedirs(source_path)
        if not silent:
            print('done!')
    if convert:
        if not silent:
            print('Converting params...', end=" ")
        _convert(data_dir, silent)
        if not silent:
            print('done!')

def _convert(path, silent=False):
    if os.path.isfile(path + '/params/__palminfo__'):
        if not silent:
            print('already converted.')
    else:
        if os.path.exists(path + '/params/'):
            os.rename(path + '/params/', path + '/params1/')
            os.mkdir(path + '/params/')
            tar_model = tarfile.open(path + '/params/' + '__palmmodel__', 'w')
            tar_info = open(path + '/params/' + '__palminfo__', 'w')
            for root, dirs, files in os.walk(path + '/params1/'):
                for file in files:
                    src_file = os.path.join(root, file)
                    tar_model.add(src_file, '__paddlepalm_' + file)
                    tar_info.write('__paddlepalm_' + file)
                    os.remove(src_file)
            tar_model.close()
            tar_info.close()
            os.removedirs(path + '/params1/')

def download(item, scope='all', path='.'):
    """Download an item. The available items and scopes can be shown with `paddlepalm.downloader.ls`.

    Args:
        item: the item to download.
        scope: the scope of the item to download.
        path: the target dir to download to. Default is `.`, the current dir.
    """
    # item = item.lower()
    # scope = scope.lower()
    assert item in _items, '{} is not found. Support list: {}'.format(item, list(_items.keys()))
    if _items[item]['utils'] is not None:
        _download(item, 'utils', path, silent=True)
    if scope != 'all':
        assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys()))
        _download(item, scope, path)
    else:
        for s in _items[item].keys():
            _download(item, s, path)

def _ls(item, scope, l=10):
    if scope != 'all':
        assert scope in _items[item], '{} is not found. Support scopes: {}'.format(scope, list(_items[item].keys()))
        print('{}'.format(scope))
    else:
        for s in _items[item].keys():
            if s == 'utils':
                continue
            print('  => ' + s)

def ls(item='all', scope='all'):
    if scope == 'utils':
        return
    if item != 'all':
        assert item in _items, '{} is not found. Support list: {}'.format(item, list(_items.keys()))
        print('Available {} items:'.format(item))
        _ls(item, scope)
    else:
        l = max(map(len, _items.keys()))
        for i in _items.keys():
            print('Available {} items: '.format(i))
            _ls(i, scope, l)
================================================
FILE: paddlepalm/backbone/README.md
================================================
================================================
FILE: paddlepalm/backbone/__init__.py
================================================
from .ernie import ERNIE
from .bert import BERT
================================================
FILE: paddlepalm/backbone/base_backbone.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class Backbone(object):
"""interface of backbone model."""
def __init__(self, phase):
"""Constructs the backbone. At minimum, a `phase` argument must be accepted.
Note: implementations of this constructor must call the base-class constructor, so that the framework's built-in member variables get created.
Args:
phase: str. The running stage the backbone is invoked in; currently the training stage 'train' and the prediction stage 'predict' are supported.
"""
assert phase in ('train', 'predict'), "phase should be 'train' or 'predict'."
@property
def inputs_attr(self):
"""Describes the attributes of the input objects the backbone expects from the reader: the name, shape and dtype of each object. For an object of a scalar type (str, int, float, ...), set its shape to the empty list []; for a dimension of variable length, set the corresponding entry of shape to -1.
Return:
dict. Attribute descriptions of the input objects. For example, for text classification and matching tasks,
the reader a BERT backbone depends on mainly produces the following objects:
{"token_ids": ([-1, max_len], 'int64'),
"input_ids": ([-1, max_len], 'int64'),
"segment_ids": ([-1, max_len], 'int64'),
"input_mask": ([-1, max_len], 'float32')}"""
raise NotImplementedError()
@property
def outputs_attr(self):
"""Describes the attributes of the backbone's output objects: the name, shape and dtype of each object. For an object of a scalar type (str, int, float, ...), set its shape to the empty list []; for a dimension of variable length, set the corresponding entry of shape to -1.
Return:
dict. Attribute descriptions of the output objects. For example, for text classification and matching tasks,
the outputs of a BERT backbone may contain the following objects:
{"word_emb": ([-1, max_seqlen, word_emb_size], 'float32'),
"sentence_emb": ([-1, hidden_size], 'float32'),
"sim_vec": ([-1, hidden_size], 'float32')}"""
raise NotImplementedError()
def build(self, inputs):
"""Builds the backbone's computation graph: maps static-graph Variables matching inputs_attr to static-graph Variables matching outputs_attr.
Args:
inputs: dict. Maps each object name in inputs_attr to a graph Variable; inputs contains at least the objects defined in inputs_attr.
Return:
The graph Variables to output. They are added to the fetch_list, so their runtime values are computed at every train/predict step and passed to the postprocess method for the user to handle.
"""
raise NotImplementedError()
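The interface above can be exercised without Paddle installed. Below is a minimal, framework-free sketch of a custom backbone honoring the same contract; `ToyBackbone`, its attribute values, and the inline stub base class are illustrative assumptions, not part of PALM:

```python
# Stub of the Backbone interface (mirrors base_backbone.py), so the
# sketch is self-contained; the real base class lives in paddlepalm.
class Backbone(object):
    def __init__(self, phase):
        self._phase = phase

    @property
    def inputs_attr(self):
        raise NotImplementedError()

    @property
    def outputs_attr(self):
        raise NotImplementedError()

    def build(self, inputs):
        raise NotImplementedError()


class ToyBackbone(Backbone):
    """Declares one variable-length int64 input and one float32 output."""
    def __init__(self, hidden_size, phase='train'):
        super(ToyBackbone, self).__init__(phase)
        self._hidden_size = hidden_size

    @property
    def inputs_attr(self):
        # -1 marks a variable-length dimension, as documented above.
        return {"token_ids": [[-1, -1], 'int64']}

    @property
    def outputs_attr(self):
        return {"encoder_outputs": [[-1, -1, self._hidden_size], 'float32']}

    def build(self, inputs):
        # A real backbone would map graph Variables here; the sketch
        # just echoes the input to show the name-to-Variable contract.
        return {"encoder_outputs": inputs["token_ids"]}


bb = ToyBackbone(hidden_size=768)
print(sorted(bb.outputs_attr.keys()))
```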
================================================
FILE: paddlepalm/backbone/bert.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""v1.1
BERT model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from paddle import fluid
from paddle.fluid import layers
from paddlepalm.backbone.utils.transformer import pre_process_layer, encoder
from paddlepalm.backbone.base_backbone import Backbone
class BERT(Backbone):
def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \
max_position_embeddings, type_vocab_size, hidden_act, hidden_dropout_prob, \
attention_probs_dropout_prob, initializer_range, is_pairwise=False, phase='train'):
self._emb_size = hidden_size
self._n_layer = num_hidden_layers
self._n_head = num_attention_heads
self._voc_size = vocab_size
self._max_position_seq_len = max_position_embeddings
self._sent_types = type_vocab_size
self._hidden_act = hidden_act
self._prepostprocess_dropout = 0. if phase == 'predict' else hidden_dropout_prob
self._attention_dropout = 0. if phase == 'predict' else attention_probs_dropout_prob
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._phase = phase
self._is_pairwise = is_pairwise
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=initializer_range)
@classmethod
def from_config(cls, config, phase='train'):
assert 'hidden_size' in config, "{} is required to initialize BERT".format('hidden_size')
assert 'num_hidden_layers' in config, "{} is required to initialize BERT".format('num_hidden_layers')
assert 'num_attention_heads' in config, "{} is required to initialize BERT".format('num_attention_heads')
assert 'vocab_size' in config, "{} is required to initialize BERT".format('vocab_size')
assert 'max_position_embeddings' in config, "{} is required to initialize BERT".format('max_position_embeddings')
assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, \
"{} is required to initialize BERT".format('type_vocab_size')
assert 'hidden_act' in config, "{} is required to initialize BERT".format('hidden_act')
assert 'hidden_dropout_prob' in config, "{} is required to initialize BERT".format('hidden_dropout_prob')
assert 'attention_probs_dropout_prob' in config, \
"{} is required to initialize BERT".format('attention_probs_dropout_prob')
assert 'initializer_range' in config, "{} is required to initialize BERT".format('initializer_range')
hidden_size = config['hidden_size']
num_hidden_layers = config['num_hidden_layers']
num_attention_heads = config['num_attention_heads']
vocab_size = config['vocab_size']
max_position_embeddings = config['max_position_embeddings']
if 'sent_type_vocab_size' in config:
sent_type_vocab_size = config['sent_type_vocab_size']
else:
sent_type_vocab_size = config['type_vocab_size']
hidden_act = config['hidden_act']
hidden_dropout_prob = config['hidden_dropout_prob']
attention_probs_dropout_prob = config['attention_probs_dropout_prob']
initializer_range = config['initializer_range']
if 'is_pairwise' in config:
is_pairwise = config['is_pairwise']
else:
is_pairwise = False
return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \
max_position_embeddings, sent_type_vocab_size, \
hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, phase)
@property
def inputs_attr(self):
ret = {"token_ids": [[-1, -1], 'int64'],
"position_ids": [[-1, -1], 'int64'],
"segment_ids": [[-1, -1], 'int64'],
"input_mask": [[-1, -1, 1], 'float32'],
}
if self._is_pairwise and self._phase=='train':
ret.update({"token_ids_neg": [[-1, -1], 'int64'],
"position_ids_neg": [[-1, -1], 'int64'],
"segment_ids_neg": [[-1, -1], 'int64'],
"input_mask_neg": [[-1, -1, 1], 'float32'],
})
return ret
@property
def outputs_attr(self):
ret = {"word_embedding": [[-1, -1, self._emb_size], 'float32'],
"embedding_table": [[-1, self._voc_size, self._emb_size], 'float32'],
"encoder_outputs": [[-1, -1, self._emb_size], 'float32'],
"sentence_embedding": [[-1, self._emb_size], 'float32'],
"sentence_pair_embedding": [[-1, self._emb_size], 'float32']}
if self._is_pairwise and self._phase == 'train':
ret.update({"word_embedding_neg": [[-1, -1, self._emb_size], 'float32'],
"encoder_outputs_neg": [[-1, -1, self._emb_size], 'float32'],
"sentence_embedding_neg": [[-1, self._emb_size], 'float32'],
"sentence_pair_embedding_neg": [[-1, self._emb_size], 'float32']})
return ret
def build(self, inputs, scope_name=""):
src_ids = inputs['token_ids']
pos_ids = inputs['position_ids']
sent_ids = inputs['segment_ids']
input_mask = inputs['input_mask']
self._emb_dtype = 'float32'
input_buffer = {}
output_buffer = {}
input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask]
output_buffer['base'] = {}
if self._is_pairwise and self._phase =='train':
src_ids = inputs['token_ids_neg']
pos_ids = inputs['position_ids_neg']
sent_ids = inputs['segment_ids_neg']
input_mask = inputs['input_mask_neg']
input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask]
output_buffer['neg'] = {}
for key, (src_ids, pos_ids, sent_ids, input_mask) in input_buffer.items():
# padding id in vocabulary must be set to 0
emb_out = fluid.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
# fluid.global_scope().find_var('backbone-word_embedding').get_tensor()
embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name)
position_emb_out = fluid.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name=scope_name+'pre_encoder')
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
enc_out = encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name=scope_name+'encoder')
next_sent_feat = fluid.layers.slice(
input=enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name=scope_name+"pooled_fc.w_0", initializer=self._param_initializer),
bias_attr=scope_name+"pooled_fc.b_0")
output_buffer[key]['word_embedding'] = emb_out
output_buffer[key]['encoder_outputs'] = enc_out
output_buffer[key]['sentence_embedding'] = next_sent_feat
output_buffer[key]['sentence_pair_embedding'] = next_sent_feat
ret = {}
ret['embedding_table'] = embedding_table
ret['word_embedding'] = output_buffer['base']['word_embedding']
ret['encoder_outputs'] = output_buffer['base']['encoder_outputs']
ret['sentence_embedding'] = output_buffer['base']['sentence_embedding']
ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding']
if self._is_pairwise and self._phase == 'train':
ret['word_embedding_neg'] = output_buffer['neg']['word_embedding']
ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs']
ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding']
ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding']
return ret
def postprocess(self, rt_outputs):
pass
class Model(BERT):
"""BERT wrapper for ConfigController."""
def __new__(cls, config, phase):
# from_config is a classmethod that returns a fully configured instance;
# calling it inside __init__ would discard the result, so construct via __new__.
return BERT.from_config(config, phase=phase)
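The attention-mask construction in `build` above (matmul of the input mask with its own transpose, then `fluid.layers.scale(..., scale=10000.0, bias=-1.0, bias_after_scale=False)`, i.e. `10000 * (x - 1)`) can be checked in NumPy; the tiny batch below is an illustrative assumption:

```python
import numpy as np

# input_mask: [batch, seq_len, 1]; 1.0 marks real tokens, 0.0 marks padding.
input_mask = np.array([[[1.], [1.], [0.]]], dtype='float32')

# matmul(mask, mask^T): 1 where both the query and key positions are real.
self_attn_mask = np.matmul(input_mask, input_mask.transpose(0, 2, 1))

# scale=10000, bias=-1, bias_after_scale=False computes 10000 * (x - 1):
# 0 for valid pairs, -10000 wherever padding is involved. Added to the
# attention logits, this drives softmax weights on padding toward zero.
attn_bias = 10000.0 * (self_attn_mask - 1.0)
print(attn_bias[0])
```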
================================================
FILE: paddlepalm/backbone/ernie.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from paddle import fluid
from paddle.fluid import layers
from paddlepalm.backbone.utils.transformer import pre_process_layer, encoder
from paddlepalm.backbone.base_backbone import Backbone
class ERNIE(Backbone):
def __init__(self, hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \
max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \
hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise=False, use_task_emb=True, phase='train'):
# self._is_training = phase == 'train' # a backbone generally need not care about the running phase, since its outputs barely change across phases
self._emb_size = hidden_size
self._n_layer = num_hidden_layers
self._n_head = num_attention_heads
self._voc_size = vocab_size
self._max_position_seq_len = max_position_embeddings
self._sent_types = sent_type_vocab_size
self._task_types = task_type_vocab_size
self._hidden_act = hidden_act
self._prepostprocess_dropout = 0. if phase == 'predict' else hidden_dropout_prob
self._attention_dropout = 0. if phase == 'predict' else attention_probs_dropout_prob
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
self._is_pairwise = is_pairwise
self._use_task_emb = use_task_emb
self._phase=phase
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=initializer_range)
@classmethod
def from_config(cls, config, phase='train'):
assert 'hidden_size' in config, "{} is required to initialize ERNIE".format('hidden_size')
assert 'num_hidden_layers' in config, "{} is required to initialize ERNIE".format('num_hidden_layers')
assert 'num_attention_heads' in config, "{} is required to initialize ERNIE".format('num_attention_heads')
assert 'vocab_size' in config, "{} is required to initialize ERNIE".format('vocab_size')
assert 'max_position_embeddings' in config, "{} is required to initialize ERNIE".format('max_position_embeddings')
assert 'sent_type_vocab_size' in config or 'type_vocab_size' in config, "{} is required to initialize ERNIE".format('sent_type_vocab_size')
# assert 'task_type_vocab_size' in config, "{} is required to initialize ERNIE".format('task_type_vocab_size')
assert 'hidden_act' in config, "{} is required to initialize ERNIE".format('hidden_act')
assert 'hidden_dropout_prob' in config, "{} is required to initialize ERNIE".format('hidden_dropout_prob')
assert 'attention_probs_dropout_prob' in config, "{} is required to initialize ERNIE".format('attention_probs_dropout_prob')
assert 'initializer_range' in config, "{} is required to initialize ERNIE".format('initializer_range')
hidden_size = config['hidden_size']
num_hidden_layers = config['num_hidden_layers']
num_attention_heads = config['num_attention_heads']
vocab_size = config['vocab_size']
max_position_embeddings = config['max_position_embeddings']
if 'sent_type_vocab_size' in config:
sent_type_vocab_size = config['sent_type_vocab_size']
else:
sent_type_vocab_size = config['type_vocab_size']
if 'task_type_vocab_size' in config:
task_type_vocab_size = config['task_type_vocab_size']
else:
task_type_vocab_size = config['type_vocab_size']
if 'use_task_emb' in config:
use_task_emb = config['use_task_emb']
else:
use_task_emb = True
hidden_act = config['hidden_act']
hidden_dropout_prob = config['hidden_dropout_prob']
attention_probs_dropout_prob = config['attention_probs_dropout_prob']
initializer_range = config['initializer_range']
if 'is_pairwise' in config:
is_pairwise = config['is_pairwise']
else:
is_pairwise = False
return cls(hidden_size, num_hidden_layers, num_attention_heads, vocab_size, \
max_position_embeddings, sent_type_vocab_size, task_type_vocab_size, \
hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, initializer_range, is_pairwise, use_task_emb=use_task_emb, phase=phase)
@property
def inputs_attr(self):
ret = {"token_ids": [[-1, -1], 'int64'],
"position_ids": [[-1, -1], 'int64'],
"segment_ids": [[-1, -1], 'int64'],
"input_mask": [[-1, -1, 1], 'float32'],
"task_ids": [[-1,-1], 'int64']}
if self._is_pairwise and self._phase=='train':
ret.update({"token_ids_neg": [[-1, -1], 'int64'],
"position_ids_neg": [[-1, -1], 'int64'],
"segment_ids_neg": [[-1, -1], 'int64'],
"input_mask_neg": [[-1, -1, 1], 'float32'],
"task_ids_neg": [[-1,-1], 'int64']
})
return ret
@property
def outputs_attr(self):
ret = {"word_embedding": [[-1, -1, self._emb_size], 'float32'],
"embedding_table": [[-1, self._voc_size, self._emb_size], 'float32'],
"encoder_outputs": [[-1, -1, self._emb_size], 'float32'],
"sentence_embedding": [[-1, self._emb_size], 'float32'],
"sentence_pair_embedding": [[-1, self._emb_size], 'float32']}
if self._is_pairwise and self._phase == 'train':
ret.update({"word_embedding_neg": [[-1, -1, self._emb_size], 'float32'],
"encoder_outputs_neg": [[-1, -1, self._emb_size], 'float32'],
"sentence_embedding_neg": [[-1, self._emb_size], 'float32'],
"sentence_pair_embedding_neg": [[-1, self._emb_size], 'float32']})
return ret
def build(self, inputs, scope_name=""):
src_ids = inputs['token_ids']
pos_ids = inputs['position_ids']
sent_ids = inputs['segment_ids']
input_mask = inputs['input_mask']
task_ids = inputs['task_ids']
input_buffer = {}
output_buffer = {}
input_buffer['base'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids]
output_buffer['base'] = {}
if self._is_pairwise and self._phase =='train':
src_ids = inputs['token_ids_neg']
pos_ids = inputs['position_ids_neg']
sent_ids = inputs['segment_ids_neg']
input_mask = inputs['input_mask_neg']
task_ids = inputs['task_ids_neg']
input_buffer['neg'] = [src_ids, pos_ids, sent_ids, input_mask, task_ids]
output_buffer['neg'] = {}
for key, (src_ids, pos_ids, sent_ids, input_mask, task_ids) in input_buffer.items():
# padding id in vocabulary must be set to 0
emb_out = fluid.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
# fluid.global_scope().find_var('backbone-word_embedding').get_tensor()
embedding_table = fluid.default_main_program().global_block().var(scope_name+self._word_emb_name)
position_emb_out = fluid.embedding(
input=pos_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.embedding(
sent_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
if self._use_task_emb:
task_emb_out = fluid.embedding(
task_ids,
size=[self._task_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=scope_name+self._task_emb_name,
initializer=self._param_initializer))
emb_out = emb_out + task_emb_out
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name=scope_name+'pre_encoder')
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
enc_out = encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name=scope_name+'encoder')
next_sent_feat = fluid.layers.slice(
input=enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.reshape(next_sent_feat, [-1, next_sent_feat.shape[-1]])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(
name=scope_name+"pooled_fc.w_0", initializer=self._param_initializer),
bias_attr=scope_name+"pooled_fc.b_0")
output_buffer[key]['word_embedding'] = emb_out
output_buffer[key]['encoder_outputs'] = enc_out
output_buffer[key]['sentence_embedding'] = next_sent_feat
output_buffer[key]['sentence_pair_embedding'] = next_sent_feat
ret = {}
ret['embedding_table'] = embedding_table
ret['word_embedding'] = output_buffer['base']['word_embedding']
ret['encoder_outputs'] = output_buffer['base']['encoder_outputs']
ret['sentence_embedding'] = output_buffer['base']['sentence_embedding']
ret['sentence_pair_embedding'] = output_buffer['base']['sentence_pair_embedding']
if self._is_pairwise and self._phase == 'train':
ret['word_embedding_neg'] = output_buffer['neg']['word_embedding']
ret['encoder_outputs_neg'] = output_buffer['neg']['encoder_outputs']
ret['sentence_embedding_neg'] = output_buffer['neg']['sentence_embedding']
ret['sentence_pair_embedding_neg'] = output_buffer['neg']['sentence_pair_embedding']
return ret
def postprocess(self, rt_outputs):
pass
class Model(ERNIE):
"""ERNIE wrapper for ConfigController."""
def __new__(cls, config, phase):
# from_config returns a fully configured instance; construct via __new__ so the result is not discarded.
return ERNIE.from_config(config, phase=phase)
================================================
FILE: paddlepalm/backbone/utils/__init__.py
================================================
================================================
FILE: paddlepalm/backbone/utils/transformer.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.layer_helper import LayerHelper as LayerHelper
from functools import reduce # py3
def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None):
helper = LayerHelper('layer_norm', **locals())
mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True)
shift_x = layers.elementwise_sub(x=x, y=mean, axis=0)
variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True)
r_stdev = layers.rsqrt(variance + epsilon)
norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0)
param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])]
param_dtype = norm_x.dtype
scale = helper.create_parameter(
attr=param_attr,
shape=param_shape,
dtype=param_dtype,
default_initializer=fluid.initializer.Constant(1.))
bias = helper.create_parameter(
attr=bias_attr,
shape=param_shape,
dtype=param_dtype,
is_bias=True,
default_initializer=fluid.initializer.Constant(0.))
out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1)
out = layers.elementwise_add(x=out, y=bias, axis=-1)
return out
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, masking selected positions so that
they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, the reverse of __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self-)attention sublayer followed by
position-wise feed-forward networks, both accompanied by
post_process_layer to add residual connections, layer normalization
and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i))
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
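The `preprocess_cmd`/`postprocess_cmd` strings above (`"n"`, `"da"`) are interpreted character by character by `pre_post_process_layer`: `a` adds the residual, `n` applies layer normalization, `d` applies dropout. A dependency-free sketch of that dispatch (`apply_cmds` is a hypothetical name; layer norm and dropout are reduced to identities here, purely for illustration):

```python
def apply_cmds(x, prev_out, cmds):
    # Toy interpretation of a process_cmd string such as "n" or "da".
    for cmd in cmds:
        if cmd == "a":    # residual connection
            x = x + (prev_out if prev_out is not None else 0.0)
        elif cmd == "n":  # layer normalization (identity in this sketch)
            x = x
        elif cmd == "d":  # dropout (identity at rate 0)
            x = x
    return x

# postprocess_cmd="da": dropout, then add the residual
result = apply_cmds(2.0, 3.0, "da")
```

With real tensors the `n` and `d` branches call `layer_norm` and `fluid.layers.dropout`; only the dispatch order is shown here.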
================================================
FILE: paddlepalm/distribute/__init__.py
================================================
from paddle import fluid
import os
import multiprocessing
gpu_dev_count = int(fluid.core.get_cuda_device_count())
cpu_dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
from .reader import yield_pieces, data_feeder, decode_fake
================================================
FILE: paddlepalm/distribute/reader.py
================================================
from . import gpu_dev_count, cpu_dev_count
try:
import queue as Queue
except ImportError:
import Queue
from threading import Thread
dev_count = gpu_dev_count if gpu_dev_count > 0 else cpu_dev_count
def yield_pieces(data, distribute_strategy, batch_size):
"""
Args:
distribute_strategy: support s=split, c=copy, u=unstack,
"""
assert batch_size % dev_count == 0, "batch_size must be an integer multiple of dev_count."
# print('data in yield pieces')
# print(len(data))
assert type(data) == type(distribute_strategy), [type(data), type(distribute_strategy)]
assert len(data) == len(distribute_strategy), [len(data), len(distribute_strategy)]
if isinstance(data, dict):
keys = list(data.keys())
data_list = [data[i] for i in keys]
ds_list = [distribute_strategy[i] for i in keys]
else:
assert isinstance(data, list), "the input data must be a list or dict containing multiple tensors."
data_list = data
ds_list = distribute_strategy
stride = batch_size // dev_count
p = stride
# while p < len(data_list) + stride:
while p <= batch_size:
temp = []
for d, s in zip(data_list, ds_list):
s = s.strip().lower()
if s == 's' or s == 'split':
if p - stride >= len(d):
# print('WARNING: no more examples to feed empty devices')
temp = []
return
temp.append(d[p-stride:p])
elif s == 'u' or s == 'unstack':
assert len(d) <= dev_count, 'Tensor size on dim 0 must be less than or equal to dev_count when unstack is applied.'
if p//stride > len(d):
# print('WARNING: no more examples to feed empty devices')
return
temp.append(d[p//stride-1])
elif s == 'c' or s == 'copy':
temp.append(d)
else:
raise NotImplementedError()
p += stride
if type(data) == dict:
yield dict(zip(*[keys, temp]))
else:
# print('yielded pieces')
# print(len(temp))
yield temp
def data_feeder(reader, postprocess_fn=None, prefetch_steps=2, phase='train', is_multi=False):
if postprocess_fn is None:
def postprocess_fn(batch, id=-1, phase='train', is_multi=False):
return batch
def worker(reader, dev_count, queue):
dev_batches = []
for index, data in enumerate(reader()):
if len(dev_batches) < dev_count:
dev_batches.append(data)
if len(dev_batches) == dev_count:
queue.put((dev_batches, 0))
dev_batches = []
# For the prediction of the remained batches, pad more batches to
# the number of devices and the padded samples would be removed in
# prediction outputs.
if len(dev_batches) > 0:
num_pad = dev_count - len(dev_batches)
for i in range(len(dev_batches), dev_count):
dev_batches.append(dev_batches[-1])
queue.put((dev_batches, num_pad))
queue.put(None)
queue = Queue.Queue(dev_count*prefetch_steps)
p = Thread(
target=worker, args=(reader, dev_count, queue))
p.daemon = True
p.start()
while True:
ret = queue.get()
queue.task_done()
if ret is not None:
batches, num_pad = ret
if dev_count > 1 and phase == 'train' and is_multi:
id = batches[0]['__task_id'][0]
else:
id = -1
batch_buf = []
flag_buf = []
for idx, batch in enumerate(batches):
# the last num_pad batches are padding copies; mark them as invalid
flag = idx - len(batches) < -num_pad
batch = postprocess_fn(batch, id, phase, is_multi=is_multi)
batch_buf.append(batch)
flag_buf.append(flag)
yield batch_buf, flag_buf
else:
break
queue.join()
def decode_fake(nums, mask, bs):
"""Given `nums` fetched outputs, the per-device validity `mask` and the
global batch size `bs`, return how many outputs come from padded (fake)
examples, so that they can be stripped from prediction results."""
bs //= dev_count
# count the leading devices that were fed real batches
n_t = 0
for flag in mask:
if not flag:
break
n_t = n_t + 1
n_f = len(mask) - n_t  # devices fed with padded batches
p1 = nums - (n_t-1) * bs
assert p1 % (n_f+1) == 0
each_f = p1 // (n_f+1)
return each_f * n_f
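As a concrete illustration of the three distribute strategies handled by `yield_pieces`, here is a dependency-free re-implementation of its slicing loop (`split_for_devices` is a hypothetical name, not part of the package; `dev_count` is fixed to 2):

```python
def split_for_devices(data_list, strategies, batch_size, dev_count=2):
    """Slice one global batch into per-device pieces, mirroring yield_pieces:
    'split' gives each device its own slice of the batch, 'unstack' gives each
    device one row, and 'copy' replicates the tensor on every device."""
    assert batch_size % dev_count == 0
    stride = batch_size // dev_count
    pieces = []
    for p in range(stride, batch_size + 1, stride):
        temp = []
        for d, s in zip(data_list, strategies):
            if s == 'split':
                temp.append(d[p - stride:p])
            elif s == 'unstack':
                temp.append(d[p // stride - 1])
            elif s == 'copy':
                temp.append(d)
        pieces.append(temp)
    return pieces

# a batch of 4 examples distributed across 2 devices
pieces = split_for_devices(
    data_list=[[0, 1, 2, 3], [10, 20], 'meta'],
    strategies=['split', 'unstack', 'copy'],
    batch_size=4)
```

Each of the two resulting pieces holds its own batch slice, its own unstacked row, and a shared copy of the third element.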
================================================
FILE: paddlepalm/downloader.py
================================================
from ._downloader import *
================================================
FILE: paddlepalm/head/__init__.py
================================================
from .cls import Classify
from .match import Match
from .ner import SequenceLabel
from .mrc import MRC
from .mlm import MaskLM
================================================
FILE: paddlepalm/head/base_head.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import json
import copy
class Head(object):
def __init__(self, phase='train'):
"""该函数完成一个任务头的构造,至少需要包含一个phase参数。
注意:实现该构造函数时,必须保证对基类构造函数的调用,以创建必要的框架内建的成员变量。
Args:
phase: str类型。用于区分任务头被调用时所处的任务运行阶段,目前支持训练阶段train和预测阶段predict
"""
self._stop_gradient = {}
self._phase = phase
self._prog = None
self._results_buffer = []
@property
def inputs_attrs(self):
"""step级别的任务输入对象声明。
描述该任务头所依赖的reader、backbone和来自其他任务头的输出对象(每个step获取一次)。使用字典进行描述,
字典的key为输出对象所在的组件(如’reader‘,’backbone‘等),value为该组件下任务头所需要的输出对象集。
输出对象集使用字典描述,key为输出对象的名字(该名字需保证在相关组件的输出对象集中),value为该输出对象
的shape和dtype。当某个输出对象的某个维度长度可变时,shape中的相应维度设置为-1。
Return:
dict类型。描述该任务头所依赖的step级输入,即来自各个组件的输出对象。"""
raise NotImplementedError()
@property
def outputs_attr(self):
"""step级别的任务输出对象声明。
描述该任务头的输出对象(每个step输出一次),包括每个输出对象的名字,shape和dtype。输出对象会被加入到
fetch_list中,从而在每个训练/推理step时得到实时的计算结果,该计算结果可以传入batch_postprocess方
法中进行当前step的后处理。当某个对象为标量数据类型(如str, int, float等)时,shape设置为空列表[],
当某个对象的某个维度长度可变时,shape中的相应维度设置为-1。
Return:
dict类型。描述该任务头所产生的输出对象。注意,在训练阶段时必须包含名为loss的输出对象。
"""
raise NotImplementedError()
@property
def epoch_inputs_attrs(self):
"""epoch级别的任务输入对象声明。
描述该任务所依赖的来自reader、backbone和来自其他任务头的输出对象(每个epoch结束后产生一次),如完整的
样本集,有效的样本数等。使用字典进行描述,字典的key为输出对象所在的组件(如’reader‘,’backbone‘等),
value为该组件下任务头所需要的输出对象集。输出对象集使用字典描述,key为输出对象的名字(该名字需保证在相关
组件的输出对象集中),value为该输出对象的shape和dtype。当某个输出对象的某个维度长度可变时,shape中的相
应维度设置为-1。
Return:
dict类型。描述该任务头所产生的输出对象。注意,在训练阶段时必须包含名为loss的输出对象。
"""
return {}
def build(self, inputs, scope_name=""):
"""建立任务头的计算图。
将符合inputs_attrs描述的来自各个对象集的静态图Variables映射成符合outputs_attr描述的静态图Variable输出。
Args:
inputs: dict类型。字典中包含inputs_attrs中的对象名到计算图Variable的映射,inputs中至少会包含inputs_attr中定义的对象
Return:
需要输出的计算图变量,输出对象会被加入到fetch_list中,从而在每个训练/推理step时得到runtime的计算结果,该计算结果会被传入postprocess方法中供用户处理。
"""
raise NotImplementedError()
def batch_postprocess(self, rt_outputs):
"""batch/step级别的后处理。
每个训练或推理step后针对当前batch的任务头输出对象的实时计算结果来进行相关后处理。
默认将输出结果存储到缓冲区self._results_buffer中。"""
if isinstance(rt_outputs, dict):
keys = rt_outputs.keys()
vals = [rt_outputs[k] for k in keys]
lens = [len(v) for v in vals]
if len(set(lens)) == 1:
results = [dict(zip(*[keys, i])) for i in zip(*vals)]
self._results_buffer.extend(results)
return results
else:
print('WARNING: irregular output lengths; falling back to storing the raw batch.')
self._results_buffer.append(rt_outputs)
return None
def reset(self):
"""清空该任务头的缓冲区(在训练或推理过程中积累的处理结果)"""
self._results_buffer = []
def get_results(self):
"""返回当前任务头积累的处理结果。"""
return copy.deepcopy(self._results_buffer)
def epoch_postprocess(self, post_inputs=None, output_dir=None):
"""epoch级别的后处理。
每个训练或推理epoch结束后,对积累的各样本的后处理结果results进行后处理。默认情况下,当output_dir为None时,直接将results打印到
屏幕上。当指定output_dir时,将results存储在指定的文件夹内,并以任务头所处阶段来作为存储文件的文件名。
Args:
post_inputs: 当声明的epoch_inputs_attr不为空时,该参数会携带对应的输入变量的内容。
output_dir: 积累结果的保存路径。
"""
if output_dir is not None:
if not os.path.exists(output_dir):
os.makedirs(output_dir)
with open(os.path.join(output_dir, self._phase), 'w') as writer:
for i in self._results_buffer:
writer.write(json.dumps(i)+'\n')
else:
return self._results_buffer
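The default `batch_postprocess` above transposes a dict of per-batch arrays into a list of per-example dicts. That transformation, shown in isolation (`transpose_outputs` is a hypothetical name, pure Python, no framework needed):

```python
def transpose_outputs(rt_outputs):
    # {'label': [1, 0], 'prob': [0.9, 0.2]} -> one dict per example,
    # the same zip-based transposition used in Head.batch_postprocess
    keys = list(rt_outputs.keys())
    vals = [rt_outputs[k] for k in keys]
    return [dict(zip(keys, row)) for row in zip(*vals)]

results = transpose_outputs({'label': [1, 0], 'prob': [0.9, 0.2]})
```

This only works when every fetched output has the same length; `batch_postprocess` falls back to storing the raw dict otherwise.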
================================================
FILE: paddlepalm/head/cls.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from paddle.fluid import layers
from paddlepalm.head.base_head import Head
import numpy as np
import os
import json
class Classify(Head):
"""
classification
"""
def __init__(self, num_classes, input_dim, dropout_prob=0.0, \
param_initializer_range=0.02, phase='train'):
self._is_training = phase == 'train'
self._hidden_size = input_dim
self.num_classes = num_classes
self._dropout_prob = dropout_prob if phase == 'train' else 0.0
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=param_initializer_range)
self._preds = []
self._probs = []
@property
def inputs_attrs(self):
reader = {}
bb = {"sentence_embedding": [[-1, self._hidden_size], 'float32']}
if self._is_training:
reader["label_ids"] = [[-1], 'int64']
return {'reader': reader, 'backbone': bb}
@property
def outputs_attrs(self):
if self._is_training:
return {'loss': [[1], 'float32']}
else:
return {'logits': [[-1, self.num_classes], 'float32'],
'probs': [[-1, self.num_classes], 'float32']}
def build(self, inputs, scope_name=''):
sent_emb = inputs['backbone']['sentence_embedding']
if self._is_training:
label_ids = inputs['reader']['label_ids']
cls_feats = fluid.layers.dropout(
x=sent_emb,
dropout_prob=self._dropout_prob,
dropout_implementation="upscale_in_train")
# note: classify on the dropout-regularized features, not the raw embedding
logits = fluid.layers.fc(
input=cls_feats,
size=self.num_classes,
param_attr=fluid.ParamAttr(
name=scope_name+"cls_out_w",
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(
name=scope_name+"cls_out_b", initializer=fluid.initializer.Constant(0.)))
probs = fluid.layers.softmax(logits)
if self._is_training:
loss = fluid.layers.cross_entropy(
input=probs, label=label_ids)
loss = layers.mean(loss)
return {"loss": loss}
else:
return {"logits":logits,
"probs":probs}
def batch_postprocess(self, rt_outputs):
if not self._is_training:
logits = rt_outputs['logits']
probs = rt_outputs['probs']
self._preds.extend(logits.tolist())
self._probs.extend(probs.tolist())
def epoch_postprocess(self, post_inputs, output_dir=None):
# epoch_inputs_attrs is empty for this head, so post_inputs carries no elements
if not self._is_training:
results = []
for i in range(len(self._preds)):
label = int(np.argmax(np.array(self._preds[i])))
result = {'index': i, 'label': label, 'logits': self._preds[i], 'probs': self._probs[i]}
results.append(result)
if output_dir is not None:
with open(os.path.join(output_dir, 'predictions.json'), 'w') as writer:
for result in results:
result = json.dumps(result)
writer.write(result+'\n')
print('Predictions saved at '+os.path.join(output_dir, 'predictions.json'))
return results
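`Classify.epoch_postprocess` derives each predicted label by arg-maxing the accumulated logits. The per-example step, standalone (`label_from_logits` is a hypothetical helper name):

```python
import numpy as np

def label_from_logits(logits):
    # the index of the highest logit is the predicted class,
    # as in Classify.epoch_postprocess
    return int(np.argmax(np.array(logits)))

pred = label_from_logits([0.1, 2.3, -0.7])
```

Since softmax is monotonic, arg-maxing the logits gives the same label as arg-maxing the probabilities.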
================================================
FILE: paddlepalm/head/match.py
================================================
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from paddle.fluid import layers
from paddlepalm.head.base_head import Head
import numpy as np
import os
import json
def computeHingeLoss(pos, neg, margin):
loss_part1 = fluid.layers.elementwise_sub(
fluid.layers.fill_constant_batch_size_like(
input=pos, shape=[-1, 1], value=margin, dtype='float32'), pos)
loss_part2 = fluid.layers.elementwise_add(loss_part1, neg)
loss_part3 = fluid.layers.elementwise_max(
fluid.layers.fill_constant_batch_size_like(
input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'), loss_part2)
return loss_part3
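computeHingeLoss builds the pairwise ranking loss max(0, margin - pos + neg) out of fluid ops. A NumPy equivalent for reference (`hinge_loss` is a hypothetical name; the sample values are exactly representable in binary floating point):

```python
import numpy as np

def hinge_loss(pos, neg, margin=0.5):
    # elementwise max(0, margin - pos + neg), the same formula
    # computeHingeLoss assembles from fluid layers
    return np.maximum(0.0, margin - pos + neg)

loss = hinge_loss(np.array([1.0, 0.25]), np.array([0.25, 0.5]))
# first pair is ranked correctly by more than the margin -> zero loss;
# second pair violates the margin -> positive loss
```

The loss is zero whenever the positive score beats the negative score by at least the margin, which is what drives the pairwise strategy below.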
class Match(Head):
'''
matching
'''
def __init__(self, num_classes, input_dim, dropout_prob=0.0, param_initializer_range=0.02, \
learning_strategy='pointwise', margin=0.5, phase='train'):
"""
Args:
phase: train, eval, pred
lang: en, ch, ...
learning_strategy: pointwise, pairwise
"""
self._is_training = phase == 'train'
self._hidden_size = input_dim
self._num_classes = num_classes
self._dropout_prob = dropout_prob if phase == 'train' else 0.0
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=param_initializer_range)
self._learning_strategy = learning_strategy
self._margin = margin
self._preds = []
self._preds_logits = []
@property
def inputs_attrs(self):
reader = {}
bb = {"sentence_pair_embedding": [[-1, self._hidden_size], 'float32']}
if self._is_training:
if self._learning_strategy == 'pointwise':
reader["label_ids"] = [[-1], 'int64']
elif self._learning_strategy == 'pairwise':
bb["sentence_pair_embedding_neg"] = [[-1, self._hidden_size], 'float32']
return {'reader': reader, 'backbone': bb}
@property
def outputs_attrs(self):
if self._is_training:
return {"loss": [[1], 'float32']}
else:
if self._learning_strategy=='pairwise':
return {"probs": [[-1, 1], 'float32']}
else:
return {"logits": [[-1, self._num_classes], 'float32'],
"probs": [[-1, self._num_classes], 'float32']}
def build(self, inputs, scope_name=""):
# inputs
cls_feats = inputs["backbone"]["sentence_pair_embedding"]
if self._is_training:
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=self._dropout_prob,
dropout_implementation="upscale_in_train")
if self._learning_strategy == 'pairwise':
cls_feats_neg = inputs["backbone"]["sentence_pair_embedding_neg"]
cls_feats_neg = fluid.layers.dropout(
x=cls_feats_neg,
dropout_prob=self._dropout_prob,
dropout_implementation="upscale_in_train")
elif self._learning_strategy == 'pointwise':
labels = inputs["reader"]["label_ids"]
# loss
# for pointwise
if self._learning_strategy == 'pointwise':
logits = fluid.layers.fc(
input=cls_feats,
size=self._num_classes,
param_attr=fluid.ParamAttr(
name=scope_name+"cls_out_w",
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(
name=scope_name+"cls_out_b",
initializer=fluid.initializer.Constant(0.)))
probs = fluid.layers.softmax(logits)
if self._is_training:
ce_loss = fluid.layers.cross_entropy(
input=probs, label=labels)
loss = fluid.layers.mean(x=ce_loss)
return {'loss': loss}
# for pred
else:
return {'logits': logits,
'probs': probs}
# for pairwise
elif self._learning_strategy == 'pairwise':
pos_score = fluid.layers.fc(
input=cls_feats,
size=1,
act = "sigmoid",
param_attr=fluid.ParamAttr(
name=scope_name+"cls_out_w_pr",
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(
name=scope_name+"cls_out_b_pr",
initializer=fluid.initializer.Constant(0.)))
pos_score = fluid.layers.reshape(x=pos_score, shape=[-1, 1], inplace=True)
if self._is_training:
neg_score = fluid.layers.fc(
input=cls_feats_neg,
size=1,
act = "sigmoid",
param_attr=fluid.ParamAttr(
name=scope_name+"cls_out_w_pr",
initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(
name=scope_name+"cls_out_b_pr",
SYMBOL INDEX (406 symbols across 53 files)
FILE: examples/classification/download.py
function download (line 13) | def download(src, url):
FILE: examples/classification/evaluate.py
function accuracy (line 6) | def accuracy(preds, labels):
function pre_recall_f1 (line 11) | def pre_recall_f1(preds, labels):
function res_evaluate (line 26) | def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phas...
FILE: examples/matching/download.py
function download (line 11) | def download(src, url):
FILE: examples/matching/evaluate.py
function accuracy (line 6) | def accuracy(preds, labels):
function pre_recall_f1 (line 11) | def pre_recall_f1(preds, labels):
function res_evaluate (line 26) | def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phas...
FILE: examples/mrc/download.py
function download (line 13) | def download(src, url):
FILE: examples/mrc/evaluate.py
function mixed_segmentation (line 39) | def mixed_segmentation(in_str, rm_punc=False):
function remove_punctuation (line 69) | def remove_punctuation(in_str):
function find_lcs (line 86) | def find_lcs(s1, s2):
function evaluate (line 100) | def evaluate(ground_truth_file, prediction_file):
function calc_f1_score (line 129) | def calc_f1_score(answers, prediction):
function calc_em_score (line 145) | def calc_em_score(answers, prediction):
function eval_file (line 156) | def eval_file(dataset_file, prediction_file):
FILE: examples/multi-task/download.py
function download (line 13) | def download(src, url):
FILE: examples/multi-task/evaluate_intent.py
function accuracy (line 6) | def accuracy(preds, labels):
function pre_recall_f1 (line 11) | def pre_recall_f1(preds, labels):
function res_evaluate (line 26) | def res_evaluate(res_dir="./outputs/predict-intent/predictions.json", ev...
FILE: examples/multi-task/evaluate_slot.py
function load_label_map (line 6) | def load_label_map(map_dir="./data/atis/atis_slot/label_map.json"):
function cal_chunk (line 14) | def cal_chunk(pred_label, refer_label):
function res_evaluate (line 41) | def res_evaluate(res_dir="./outputs/predict-slot/predictions.json", data...
FILE: examples/predict/download.py
function download (line 13) | def download(src, url):
FILE: examples/predict/evaluate.py
function accuracy (line 6) | def accuracy(preds, labels):
function pre_recall_f1 (line 11) | def pre_recall_f1(preds, labels):
function res_evaluate (line 26) | def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phas...
FILE: examples/tagging/download.py
function download (line 13) | def download(src, url):
FILE: examples/tagging/evaluate.py
function load_label_map (line 6) | def load_label_map(map_dir="./data/label_map.json"):
function cal_chunk (line 14) | def cal_chunk(pred_label, refer_label):
function res_evaluate (line 41) | def res_evaluate(res_dir="./outputs/predict/predictions.json", data_dir=...
FILE: examples/train_with_eval/download.py
function download (line 13) | def download(src, url):
FILE: examples/train_with_eval/evaluate.py
function accuracy (line 6) | def accuracy(preds, labels):
function pre_recall_f1 (line 11) | def pre_recall_f1(preds, labels):
function res_evaluate (line 26) | def res_evaluate(res_dir="./outputs/predict/predictions.json", eval_phas...
FILE: paddlepalm/_downloader.py
function _download (line 61) | def _download(item, scope, path, silent=False, convert=False):
function _convert (line 111) | def _convert(path, silent=False):
function download (line 131) | def download(item, scope='all', path='.'):
function _ls (line 154) | def _ls(item, scope, l = 10):
function ls (line 164) | def ls(item='all', scope='all'):
FILE: paddlepalm/backbone/base_backbone.py
class Backbone (line 17) | class Backbone(object):
method __init__ (line 20) | def __init__(self, phase):
method inputs_attr (line 30) | def inputs_attr(self):
method outputs_attr (line 45) | def outputs_attr(self):
method build (line 57) | def build(self, inputs):
FILE: paddlepalm/backbone/bert.py
class BERT (line 29) | class BERT(Backbone):
method __init__ (line 32) | def __init__(self, hidden_size, num_hidden_layers, num_attention_heads...
method from_config (line 59) | def from_config(self, config, phase='train'):
method inputs_attr (line 98) | def inputs_attr(self):
method outputs_attr (line 113) | def outputs_attr(self):
method build (line 126) | def build(self, inputs, scope_name=""):
method postprocess (line 238) | def postprocess(self, rt_outputs):
class Model (line 242) | class Model(BERT):
method __init__ (line 244) | def __init__(self, config, phase):
FILE: paddlepalm/backbone/ernie.py
class ERNIE (line 30) | class ERNIE(Backbone):
method __init__ (line 32) | def __init__(self, hidden_size, num_hidden_layers, num_attention_heads...
method from_config (line 63) | def from_config(cls, config, phase='train'):
method inputs_attr (line 107) | def inputs_attr(self):
method outputs_attr (line 124) | def outputs_attr(self):
method build (line 137) | def build(self, inputs, scope_name=""):
method postprocess (line 260) | def postprocess(self, rt_outputs):
class Model (line 265) | class Model(ERNIE):
method __init__ (line 267) | def __init__(self, config, phase):
FILE: paddlepalm/backbone/utils/transformer.py
function layer_norm (line 28) | def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias...
function multi_head_attention (line 56) | def multi_head_attention(queries,
function positionwise_feed_forward (line 192) | def positionwise_feed_forward(x,
function pre_post_process_layer (line 227) | def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
function encoder_layer (line 267) | def encoder_layer(enc_input,
function encoder (line 330) | def encoder(enc_input,
FILE: paddlepalm/distribute/reader.py
function yield_pieces (line 11) | def yield_pieces(data, distribute_strategy, batch_size):
function data_feeder (line 63) | def data_feeder(reader, postprocess_fn=None, prefetch_steps=2, phase='tr...
function decode_fake (line 118) | def decode_fake(nums, mask, bs):
FILE: paddlepalm/head/base_head.py
class Head (line 20) | class Head(object):
method __init__ (line 22) | def __init__(self, phase='train'):
method inputs_attrs (line 34) | def inputs_attrs(self):
method outputs_attr (line 47) | def outputs_attr(self):
method epoch_inputs_attrs (line 62) | def epoch_inputs_attrs(self):
method build (line 76) | def build(self, inputs, scope_name=""):
method batch_postprocess (line 88) | def batch_postprocess(self, rt_outputs):
method reset (line 106) | def reset(self):
method get_results (line 110) | def get_results(self):
method epoch_postprocess (line 114) | def epoch_postprocess(self, post_inputs=None, output_dir=None):
FILE: paddlepalm/head/cls.py
class Classify (line 24) | class Classify(Head):
method __init__ (line 28) | def __init__(self, num_classes, input_dim, dropout_prob=0.0, \
method inputs_attrs (line 43) | def inputs_attrs(self):
method outputs_attrs (line 51) | def outputs_attrs(self):
method build (line 59) | def build(self, inputs, scope_name=''):
method batch_postprocess (line 86) | def batch_postprocess(self, rt_outputs):
method epoch_postprocess (line 94) | def epoch_postprocess(self, post_inputs, output_dir=None):
FILE: paddlepalm/head/match.py
function computeHingeLoss (line 25) | def computeHingeLoss(pos, neg, margin):
class Match (line 36) | class Match(Head):
method __init__ (line 41) | def __init__(self, num_classes, input_dim, dropout_prob=0.0, param_ini...
method inputs_attrs (line 67) | def inputs_attrs(self):
method outputs_attrs (line 79) | def outputs_attrs(self):
method build (line 89) | def build(self, inputs, scope_name=""):
method batch_postprocess (line 162) | def batch_postprocess(self, rt_outputs):
method reset (line 172) | def reset(self):
method epoch_postprocess (line 176) | def epoch_postprocess(self, post_inputs, output_dir=None):
FILE: paddlepalm/head/mlm.py
class MaskLM (line 23) | class MaskLM(Head):
method __init__ (line 27) | def __init__(self, input_dim, vocab_size, hidden_act, dropout_prob=0.0, \
method inputs_attrs (line 40) | def inputs_attrs(self):
method outputs_attrs (line 53) | def outputs_attrs(self):
method build (line 59) | def build(self, inputs, scope_name=""):
method batch_postprocess (line 121) | def batch_postprocess(self, rt_outputs):
method epoch_postprocess (line 128) | def epoch_postprocess(self, post_inputs, output_dir=None):
FILE: paddlepalm/head/mrc.py
class MRC (line 30) | class MRC(Head):
method __init__ (line 35) | def __init__(self, max_query_len, input_dim, pred_output_path=None, ve...
method inputs_attrs (line 54) | def inputs_attrs(self):
method epoch_inputs_attrs (line 65) | def epoch_inputs_attrs(self):
method outputs_attr (line 71) | def outputs_attr(self):
method build (line 80) | def build(self, inputs, scope_name=""):
method batch_postprocess (line 131) | def batch_postprocess(self, rt_outputs):
method epoch_postprocess (line 153) | def epoch_postprocess(self, post_inputs, output_dir=None):
function _write_predictions (line 174) | def _write_predictions(all_examples, all_features, all_results, n_best_s...
function _get_final_text (line 377) | def _get_final_text(pred_text, orig_text, do_lower_case, verbose):
function _get_best_indexes (line 472) | def _get_best_indexes(logits, n_best_size):
function _compute_softmax (line 485) | def _compute_softmax(scores):
FILE: paddlepalm/head/ner.py
class SequenceLabel (line 23) | class SequenceLabel(Head):
method __init__ (line 27) | def __init__(self, num_classes, input_dim, dropout_prob=0.0, learning_...
method inputs_attrs (line 50) | def inputs_attrs(self):
method outputs_attrs (line 59) | def outputs_attrs(self):
method build (line 65) | def build(self, inputs, scope_name=''):
method batch_postprocess (line 112) | def batch_postprocess(self, rt_outputs):
method epoch_postprocess (line 118) | def epoch_postprocess(self, post_inputs, output_dir=None):
FILE: paddlepalm/lr_sched/base_schedualer.py
class Schedualer (line 2) | class Schedualer():
method __init__ (line 4) | def __init__(self):
method _set_prog (line 7) | def _set_prog(self, prog):
method _build (line 10) | def _build(self, learning_rate):
FILE: paddlepalm/lr_sched/slanted_triangular_schedualer.py
class TriangularSchedualer (line 4) | class TriangularSchedualer(Schedualer):
method __init__ (line 8) | def __init__(self, warmup_steps, num_train_steps):
method _build (line 22) | def _build(self, learning_rate):
FILE: paddlepalm/lr_sched/warmup_schedualer.py
function WarmupSchedualer (line 5) | def WarmupSchedualer(Schedualer):
FILE: paddlepalm/multihead_trainer.py
class MultiHeadTrainer (line 15) | class MultiHeadTrainer(Trainer):
method __init__ (line 20) | def __init__(self, trainers):
method build_forward (line 58) | def build_forward(self):
method build_predict_forward (line 100) | def build_predict_forward(self):
method merge_inference_readers (line 132) | def merge_inference_readers(self, readers):
method fit_readers_with_mixratio (line 192) | def fit_readers_with_mixratio(self, readers, sampling_reference, num_e...
method _check_finish (line 280) | def _check_finish(self, task_name, silent=False):
method train (line 290) | def train(self, print_steps=5):
method train_one_step (line 332) | def train_one_step(self, batch):
method predict_one_batch (line 355) | def predict_one_batch(self, task_name, batch):
method predict (line 366) | def predict(self, output_dir=None, print_steps=1000):
method overall_train_steps (line 372) | def overall_train_steps(self):
FILE: paddlepalm/optimizer/adam.py
class Adam (line 25) | class Adam(Optimizer):
method __init__ (line 27) | def __init__(self, loss_var, lr, lr_schedualer=None):
method _build (line 35) | def _build(self, grad_clip=None):
method get_cur_learning_rate (line 52) | def get_cur_learning_rate(self):
FILE: paddlepalm/optimizer/base_optimizer.py
class Optimizer (line 2) | class Optimizer(object):
method __init__ (line 4) | def __init__(self, loss_var, lr, lr_schedualer=None):
method _build (line 8) | def _build(self, grad_clip=None):
method _set_prog (line 11) | def _set_prog(self, prog, init_prog):
method get_cur_learning_rate (line 17) | def get_cur_learning_rate(self):
FILE: paddlepalm/reader/base_reader.py
class Reader (line 17) | class Reader(object):
method __init__ (line 20) | def __init__(self, phase='train'):
method create_register (line 34) | def create_register(self):
method clone (line 37) | def clone(self, phase='train'):
method require_attr (line 46) | def require_attr(self, attr_name):
method register_with (line 54) | def register_with(self, backbone):
method get_registered_backbone (line 64) | def get_registered_backbone(self):
method _get_registed_attrs (line 68) | def _get_registed_attrs(self, attrs):
method load_data (line 76) | def load_data(self, input_file, batch_size, num_epochs=None, \
method outputs_attr (line 92) | def outputs_attr(self):
method _iterator (line 107) | def _iterator(self):
method get_epoch_outputs (line 114) | def get_epoch_outputs(self):
method num_examples (line 119) | def num_examples(self):
method num_epochs (line 125) | def num_epochs(self):
FILE: paddlepalm/reader/cls.py
class ClassifyReader (line 20) | class ClassifyReader(Reader):
method __init__ (line 39) | def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \
method outputs_attr (line 82) | def outputs_attr(self):
method load_data (line 94) | def load_data(self, input_file, batch_size, num_epochs=None, \
method _iterator (line 113) | def _iterator(self):
method get_epoch_outputs (line 125) | def get_epoch_outputs(self):
method num_examples (line 130) | def num_examples(self):
method num_epochs (line 134) | def num_epochs(self):
FILE: paddlepalm/reader/match.py
class MatchReader (line 20) | class MatchReader(Reader):
method __init__ (line 47) | def __init__(self, vocab_path, max_len, tokenizer='wordpiece', lang='e...
method outputs_attr (line 100) | def outputs_attr(self):
method load_data (line 116) | def load_data(self, input_file, batch_size, num_epochs=None, \
method _iterator (line 135) | def _iterator(self):
method num_examples (line 154) | def num_examples(self):
method num_epochs (line 158) | def num_epochs(self):
FILE: paddlepalm/reader/mlm.py
class MaskLMReader (line 20) | class MaskLMReader(Reader):
method __init__ (line 22) | def __init__(self, vocab_path, max_len, tokenizer='wordpiece', \
method outputs_attr (line 54) | def outputs_attr(self):
method load_data (line 67) | def load_data(self, input_file, batch_size, num_epochs=None, \
method _iterator (line 76) | def _iterator(self):
method get_epoch_outputs (line 89) | def get_epoch_outputs(self):
method num_examples (line 94) | def num_examples(self):
method num_epochs (line 98) | def num_epochs(self):
FILE: paddlepalm/reader/mrc.py
class MRCReader (line 20) | class MRCReader(Reader):
method __init__ (line 57) | def __init__(self, vocab_path, max_len, max_query_len, doc_stride, \
method outputs_attr (line 112) | def outputs_attr(self):
method epoch_outputs_attr (line 125) | def epoch_outputs_attr(self):
method load_data (line 130) | def load_data(self, input_file, batch_size, num_epochs=None, file_form...
method _iterator (line 147) | def _iterator(self):
method get_epoch_outputs (line 166) | def get_epoch_outputs(self):
method num_examples (line 172) | def num_examples(self):
method num_epochs (line 176) | def num_epochs(self):
FILE: paddlepalm/reader/seq_label.py
class SequenceLabelReader (line 19) | class SequenceLabelReader(Reader):
method __init__ (line 24) | def __init__(self, vocab_path, max_len, label_map_config, tokenizer='w...
method outputs_attr (line 58) | def outputs_attr(self):
method load_data (line 69) | def load_data(self, input_file, batch_size, num_epochs=None, \
method _iterator (line 88) | def _iterator(self):
method get_epoch_outputs (line 100) | def get_epoch_outputs(self):
method num_examples (line 105) | def num_examples(self):
method num_epochs (line 109) | def num_epochs(self):
FILE: paddlepalm/reader/utils/batching4bert.py
function mask (line 22) | def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3):
function prepare_batch_data (line 75) | def prepare_batch_data(insts,
function pad_batch_data (line 138) | def pad_batch_data(insts,
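The `pad_batch_data` helper listed above is the workhorse of batching: it pads variable-length token-id lists to the batch maximum. A minimal sketch of the idea (plain lists instead of the numpy arrays the real function returns, and without its position-id and other `return_*` switches):

```python
def pad_batch_data(insts, pad_idx=0, return_input_mask=False):
    """Pad each instance (a list of token ids) to the longest instance in
    the batch. Sketch of the core of
    paddlepalm/reader/utils/batching4bert.py:pad_batch_data."""
    max_len = max(len(inst) for inst in insts)
    padded = [list(inst) + [pad_idx] * (max_len - len(inst)) for inst in insts]
    if return_input_mask:
        # 1.0 over real tokens, 0.0 over padding positions.
        mask = [[1.0] * len(inst) + [0.0] * (max_len - len(inst))
                for inst in insts]
        return padded, mask
    return padded
```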
FILE: paddlepalm/reader/utils/batching4ernie.py
function mask (line 26) | def mask(batch_tokens,
function pad_batch_data (line 121) | def pad_batch_data(insts,
FILE: paddlepalm/reader/utils/mlm_batching.py
function mask (line 22) | def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3...
function prepare_batch_data (line 94) | def prepare_batch_data(insts,
function pad_batch_data (line 152) | def pad_batch_data(insts,
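The `mask` function above implements masked-LM corruption. A per-instance sketch of the standard BERT recipe it follows (the real function works on whole batches and tracks `total_token_num`; the 80/10/10 split is the usual convention and should be checked against the source):

```python
import random

def mask_tokens(tokens, vocab_size, CLS=1, SEP=2, MASK=3, rate=0.15, rng=None):
    """Sketch of the corruption in
    paddlepalm/reader/utils/mlm_batching.py:mask, simplified to one
    instance. Each non-special token is selected with probability `rate`;
    a selected token becomes [MASK] 80% of the time, a random id 10% of
    the time, and stays unchanged 10% of the time."""
    rng = rng or random.Random()
    out, positions, labels = list(tokens), [], []
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP):
            continue  # never corrupt special tokens
        if rng.random() < rate:
            positions.append(i)
            labels.append(tok)  # the prediction target is the original id
            p = rng.random()
            if p < 0.8:
                out[i] = MASK
            elif p < 0.9:
                out[i] = rng.randrange(vocab_size)
            # else: keep the original token
    return out, positions, labels
```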
FILE: paddlepalm/reader/utils/mrqa_helper.py
class MRQAExample (line 16) | class MRQAExample(object):
method __init__ (line 22) | def __init__(self,
method __str__ (line 38) | def __str__(self):
method __repr__ (line 41) | def __repr__(self):
class MRQAFeature (line 56) | class MRQAFeature(object):
method __init__ (line 59) | def __init__(self,
FILE: paddlepalm/reader/utils/reader4ernie.py
function csv_reader (line 52) | def csv_reader(fd, delimiter='\t'):
class Reader (line 59) | class Reader(object):
method __init__ (line 60) | def __init__(self,
method get_train_progress (line 106) | def get_train_progress(self):
method _read_tsv (line 110) | def _read_tsv(self, input_file, quotechar=None):
method _truncate_seq_pair (line 123) | def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
method _convert_example_to_record (line 140) | def _convert_example_to_record(self, example, max_seq_length, tokenizer):
method _prepare_batch_data (line 278) | def _prepare_batch_data(self, examples, batch_size, phase='train'):
method get_num_examples (line 308) | def get_num_examples(self, input_file=None, phase='train'):
method data_generator (line 316) | def data_generator(self,
class MaskLMReader (line 357) | class MaskLMReader(Reader):
method _convert_example_to_record (line 359) | def _convert_example_to_record(self, example, max_seq_length, tokenizer):
method batch_reader (line 426) | def batch_reader(self, examples, batch_size, in_tokens, phase):
method data_generator (line 445) | def data_generator(self,
class ClassifyReader (line 497) | class ClassifyReader(Reader):
method _read_tsv (line 498) | def _read_tsv(self, input_file, quotechar=None):
method _pad_batch_records (line 519) | def _pad_batch_records(self, batch_records):
class SequenceLabelReader (line 580) | class SequenceLabelReader(Reader):
method _pad_batch_records (line 581) | def _pad_batch_records(self, batch_records):
method _reseg_token_label (line 608) | def _reseg_token_label(self, tokens, labels, tokenizer):
method _convert_example_to_record (line 626) | def _convert_example_to_record(self, example, max_seq_length, tokenizer):
class ExtractEmbeddingReader (line 658) | class ExtractEmbeddingReader(Reader):
method _pad_batch_records (line 659) | def _pad_batch_records(self, batch_records):
class MRCReader (line 685) | class MRCReader(Reader):
method __init__ (line 686) | def __init__(self,
method _read_json (line 733) | def _read_json(self, input_file, is_training):
method _improve_answer_span (line 789) | def _improve_answer_span(self, doc_tokens, input_start, input_end,
method _check_is_max_context (line 801) | def _check_is_max_context(self, doc_spans, cur_span_index, position):
method _convert_example_to_feature (line 820) | def _convert_example_to_feature(self, examples, max_seq_length, tokeni...
method _prepare_batch_data (line 933) | def _prepare_batch_data(self, records, batch_size, phase=None):
method _pad_batch_records (line 966) | def _pad_batch_records(self, batch_records, is_training):
method get_num_examples (line 1010) | def get_num_examples(self, phase):
method get_features (line 1013) | def get_features(self, phase):
method get_examples (line 1016) | def get_examples(self, phase):
method data_generator (line 1019) | def data_generator(self,
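`MRCReader` fits passages longer than `max_seq_length` by slicing them into overlapping windows that advance by `doc_stride` (which is why `_check_is_max_context` is needed: a token can appear in several windows). A sketch of that slicing, using hypothetical names for illustration:

```python
def make_doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Sketch of the sliding-window slicing used by MRCReader
    (paddlepalm/reader/utils/reader4ernie.py): emit overlapping
    (start, length) spans until the document is covered."""
    spans, start = [], 0
    while start < num_doc_tokens:
        length = min(max_tokens_for_doc, num_doc_tokens - start)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans
```

Each span later becomes one feature; at prediction time the span giving a token the most surrounding context is the one whose answer logits are trusted for that token.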
FILE: paddlepalm/tokenizer/bert_tokenizer.py
function convert_to_unicode (line 26) | def convert_to_unicode(text):
function printable_text (line 46) | def printable_text(text):
function load_vocab (line 69) | def load_vocab(vocab_file):
function convert_by_vocab (line 84) | def convert_by_vocab(vocab, items):
function convert_tokens_to_ids (line 92) | def convert_tokens_to_ids(vocab, tokens):
function convert_ids_to_tokens (line 96) | def convert_ids_to_tokens(inv_vocab, ids):
function whitespace_tokenize (line 100) | def whitespace_tokenize(text):
class FullTokenizer (line 109) | class FullTokenizer(object):
method __init__ (line 112) | def __init__(self, vocab_file, do_lower_case=True):
method tokenize (line 118) | def tokenize(self, text):
method convert_tokens_to_ids (line 126) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 129) | def convert_ids_to_tokens(self, ids):
class CharTokenizer (line 133) | class CharTokenizer(object):
method __init__ (line 136) | def __init__(self, vocab_file, do_lower_case=True):
method tokenize (line 141) | def tokenize(self, text):
method convert_tokens_to_ids (line 149) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 152) | def convert_ids_to_tokens(self, ids):
class BasicTokenizer (line 156) | class BasicTokenizer(object):
method __init__ (line 159) | def __init__(self, do_lower_case=True):
method tokenize (line 168) | def tokenize(self, text):
method _run_strip_accents (line 195) | def _run_strip_accents(self, text):
method _run_split_on_punc (line 206) | def _run_split_on_punc(self, text):
method _tokenize_chinese_chars (line 226) | def _tokenize_chinese_chars(self, text):
method _is_chinese_char (line 239) | def _is_chinese_char(self, cp):
method _clean_text (line 261) | def _clean_text(self, text):
class WordpieceTokenizer (line 275) | class WordpieceTokenizer(object):
method __init__ (line 278) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
method tokenize (line 283) | def tokenize(self, text):
function _is_whitespace (line 337) | def _is_whitespace(char):
function _is_control (line 349) | def _is_control(char):
function _is_punctuation (line 361) | def _is_punctuation(char):
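The core of `WordpieceTokenizer.tokenize` above is greedy longest-match-first subword segmentation. A sketch simplified to a single whitespace-free word (the real method first whitespace-tokenizes its input and caps very long words via `max_input_chars_per_word`):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece, the algorithm behind
    WordpieceTokenizer.tokenize in paddlepalm/tokenizer/bert_tokenizer.py."""
    if len(word) > max_chars:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece  # longest vocab entry starting at `start`
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no prefix matched: whole word is unknown
        tokens.append(cur)
        start = end
    return tokens
```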
FILE: paddlepalm/tokenizer/ernie_tokenizer.py
function convert_to_unicode (line 30) | def convert_to_unicode(text):
function printable_text (line 50) | def printable_text(text):
function load_vocab (line 73) | def load_vocab(vocab_file):
function convert_by_vocab (line 88) | def convert_by_vocab(vocab, items):
function convert_tokens_to_ids (line 96) | def convert_tokens_to_ids(vocab, tokens):
function convert_ids_to_tokens (line 100) | def convert_ids_to_tokens(inv_vocab, ids):
function whitespace_tokenize (line 104) | def whitespace_tokenize(text):
class FullTokenizer (line 113) | class FullTokenizer(object):
method __init__ (line 116) | def __init__(self, vocab_file, do_lower_case=True):
method tokenize (line 122) | def tokenize(self, text):
method convert_tokens_to_ids (line 130) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 133) | def convert_ids_to_tokens(self, ids):
class CharTokenizer (line 137) | class CharTokenizer(object):
method __init__ (line 140) | def __init__(self, vocab_file, do_lower_case=True):
method tokenize (line 145) | def tokenize(self, text):
method convert_tokens_to_ids (line 153) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 156) | def convert_ids_to_tokens(self, ids):
class BasicTokenizer (line 160) | class BasicTokenizer(object):
method __init__ (line 163) | def __init__(self, do_lower_case=True):
method tokenize (line 172) | def tokenize(self, text):
method _run_strip_accents (line 199) | def _run_strip_accents(self, text):
method _run_split_on_punc (line 210) | def _run_split_on_punc(self, text):
method _tokenize_chinese_chars (line 230) | def _tokenize_chinese_chars(self, text):
method _is_chinese_char (line 243) | def _is_chinese_char(self, cp):
method _clean_text (line 265) | def _clean_text(self, text):
class WordpieceTokenizer (line 279) | class WordpieceTokenizer(object):
method __init__ (line 282) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
method tokenize (line 287) | def tokenize(self, text):
function _is_whitespace (line 341) | def _is_whitespace(char):
function _is_control (line 353) | def _is_control(char):
function _is_punctuation (line 365) | def _is_punctuation(char):
function tokenize_chinese_chars (line 381) | def tokenize_chinese_chars(text):
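The module-level `tokenize_chinese_chars` above (unique to the ERNIE tokenizer) prepares Chinese text for whitespace splitting. A sketch covering only the main CJK Unified Ideographs blocks (the real `_is_chinese_char` checks several more Unicode ranges):

```python
def _is_cjk(cp):
    """Main CJK Unified Ideographs blocks; a subset of the ranges the
    real BasicTokenizer._is_chinese_char covers."""
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF

def tokenize_chinese_chars(text):
    """Sketch of ernie_tokenizer.tokenize_chinese_chars: surround every
    CJK character with spaces so each ideograph becomes its own token
    under the downstream whitespace split."""
    out = []
    for ch in text:
        if _is_cjk(ord(ch)):
            out.extend([' ', ch, ' '])
        else:
            out.append(ch)
    return ''.join(out)
```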
FILE: paddlepalm/trainer.py
class Trainer (line 31) | class Trainer(object):
method __init__ (line 36) | def __init__(self, name, mix_ratio=1.0, reuse_head_with=None):
method build_forward (line 104) | def build_forward(self, backbone, task_head):
method build_predict_forward (line 198) | def build_predict_forward(self, pred_backbone, pred_head):
method build_backward (line 269) | def build_backward(self, optimizer, weight_decay=None, use_ema=False, ...
method set_as_aux (line 318) | def set_as_aux(self):
method fit_reader (line 322) | def fit_reader(self, reader, phase='train'):
method load_ckpt (line 395) | def load_ckpt(self, model_path):
method load_predict_model (line 432) | def load_predict_model(self, model_path, convert=False):
method load_pretrain (line 448) | def load_pretrain(self, model_path, convert=False):
method set_saver (line 463) | def set_saver(self, save_path, save_steps, save_type='ckpt'):
method train (line 513) | def train(self, print_steps=5):
method predict (line 548) | def predict(self, output_dir=None, print_steps=1000):
method reset_buffer (line 590) | def reset_buffer(self):
method _check_phase (line 593) | def _check_phase(self, phase):
method _set_multitask (line 596) | def _set_multitask(self):
method _set_nomultitask (line 599) | def _set_nomultitask(self):
method _set_task_id (line 602) | def _set_task_id(self, task_id):
method _init_exe_prog (line 605) | def _init_exe_prog(self, for_train=True):
method get_one_batch (line 628) | def get_one_batch(self, phase='train'):
method _set_exe (line 637) | def _set_exe(self, exe):
method _set_dist_train (line 640) | def _set_dist_train(self, prog):
method _set_dist_pred (line 643) | def _set_dist_pred(self, prog):
method _set_fetch_list (line 646) | def _set_fetch_list(self, fetch_list):
method train_one_step (line 649) | def train_one_step(self, batch):
method predict_one_batch (line 676) | def predict_one_batch(self, batch):
method name (line 691) | def name(self):
method num_examples (line 695) | def num_examples(self):
method mix_ratio (line 699) | def mix_ratio(self):
method mix_ratio (line 703) | def mix_ratio(self, value):
method num_epochs (line 707) | def num_epochs(self):
method cur_train_step (line 711) | def cur_train_step(self):
method cur_train_epoch (line 715) | def cur_train_epoch(self):
method steps_pur_epoch (line 719) | def steps_pur_epoch(self):
method _build_head (line 722) | def _build_head(self, net_inputs, phase, scope=""):
method _save (line 730) | def _save(self, save_path, suffix=None):
method _load (line 753) | def _load(self, infer_model_path=None):
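`Trainer.mix_ratio` and `set_as_aux` hint at how multiple trainers share one training loop. One plausible scheduling policy, sketched in pure Python (the actual policy lives in `multihead_trainer.py` and may differ; `sample_task` is a hypothetical name):

```python
import random

def sample_task(trainer_names, mix_ratios, rng=None):
    """Pick the next task with probability proportional to its mix_ratio
    (cf. Trainer.mix_ratio). A hypothetical sketch of the multi-task
    sampling idea, not the exact PALM implementation."""
    rng = rng or random.Random()
    total = sum(mix_ratios)
    r = rng.random() * total  # uniform point on the stacked ratio intervals
    for name, ratio in zip(trainer_names, mix_ratios):
        r -= ratio
        if r < 0:
            return name
    return trainer_names[-1]  # guard against floating-point round-off
```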
FILE: paddlepalm/utils/basic_helper.py
function get_basename (line 9) | def get_basename(f):
function get_suffix (line 13) | def get_suffix(f):
function parse_yaml (line 17) | def parse_yaml(f, asdict=True, support_cmd_line=False):
function parse_json (line 32) | def parse_json(f, asdict=True, support_cmd_line=False):
function parse_list (line 47) | def parse_list(string, astype=str):
function try_float (line 55) | def try_float(s):
function check_io (line 64) | def check_io(in_attr, out_attr, strict=False, in_name="left", out_name="...
function encode_inputs (line 74) | def encode_inputs(inputs, scope_name, sep='.', cand_set=None):
function decode_inputs (line 87) | def decode_inputs(inputs, scope_name, sep='.', keep_unk_keys=True):
function build_executor (line 99) | def build_executor(on_gpu):
function fit_attr (line 110) | def fit_attr(conf, fit_attr, strict=False):
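The paired `encode_inputs`/`decode_inputs` helpers above implement name scoping so several task heads can share one feed dict without key collisions. A sketch of that idea (the real functions also take a `cand_set` filter, omitted here):

```python
def encode_inputs(inputs, scope_name, sep='.'):
    """Prefix every key with the task scope, as in
    paddlepalm/utils/basic_helper.py:encode_inputs (simplified)."""
    return {scope_name + sep + k: v for k, v in inputs.items()}

def decode_inputs(inputs, scope_name, sep='.', keep_unk_keys=True):
    """Inverse of encode_inputs: strip the scope prefix, optionally
    keeping keys that belong to other scopes untouched."""
    prefix = scope_name + sep
    out = {}
    for k, v in inputs.items():
        if k.startswith(prefix):
            out[k[len(prefix):]] = v
        elif keep_unk_keys:
            out[k] = v
    return out
```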
FILE: paddlepalm/utils/config_helper.py
class JsonConfig (line 32) | class JsonConfig(object):
method __init__ (line 37) | def __init__(self, config_path):
method _parse (line 40) | def _parse(self, config_path):
method __getitem__ (line 51) | def __getitem__(self, key):
method asdict (line 54) | def asdict(self):
method print_config (line 57) | def print_config(self):
class ArgumentGroup (line 63) | class ArgumentGroup(object):
method __init__ (line 64) | def __init__(self, parser, title, des):
method add_arg (line 67) | def add_arg(self, name, type, default, help, **kwargs):
class ArgConfig (line 77) | class ArgConfig(object):
method __init__ (line 82) | def __init__(self):
method add_arg (line 135) | def add_arg(self, name, dtype, default, descrip):
method build_conf (line 138) | def build_conf(self):
function str2bool (line 142) | def str2bool(v):
function print_arguments (line 148) | def print_arguments(args, log=None):
class PDConfig (line 161) | class PDConfig(object):
method __init__ (line 167) | def __init__(self, json_file=None, yaml_file=None, fuse_args=True):
method load_json (line 202) | def load_json(self, file_path, fuse_args=True):
method load_yaml (line 226) | def load_yaml(self, file_path_list, fuse_args=True):
method build (line 251) | def build(self):
method asdict (line 255) | def asdict(self):
method __add__ (line 258) | def __add__(self, new_arg):
method __getattr__ (line 273) | def __getattr__(self, name):
method Print (line 285) | def Print(self):
FILE: paddlepalm/utils/print_helper.py
function print_dict (line 17) | def print_dict(dic, title=""):
FILE: paddlepalm/utils/reader_helper.py
function create_feed_batch_process_fn (line 29) | def create_feed_batch_process_fn(net_inputs):
function check_io (line 69) | def check_io(in_attr, out_attr, strict=False, in_name="left", out_name="...
function _check_and_adapt_shape_dtype (line 79) | def _check_and_adapt_shape_dtype(rt_val, attr, message=""):
function _zero_batch (line 98) | def _zero_batch(attrs):
function _zero_batch_x (line 107) | def _zero_batch_x(attrs, batch_size):
function create_net_inputs (line 120) | def create_net_inputs(input_attrs, is_async=False, iterator_fn=None, dev...
function create_iterator_fn (line 137) | def create_iterator_fn(iterator, iterator_prefix, shape_and_dtypes, outn...
function create_multihead_inference_fn (line 169) | def create_multihead_inference_fn(iterators, iterator_prefixes, joint_sh...
function create_multihead_iterator_fn (line 202) | def create_multihead_iterator_fn(iterators, iterator_prefixes, joint_sha...
function create_joint_iterator_fn (line 238) | def create_joint_iterator_fn(iterators, iterator_prefixes, joint_shape_a...
function merge_input_attrs (line 340) | def merge_input_attrs(backbone_attr, task_attrs, insert_taskid=True, ins...
FILE: paddlepalm/utils/saver.py
function init_checkpoint (line 28) | def init_checkpoint(exe, init_checkpoint_path, main_program, skip_list =...
function init_pretraining_params (line 47) | def init_pretraining_params(exe,
FILE: paddlepalm/utils/textprocess_helper.py
function is_whitespace (line 16) | def is_whitespace(c):
Condensed preview — 98 files, each entry showing path, character count, and a content snippet.
[
{
"path": ".gitignore",
"chars": 148,
"preview": "*.pyc\npaddlepalm.egg-info\ndata\n__pycache__\n*egg-info\npretrain_model\npretrain\noutput*\noutput_model\nbuild\ndist\npaddle_palm"
},
{
"path": "README.md",
"chars": 13650,
"preview": "# PaddlePALM\n\nEnglish | [简体中文](./README_zh.md)\n\nPaddlePALM (PArallel Learning from Multi-tasks) is a fast, flexible, ext"
},
{
"path": "README_zh.md",
"chars": 7900,
"preview": "# PaddlePALM\n\n[English](./README.md) | 简体中文\n\nPaddlePALM (PArallel Learning from Multi-tasks) 是一个灵活,通用且易于使用的NLP大规模预训练和多任务"
},
{
"path": "customization_cn.md",
"chars": 12536,
"preview": "# PALM组件定制化教程\n\nPALM支持对如下组件自定义:\n\n- **head**\n 定义一个新的任务输出头,接收来自backbone和reader的输入,输出训练阶段的loss和预测阶段的预测结果。例如:分类任务头,序列标注任务头,机"
},
{
"path": "examples/classification/README.md",
"chars": 3017,
"preview": "## Example 1: Classification\nThis task is a sentiment analysis task. The following sections detail model preparation, da"
},
{
"path": "examples/classification/download.py",
"chars": 1334,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/classification/evaluate.py",
"chars": 1719,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n preds = np.array(preds)\n l"
},
{
"path": "examples/classification/run.py",
"chars": 3601,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n # configs\n max_seqlen = 256\n "
},
{
"path": "examples/matching/README.md",
"chars": 3522,
"preview": "## Example 2: Matching\nThis task is a sentence pair matching task. The following sections detail model preparation, data"
},
{
"path": "examples/matching/download.py",
"chars": 935,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport sys\nimport urllib\nURLLIB=urllib\nif sys.v"
},
{
"path": "examples/matching/evaluate.py",
"chars": 1763,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n preds = np.array(preds)\n l"
},
{
"path": "examples/matching/process.py",
"chars": 1089,
"preview": "# -*- coding: utf-8 -*-\n\nimport sys\nimport os\n\nif len(sys.argv) != 4:\n exit(0)\n\ndata_dir = sys.argv[1]\nif not os.pat"
},
{
"path": "examples/matching/run.py",
"chars": 3729,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n # configs \n max_seqlen = 128\n "
},
{
"path": "examples/mrc/README.md",
"chars": 3691,
"preview": "## Example 4: Machine Reading Comprehension\nThis task is a machine reading comprehension task. The following sections de"
},
{
"path": "examples/mrc/download.py",
"chars": 1327,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/mrc/evaluate.py",
"chars": 5212,
"preview": "# -*- coding: utf-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "examples/mrc/run.py",
"chars": 3690,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n # configs\n max_seqlen = 512\n "
},
{
"path": "examples/multi-task/README.md",
"chars": 9825,
"preview": "## Example 6: Joint Training of Dialogue Intent Recognition and Slot Filling\nThis example achieves the joint training of"
},
{
"path": "examples/multi-task/download.py",
"chars": 1223,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/multi-task/evaluate_intent.py",
"chars": 1746,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n preds = np.array(preds)\n l"
},
{
"path": "examples/multi-task/evaluate_slot.py",
"chars": 2993,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\n\n\ndef load_label_map(map_dir=\"./data/atis/atis_slot/label_map.json\"):\n \"\"\"\n "
},
{
"path": "examples/multi-task/joint_predict.py",
"chars": 3389,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nimport numpy as np\n\n\nif __name__ == '__main__':\n\n # configs\n "
},
{
"path": "examples/multi-task/predict_intent.py",
"chars": 1873,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nfrom paddlepalm.distribute import gpu_dev_count\n\n\nif __name__ == '_"
},
{
"path": "examples/multi-task/predict_slot.py",
"chars": 2039,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\nfrom paddlepalm.distribute import gpu_dev_count\n\n\nif __name__ == '_"
},
{
"path": "examples/multi-task/process.py",
"chars": 3381,
"preview": "import os\nimport json\n\nlabel_new = \"data/atis/atis_slot/label_map.json\"\nlabel_old = \"data/atis/atis_slot/map_tag_slot_id"
},
{
"path": "examples/multi-task/run.py",
"chars": 2929,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n # configs\n max_seqlen = 128\n "
},
{
"path": "examples/predict/README.md",
"chars": 1945,
"preview": "## Example 5: Prediction\nThis example demonstrates how to directly do prediction with PaddlePALM. You can either initial"
},
{
"path": "examples/predict/download.py",
"chars": 1334,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/predict/evaluate.py",
"chars": 1719,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n preds = np.array(preds)\n l"
},
{
"path": "examples/predict/run.py",
"chars": 1770,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n # configs\n max_seqlen = 256\n "
},
{
"path": "examples/tagging/README.md",
"chars": 3259,
"preview": "## Example 3: Tagging\nThis task is a named entity recognition task. The following sections detail model preparation, dat"
},
{
"path": "examples/tagging/download.py",
"chars": 1326,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/tagging/evaluate.py",
"chars": 2954,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\n\n\ndef load_label_map(map_dir=\"./data/label_map.json\"):\n \"\"\"\n :param map_dir:"
},
{
"path": "examples/tagging/run.py",
"chars": 3925,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n \n # configs\n max_seqlen = 256\n "
},
{
"path": "examples/train_with_eval/README.md",
"chars": 3123,
"preview": "## Train with Evaluation version of Example 1: Classification\nThis task is a sentiment analysis task. The following sect"
},
{
"path": "examples/train_with_eval/download.py",
"chars": 1334,
"preview": "# -*- coding: utf-8 -*-\nfrom __future__ import print_function\nimport os\nimport tarfile\nimport shutil\nimport sys\nimport "
},
{
"path": "examples/train_with_eval/evaluate.py",
"chars": 1719,
"preview": "# -*- coding: utf-8 -*-\n\nimport json\nimport numpy as np\n\ndef accuracy(preds, labels):\n preds = np.array(preds)\n l"
},
{
"path": "examples/train_with_eval/run.py",
"chars": 2620,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\n\nif __name__ == '__main__':\n\n # configs\n max_seqlen = 256\n "
},
{
"path": "paddlepalm/__init__.py",
"chars": 342,
"preview": "from . import downloader\n# from mtl_controller import Controller \n#import controller\nfrom . import optimizer\nfrom . impo"
},
{
"path": "paddlepalm/_downloader.py",
"chars": 7490,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/backbone/README.md",
"chars": 0,
"preview": ""
},
{
"path": "paddlepalm/backbone/__init__.py",
"chars": 50,
"preview": "\nfrom .ernie import ERNIE\nfrom .bert import BERT\n\n"
},
{
"path": "paddlepalm/backbone/base_backbone.py",
"chars": 2491,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/backbone/bert.py",
"chars": 11324,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/backbone/ernie.py",
"chars": 12658,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/backbone/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "paddlepalm/backbone/utils/transformer.py",
"chars": 13736,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/distribute/__init__.py",
"chars": 252,
"preview": "from paddle import fluid\nimport os\nimport multiprocessing\n\ngpu_dev_count = int(fluid.core.get_cuda_device_count())\ncpu_d"
},
{
"path": "paddlepalm/distribute/reader.py",
"chars": 4518,
"preview": "\nfrom . import gpu_dev_count, cpu_dev_count\ntry:\n import queue as Queue\nexcept ImportError:\n import Queue\nfrom thr"
},
{
"path": "paddlepalm/downloader.py",
"chars": 27,
"preview": "from ._downloader import *\n"
},
{
"path": "paddlepalm/head/__init__.py",
"chars": 128,
"preview": "\nfrom .cls import Classify\nfrom .match import Match\nfrom .ner import SequenceLabel\nfrom .mrc import MRC\nfrom .mlm import"
},
{
"path": "paddlepalm/head/base_head.py",
"chars": 4515,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/head/cls.py",
"chars": 4107,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/head/match.py",
"chars": 7746,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/head/mlm.py",
"chars": 5326,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/head/mrc.py",
"chars": 20639,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/head/ner.py",
"chars": 4613,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/lr_sched/__init__.py",
"chars": 114,
"preview": "\nfrom .slanted_triangular_schedualer import TriangularSchedualer\nfrom .warmup_schedualer import WarmupSchedualer\n\n"
},
{
"path": "paddlepalm/lr_sched/base_schedualer.py",
"chars": 209,
"preview": "\nclass Schedualer():\n\n def __init__(self):\n self._prog = None\n \n def _set_prog(self, prog):\n self"
},
{
"path": "paddlepalm/lr_sched/slanted_triangular_schedualer.py",
"chars": 1997,
"preview": "from paddlepalm.lr_sched.base_schedualer import Schedualer\nfrom paddle import fluid\n\nclass TriangularSchedualer(Schedual"
},
{
"path": "paddlepalm/lr_sched/warmup_schedualer.py",
"chars": 1184,
"preview": "\nfrom paddlepalm.lr_sched.base_schedualer import Schedualer\nimport paddle.fluid as fluid\n\ndef WarmupSchedualer(Scheduale"
},
{
"path": "paddlepalm/multihead_trainer.py",
"chars": 15640,
"preview": "\nfrom paddle import fluid\nfrom paddle.fluid import layers\nfrom paddlepalm.distribute import gpu_dev_count, cpu_dev_count"
},
{
"path": "paddlepalm/optimizer/__init__.py",
"chars": 24,
"preview": "\nfrom .adam import Adam\n"
},
{
"path": "paddlepalm/optimizer/adam.py",
"chars": 1829,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/optimizer/base_optimizer.py",
"chars": 475,
"preview": "\nclass Optimizer(object):\n\n def __init__(self, loss_var, lr, lr_schedualer=None):\n self._prog = None\n s"
},
{
"path": "paddlepalm/reader/__init__.py",
"chars": 164,
"preview": "\nfrom .cls import ClassifyReader\nfrom .match import MatchReader\nfrom .seq_label import SequenceLabelReader\nfrom .mrc imp"
},
{
"path": "paddlepalm/reader/base_reader.py",
"chars": 4223,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/cls.py",
"chars": 5997,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/match.py",
"chars": 7733,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/mlm.py",
"chars": 3588,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/mrc.py",
"chars": 7792,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/seq_label.py",
"chars": 4750,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "paddlepalm/reader/utils/batching4bert.py",
"chars": 7026,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/utils/batching4ernie.py",
"chars": 6584,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/utils/mlm_batching.py",
"chars": 7444,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/utils/mrqa_helper.py",
"chars": 2986,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/reader/utils/reader4ernie.py",
"chars": 43173,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/tokenizer/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "paddlepalm/tokenizer/bert_tokenizer.py",
"chars": 12735,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/tokenizer/ernie_tokenizer.py",
"chars": 14597,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/trainer.py",
"chars": 31556,
"preview": "# -*- coding: utf-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/utils/__init__.py",
"chars": 57,
"preview": "\nfrom . import basic_helper\nfrom . import config_helper\n\n"
},
{
"path": "paddlepalm/utils/basic_helper.py",
"chars": 3545,
"preview": "# coding=utf-8\nimport os\nimport json\nimport yaml\nfrom .config_helper import PDConfig\nimport logging\nfrom paddle import f"
},
{
"path": "paddlepalm/utils/config_helper.py",
"chars": 10932,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/utils/plot_helper.py",
"chars": 0,
"preview": ""
},
{
"path": "paddlepalm/utils/print_helper.py",
"chars": 1081,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/utils/reader_helper.py",
"chars": 14566,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/utils/saver.py",
"chars": 3104,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "paddlepalm/utils/textprocess_helper.py",
"chars": 772,
"preview": "# -*- coding: UTF-8 -*-\n# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.\n#\n# Licensed under the Apache "
},
{
"path": "setup.cfg",
"chars": 914,
"preview": "[metadata]\n\nname = paddlepalm\n\nauthor = zhangyiming\nauthor_email = zhangyiming04@baidu.com\n\nversion = 2.1.0\n\ndescription"
},
{
"path": "setup.py",
"chars": 3043,
"preview": "# -*- coding: UTF-8 -*-\n################################################################################\n#\n# Copyright"
},
{
"path": "test/test2/config.yaml",
"chars": 434,
"preview": "ask_instance: \"mrqa, mlm4mrqa, match4mrqa\"\ntarget_tag: 1, 0, 0\nmix_ratio: 1.0, 0.5, 0.5\n\nsave_path: \"output_model/second"
},
{
"path": "test/test2/run.py",
"chars": 2849,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n max_seqlen = 512\n batch_size = "
},
{
"path": "test/test2/run.sh",
"chars": 46,
"preview": "export CUDA_VISIBLE_DEVICES=3\npython run.py \n\n"
},
{
"path": "test/test3/config.yaml",
"chars": 426,
"preview": "task_instance: \"cls1, cls2, cls3, cls4, cls5, cls6\"\n\ntask_reuse_tag: 0,0,1,1,0,2\n\nsave_path: \"output_model/thirdrun\"\n\nba"
},
{
"path": "test/test3/run.py",
"chars": 5594,
"preview": "# coding=utf-8\nimport paddlepalm as palm\nimport json\n\nif __name__ == '__main__':\n\n max_seqlen = 512\n batch_size = "
},
{
"path": "test/test3/run.sh",
"chars": 46,
"preview": "export CUDA_VISIBLE_DEVICES=3\n\npython run.py\n\n"
}
]