Repository: brightmart/albert_zh
Branch: master
Commit: 52149e82faf3
Files: 40
Total size: 620.3 KB
Directory structure:
gitextract_fwt0rbxl/
├── README.md
├── albert_config/
│ ├── albert_config_base.json
│ ├── albert_config_base_google_fast.json
│ ├── albert_config_large.json
│ ├── albert_config_small_google.json
│ ├── albert_config_tiny.json
│ ├── albert_config_tiny_google.json
│ ├── albert_config_tiny_google_fast.json
│ ├── albert_config_xlarge.json
│ ├── albert_config_xxlarge.json
│ ├── bert_config.json
│ └── vocab.txt
├── args.py
├── bert_utils.py
├── classifier_utils.py
├── create_pretrain_data.sh
├── create_pretraining_data.py
├── create_pretraining_data_google.py
├── data/
│ └── news_zh_1.txt
├── lamb_optimizer_google.py
├── modeling.py
├── modeling_google.py
├── modeling_google_fast.py
├── optimization.py
├── optimization_finetuning.py
├── optimization_google.py
├── resources/
│ ├── create_pretraining_data_roberta.py
│ └── shell_scripts/
│ └── create_pretrain_data_batch_webtext.sh
├── run_classifier.py
├── run_classifier_clue.py
├── run_classifier_clue.sh
├── run_classifier_lcqmc.sh
├── run_classifier_sp_google.py
├── run_pretraining.py
├── run_pretraining_google.py
├── run_pretraining_google_fast.py
├── similarity.py
├── test_changes.py
├── tokenization.py
└── tokenization_google.py
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
# albert_zh
An Implementation of A Lite Bert For Self-Supervised Learning Language Representations with TensorFlow
ALBert is based on Bert, but with some improvements. It achieves state of the art performance on main benchmarks with 30% parameters less.
For albert_base_zh it only has ten percentage parameters compare of original bert model, and main accuracy is retained.
Different version of ALBERT pre-trained model for Chinese, including TensorFlow, PyTorch and Keras, is available now.
海量中文语料上预训练ALBERT模型:参数更少,效果更好。预训练小模型也能拿下13项NLP任务,ALBERT三大改造登顶GLUE基准
clueai工具包: 三行代码,三分钟定制一个NLP的API(零样本学习)
一键运行10个数据集、9个基线模型、不同任务上模型效果的详细对比,见CLUE benchmark
一键运行CLUE中文任务:6个中文分类或句子对任务(新)
---------------------------------------------------------------------
使用方式:
1、克隆项目
git clone https://github.com/brightmart/albert_zh.git
2、运行一键运行脚本(GPU方式): 会自动下载模型和所有任务数据并开始运行。
bash run_classifier_clue.sh
执行该一键运行脚本将会自动下载所有任务数据,并为所有任务找到最优模型,然后测试得到提交结果
模型下载 Download Pre-trained Models of Chinese
-----------------------------------------------
1、albert_tiny_zh, albert_tiny_zh(训练更久,累积学习20亿个样本),文件大小16M、参数为4M
训练和推理预测速度提升约10倍,精度基本保留,模型大小为bert的1/25;语义相似度数据集LCQMC测试集上达到85.4%,相比bert_base仅下降1.5个点。
lcqmc训练使用如下参数: --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5
albert_tiny使用同样的大规模中文语料数据,层数仅为4层、hidden size等向量维度大幅减少; 尝试使用如下学习率来获得更好效果:{2e-5, 6e-5, 1e-4}
【使用场景】任务相对比较简单一些或实时性要求高的任务,如语义相似度等句子对任务、分类任务;比较难的任务如阅读理解等,可以使用其他大模型。
例如,可以使用[Tensorflow Lite](https://www.tensorflow.org/lite)在移动端进行部署,本文[随后](#use_tflite)针对这一点进行了介绍,包括如何把模型转换成Tensorflow Lite格式和对其进行性能测试等。
一键运行albert_tiny_zh(linux,lcqmc任务):
1) git clone https://github.com/brightmart/albert_zh
2) cd albert_zh
3) bash run_classifier_lcqmc.sh
1.1、albert_tiny_google_zh(累积学习10亿个样本,google版本),模型大小16M、性能与albert_tiny_zh一致
1.2、albert_small_google_zh(累积学习10亿个样本,google版本),
速度比bert_base快4倍;LCQMC测试集上比Bert下降仅0.9个点;去掉adam后模型大小18.5M;使用方法,见 #下游任务 Fine-tuning on Downstream Task
2、albert_large_zh,参数量,层数24,文件大小为64M
参数量和模型大小为bert_base的六分之一;在口语化描述相似性数据集LCQMC的测试集上相比bert_base上升0.2个点
3、albert_base_zh(额外训练了1.5亿个实例即 36k steps * batch_size 4096); albert_base_zh(小模型体验版), 参数量12M, 层数12,大小为40M
参数量为bert_base的十分之一,模型大小也十分之一;在口语化描述相似性数据集LCQMC的测试集上相比bert_base下降约0.6~1个点;
相比未预训练,albert_base提升14个点
4、albert_xlarge_zh_177k ;
albert_xlarge_zh_183k(优先尝试)参数量,层数24,文件大小为230M
参数量和模型大小为bert_base的二分之一;需要一张大的显卡;完整测试对比将后续添加;batch_size不能太小,否则可能影响精度
### 快速加载
依托于[Huggingface-Transformers 2.2.2](https://github.com/huggingface/transformers),可轻松调用以上模型。
```
tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
```
其中`MODEL_NAME`对应列表如下:
| 模型名 | MODEL_NAME |
| - | - |
| albert_tiny_google_zh | voidful/albert_chinese_tiny |
| albert_small_google_zh | voidful/albert_chinese_small |
| albert_base_zh (from google) | voidful/albert_chinese_base |
| albert_large_zh (from google) | voidful/albert_chinese_large |
| albert_xlarge_zh (from google) | voidful/albert_chinese_xlarge |
| albert_xxlarge_zh (from google) | voidful/albert_chinese_xxlarge |
更多通过transformers使用albert的示例
预训练 Pre-training
-----------------------------------------------
#### 生成特定格式的文件(tfrecords) Generate tfrecords Files
Run following command 运行以下命令即可。项目自动了一个示例的文本文件(data/news_zh_1.txt)
bash create_pretrain_data.sh
如果你有很多文本文件,可以通过传入参数的方式,生成多个特定格式的文件(tfrecords)
###### Support English and Other Non-Chinese Language:
If you are doing pre-train for english or other language,which is not chinese,
you should set hyperparameter of non_chinese to True on create_pretraining_data.py;
otherwise, by default it is doing chinese pre-train using whole word mask of chinese.
#### 执行预训练 pre-training on GPU/TPU using the command
GPU(brightmart版, tiny模型):
export BERT_BASE_DIR=./albert_tiny_zh
nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_tiny.json \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
--save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
GPU(Google版本, small模型):
export BERT_BASE_DIR=./albert_small_zh_google
nohup python3 run_pretraining_google.py --input_file=./data/tf*.tfrecord --eval_batch_size=64 \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json --export_dir=./my_new_model_path_export \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=20 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
--save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt
TPU, add something like this:
--use_tpu=True --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a
注:如果你重头开始训练,可以不指定init_checkpoint;
如果你从现有的模型基础上训练,指定一下BERT_BASE_DIR的路径,并确保bert_config_file和init_checkpoint两个参数的值能对应到相应的文件上;
领域上的预训练,根据数据的大小,可以不用训练特别久。
环境 Environment
-----------------------------------------------
Use Python3 + Tensorflow 1.x
e.g. Tensorflow 1.4 or 1.5
下游任务 Fine-tuning on Downstream Task
-----------------------------------------------
##### 使用TensorFlow:
以使用albert_base做LCQMC任务为例。LCQMC任务是在口语化描述的数据集上做文本的相似性预测。
We will use LCQMC dataset for fine-tuning, it is oral language corpus, it is used to train and predict semantic similarity of a pair of sentences.
下载LCQMC数据集,包含训练、验证和测试集,训练集包含24万口语化描述的中文句子对,标签为1或0。1为句子语义相似,0为语义不相似。
通过运行下列命令做LCQMC数据集上的fine-tuning:
1. Clone this project:
git clone https://github.com/brightmart/albert_zh.git
2. Fine-tuning by running the following command.
brightmart版本的tiny模型
export BERT_BASE_DIR=./albert_tiny_zh
export TEXT_DIR=./lcqmc
nohup python3 run_classifier.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
--bert_config_file=./albert_config/albert_config_tiny.json --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5 \
--output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
google版本的small模型
export BERT_BASE_DIR=./albert_small_zh
export TEXT_DIR=./lcqmc
nohup python3 run_classifier_sp_google.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
--albert_config_file=./$BERT_BASE_DIR/albert_config_small_google.json --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5 \
--output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
Notice/注:
1) you need to download pre-trained chinese albert model, and also download LCQMC dataset
你需要下载预训练的模型,并放入到项目当前项目,假设目录名称为albert_tiny_zh; 需要下载LCQMC数据集,并放入到当前项目,
假设数据集目录名称为lcqmc
2) for Fine-tuning, you can try to add small percentage of dropout(e.g. 0.1) by changing parameters of
attention_probs_dropout_prob & hidden_dropout_prob on albert_config_xxx.json. By default, we set dropout as zero.
3) you can try different learning rate {2e-5, 6e-5, 1e-4} for better performance
Updates
-----------------------------------------------
**\*\*\*\*\* 2019-11-03: add google version of albert_small, albert_tiny;
add method to deploy ablert_tiny to mobile devices with only 0.1 second inference time for sequence length 128, 60M memory \*\*\*\*\***
**\*\*\*\*\* 2019-10-30: add a simple guide about converting the model to Tensorflow Lite for edge deployment \*\*\*\*\***
**\*\*\*\*\* 2019-10-15: albert_tiny_zh, 10 times fast than bert base for training and inference, accuracy remains \*\*\*\*\***
**\*\*\*\*\* 2019-10-07: more models of albert \*\*\*\*\***
add albert_xlarge_zh; albert_base_zh_additional_steps, training with more instances
**\*\*\*\*\* 2019-10-04: PyTorch and Keras versions of albert were supported \*\*\*\*\***
a.Convert to PyTorch version and do your tasks through albert_pytorch
b.Load pre-trained model with keras using one line of codes through bert4keras
c.Use albert with TensorFlow 2.0: Use or load pre-trained model with tf2.0 through bert-for-tf2
Releasing albert_xlarge on 6th Oct
**\*\*\*\*\* 2019-10-02: albert_large_zh,albert_base_zh \*\*\*\*\***
Relesed albert_base_zh with only 10% parameters of bert_base, a small model(40M) & training can be very fast.
Relased albert_large_zh with only 16% parameters of bert_base(64M)
**\*\*\*\*\* 2019-09-28: codes and test functions \*\*\*\*\***
Add codes and test functions for three main changes of albert from bert
ALBERT模型介绍 Introduction of ALBERT
-----------------------------------------------
ALBERT模型是BERT的改进版,与最近其他State of the art的模型不同的是,这次是预训练小模型,效果更好、参数更少。
它对BERT进行了三个改造 Three main changes of ALBert from Bert:
1)词嵌入向量参数的因式分解 Factorized embedding parameterization
O(V * H) to O(V * E + E * H)
如以ALBert_xxlarge为例,V=30000, H=4096, E=128
那么原先参数为V * H= 30000 * 4096 = 1.23亿个参数,现在则为V * E + E * H = 30000*128+128*4096 = 384万 + 52万 = 436万,
词嵌入相关的参数变化前是变换后的28倍。
2)跨层参数共享 Cross-Layer Parameter Sharing
参数共享能显著减少参数。共享可以分为全连接层、注意力层的参数共享;注意力层的参数对效果的减弱影响小一点。
3)段落连续性任务 Inter-sentence coherence loss.
使用段落连续性任务。正例,使用从一个文档中连续的两个文本段落;负例,使用从一个文档中连续的两个文本段落,但位置调换了。
避免使用原有的NSP任务,原有的任务包含隐含了预测主题这类过于简单的任务。
We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss
based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic
prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the
same technique as BERT (two consecutive segments from the same document), and as negative examples the same two
consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about
discourse-level coherence properties.
其他变化,还有 Other changes:
1)去掉了dropout Remove dropout to enlarge capacity of model.
最大的模型,训练了1百万步后,还是没有过拟合训练数据。说明模型的容量还可以更大,就移除了dropout
(dropout可以认为是随机的去掉网络中的一部分,同时使网络变小一些)
We also note that, even after training for 1M steps, our largest models still do not overfit to their training data.
As a result, we decide to remove dropout to further increase our model capacity.
其他型号的模型,在我们的实现中我们还是会保留原始的dropout的比例,防止模型对训练数据的过拟合。
2)为加快训练速度,使用LAMB做为优化器 Use LAMB as optimizer, to train with big batch size
使用了大的batch_size来训练(4096)。 LAMB优化器使得我们可以训练,特别大的批次batch_size,如高达6万。
3)使用n-gram(uni-gram,bi-gram, tri-gram)来做遮蔽语言模型 Use n-gram as make language model
即以不同的概率使用n-gram,uni-gram的概率最大,bi-gram其次,tri-gram概率最小。
本项目中目前使用的是在中文上做whole word mask,稍后会更新一下与n-gram mask的效果对比。n-gram从spanBERT中来。
训练语料/训练配置 Training Data & Configuration
-----------------------------------------------
30g中文语料,超过100亿汉字,包括多个百科、新闻、互动社区。
预训练序列长度sequence_length设置为512,批次batch_size为4096,训练产生了3.5亿个训练数据(instance);每一个模型默认会训练125k步,albert_xxlarge将训练更久。
作为比较,roberta_zh预训练产生了2.5亿个训练数据、序列长度为256。由于albert_zh预训练生成的训练数据更多、使用的序列长度更长,
我们预计albert_zh会有比roberta_zh更好的性能表现,并且能更好处理较长的文本。
训练使用TPU v3 Pod,我们使用的是v3-256,它包含32个v3-8。每个v3-8机器,含有128G的显存。
模型性能与对比(英文) Performance and Comparision
-----------------------------------------------
中文任务集上效果对比测试 Performance on Chinese datasets
-----------------------------------------------
### 问题匹配语任务:LCQMC(Sentence Pair Matching)
| 模型 | 开发集(Dev) | 测试集(Test) |
| :------- | :---------: | :---------: |
| BERT | 89.4(88.4) | 86.9(86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm |89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | - |- |
| RoBERTa-zh-base | 88.7 | 87.0 |
| RoBERTa-zh-Large | ***89.9(89.6)*** | 87.2(86.7) |
| RoBERTa-zh-Large(20w_steps) | 89.7| 87.0 |
| ALBERT-zh-tiny | -- | 85.4 |
| ALBERT-zh-small | -- | 86.0 |
| ALBERT-zh-small(Pytorch) | -- | 86.8 |
| ALBERT-zh-base-additional-36k-steps | 87.8 | 86.3 |
| ALBERT-zh-base | 87.2 | 86.3 |
| ALBERT-large | 88.7 | 87.1 |
| ALBERT-xlarge | 87.3 | ***87.7*** |
注:只跑了一次ALBERT-xlarge,效果还可能提升
### 自然语言推断:XNLI of Chinese Version
| 模型 | 开发集 | 测试集 |
| :------- | :---------: | :---------: |
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| XLNet | 79.2 | 78.7 |
| RoBERTa-zh-base | 79.8 |78.8 |
| RoBERTa-zh-Large | 80.2 (80.0) | 79.9 (79.5) |
| ALBERT-base | 77.0 | 77.1 |
| ALBERT-large | 78.0 | 77.5 |
| ALBERT-xlarge | ? | ? |
注:BERT-wwm-ext来自于这里;XLNet来自于这里; RoBERTa-zh-base,指12层RoBERTa中文模型
### 阅读理解任务:CRMC2018
### 语言模型、文本段预测准确性、训练时间 Mask Language Model Accuarcy & Training Time
| Model | MLM eval acc | SOP eval acc | Training(Hours) | Loss eval |
| :------- | :---------: | :---------: | :---------: |:---------: |
| albert_zh_base | 79.1% | 99.0% | 6h | 1.01|
| albert_zh_large | 80.9% | 98.6% | 22.5h | 0.93|
| albert_zh_xlarge | ? | ? | 53h(预估) | ? |
| albert_zh_xxlarge | ? | ? | 106h(预估) | ? |
注:? 将很快替换
模型参数和配置 Configuration of Models
-----------------------------------------------
代码实现和测试 Implementation and Code Testing
-----------------------------------------------
通过运行以下命令测试主要的改进点,包括但不限于词嵌入向量参数的因式分解、跨层参数共享、段落连续性任务等。
python test_changes.py
##### 使用TensorFlow Lite(TFLite)在移动端进行部署:
这里我们主要介绍TFLite模型格式转换和性能测试。转换成TFLite模型后,对于如何在移
动端使用该模型,可以参考TFLite提供的[Android/iOS应用完整开发案例教程页面](https://www.tensorflow.org/lite/examples)。
该页面目前已经包含了[文本分类](https://github.com/tensorflow/examples/blob/master/lite/examples/text_classification/android),
[文本问答](https://github.com/tensorflow/examples/blob/master/lite/examples/bert_qa/android)两个Android案例。
下面以albert_tiny_zh
为例来介绍TFLite模型格式转换和性能测试:
1. Freeze graph from the checkpoint
Ensure to have >=1.14 1.x installed to use the freeze_graph tool as it is removed from 2.x distribution
pip install tensorflow==1.15
freeze_graph --input_checkpoint=./albert_model.ckpt \
--output_graph=/tmp/albert_tiny_zh.pb \
--output_node_names=cls/predictions/truediv \
--checkpoint_version=1 --input_meta_graph=./albert_model.ckpt.meta --input_binary=true
2. Convert to TFLite format
We are going to use the new experimental tf->tflite converter that's distributed with the Tensorflow nightly build.
pip install tf-nightly
tflite_convert --graph_def_file=/tmp/albert_tiny_zh.pb \
--input_arrays='input_ids,input_mask,segment_ids,masked_lm_positions,masked_lm_ids,masked_lm_weights' \
--output_arrays='cls/predictions/truediv' \
--input_shapes=1,128:1,128:128:1,128:1,128:1,128 \
--output_file=/tmp/albert_tiny_zh.tflite \
--enable_v1_converter --experimental_new_converter
3. Benchmark the performance of the TFLite model
See [here](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark)
for details about the performance benchmark tools in TFLite. For example: after
building the benchmark tool binary for an Android phone, do the following to
get an idea of how the TFLite model performs on the phone
adb push /tmp/albert_tiny_zh.tflite /data/local/tmp/
adb shell /data/local/tmp/benchmark_model_performance_options --graph=/data/local/tmp/albert_tiny_zh.tflite --perf_options_list=cpu
On an Android phone w/ Qualcomm's SD845 SoC, via the above benchmark tool, as
of 2019/11/01, the inference latency is ~120ms w/ this converted TFLite model
using 4 threads on CPU, and the memory usage is ~60MB for the model during
inference. Note the performance will improve further with future TFLite
implementation optimizations.
##### 使用PyTorch版本:
download pre-trained model, and convert to PyTorch using:
python convert_albert_tf_checkpoint_to_pytorch.py
using albert_pytorch
##### 使用Keras加载:
bert4keras 适配albert,能成功加载albert_zh的权重,只需要在load_pretrained_model函数里加上albert=True
load pre-trained model with bert4keras
##### 使用tf2.0加载:
bert-for-tf2
使用案例-基于用户输入预测文本相似性 Use Case-Text Similarity Based on User Input
-------------------------------------------------
功能说明:用户可以通过本例了解如何加载训训练集实现基于用户输入的短文本相似度判断。可以基于该代码将程序灵活地拓展为后台服务或增加文本分类等示例。
涉及代码:similarity.py、args.py
步骤:
1、使用本模型进行文本相似性训练,保存模型文件至相应目录下
2、根据实际情况,修改args.py中的参数,参数说明如下:
```python
#模型目录,存放ckpt文件
model_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
#config文件,存放模型的json文件
config_name = os.path.join(file_path, 'albert_config/albert_config_tiny.json')
#ckpt文件名称
ckpt_name = os.path.join(model_dir, 'model.ckpt')
#输出文件目录,训练时的模型输出目录
output_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
#vocab文件目录
vocab_file = os.path.join(file_path, 'albert_config/vocab.txt')
#数据目录,训练使用的数据集存放目录
data_dir = os.path.join(file_path, 'data/')
```
本例中的文件结构为:
|__args.py
|__similarity.py
|__data
|__albert_config
|__albert_lcqmc_checkpoints
|__lcqmc
3、修改用户输入单词
打开similarity.py,最底部如下代码:
```python
if __name__ == '__main__':
sim = BertSim()
sim.start_model()
sim.predict_sentences([("我喜欢妈妈做的汤", "妈妈做的汤我很喜欢喝")])
```
其中sim.start_model()表示加载模型,sim.predict_sentences的输入为一个元组数组,元组中包含两个元素分别为需要判定相似的句子。
4、运行python文件:similarity.py
支持的序列长度与批次大小的关系,12G显存 Trade off between batch Size and sequence length
-------------------------------------------------
System | Seq Length | Max Batch Size
------------ | ---------- | --------------
`albert-base` | 64 | 64
... | 128 | 32
... | 256 | 16
... | 320 | 14
... | 384 | 12
... | 512 | 6
`albert-large` | 64 | 12
... | 128 | 6
... | 256 | 2
... | 320 | 1
... | 384 | 0
... | 512 | 0
`albert-xlarge` | - | -
学习曲线 Training Loss of xlarge of albert_zh
-------------------------------------------------
所有的参数 Parameters of albert_xlarge
-------------------------------------------------
#### 技术交流与问题讨论QQ群: 836811304 Join us on QQ group
If you have any question, you can raise an issue, or send me an email: brightmart@hotmail.com;
Currently how to use PyTorch version of albert is not clear yet, if you know how to do that, just email us or open an issue.
You can also send pull request to report you performance on your task or add methods on how to load models for PyTorch and so on.
If you have ideas for generate best performance pre-training Chinese model, please also let me know.
##### Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)
Cite Us
-----------------------------------------------
Bright Liang Xu, albert_zh, (2019), GitHub repository, https://github.com/brightmart/albert_zh
Reference
-----------------------------------------------
1、ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations
2、BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
3、SpanBERT: Improving Pre-training by Representing and Predicting Spans
4、RoBERTa: A Robustly Optimized BERT Pretraining Approach
5、Large Batch Optimization for Deep Learning: Training BERT in 76 minutes(LAMB)
6、LAMB Optimizer,TensorFlow version
7、预训练小模型也能拿下13项NLP任务,ALBERT三大改造登顶GLUE基准
8、 albert_pytorch
9、load albert with keras
10、load albert with tf2.0
11、repo of albert from google
12、chineseGLUE-中文任务基准测评:公开可用多个任务、基线模型、广泛测评与效果对比
================================================
FILE: albert_config/albert_config_base.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 768,
"embedding_size": 128,
"initializer_range": 0.02,
"intermediate_size": 3072 ,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"ln_type":"postln"
}
================================================
FILE: albert_config/albert_config_base_google_fast.json
================================================
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"embedding_size": 128,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_hidden_groups": 12,
"net_structure_type": 0,
"gap_size": 0,
"num_memory_blocks": 0,
"inner_group_num": 1,
"down_scale_factor": 1,
"type_vocab_size": 2,
"vocab_size": 21128
}
================================================
FILE: albert_config/albert_config_large.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 1024,
"embedding_size": 128,
"initializer_range": 0.02,
"intermediate_size": 4096,
"max_position_embeddings": 512,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"ln_type":"postln"
}
================================================
FILE: albert_config/albert_config_small_google.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"embedding_size": 128,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 6,
"num_hidden_groups": 1,
"net_structure_type": 0,
"gap_size": 0,
"num_memory_blocks": 0,
"inner_group_num": 1,
"down_scale_factor": 1,
"type_vocab_size": 2,
"vocab_size": 21128
}
================================================
FILE: albert_config/albert_config_tiny.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 312,
"embedding_size": 128,
"initializer_range": 0.02,
"intermediate_size": 1248 ,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 4,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"ln_type":"postln"
}
================================================
FILE: albert_config/albert_config_tiny_google.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"embedding_size": 128,
"hidden_size": 312,
"initializer_range": 0.02,
"intermediate_size": 1248,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 4,
"num_hidden_groups": 1,
"net_structure_type": 0,
"gap_size": 0,
"num_memory_blocks": 0,
"inner_group_num": 1,
"down_scale_factor": 1,
"type_vocab_size": 2,
"vocab_size": 21128
}
================================================
FILE: albert_config/albert_config_tiny_google_fast.json
================================================
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"embedding_size": 128,
"hidden_size": 336,
"initializer_range": 0.02,
"intermediate_size": 1344,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 4,
"num_hidden_groups": 12,
"net_structure_type": 0,
"gap_size": 0,
"num_memory_blocks": 0,
"inner_group_num": 1,
"down_scale_factor": 1,
"type_vocab_size": 2,
"vocab_size": 21128
}
================================================
FILE: albert_config/albert_config_xlarge.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 2048,
"embedding_size": 128,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 512,
"num_attention_heads": 32,
"num_hidden_layers": 24,
"pooler_fc_size": 1024,
"pooler_num_attention_heads": 64,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"ln_type":"postln"
}
================================================
FILE: albert_config/albert_config_xxlarge.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 4096,
"embedding_size": 128,
"initializer_range": 0.02,
"intermediate_size": 16384,
"max_position_embeddings": 512,
"num_attention_heads": 64,
"num_hidden_layers": 12,
"pooler_fc_size": 1024,
"pooler_num_attention_heads": 64,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"ln_type":"preln"
}
================================================
FILE: albert_config/bert_config.json
================================================
{
"attention_probs_dropout_prob": 0.0,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128
}
================================================
FILE: albert_config/vocab.txt
================================================
[PAD]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]
[unused9]
[unused10]
[unused11]
[unused12]
[unused13]
[unused14]
[unused15]
[unused16]
[unused17]
[unused18]
[unused19]
[unused20]
[unused21]
[unused22]
[unused23]
[unused24]
[unused25]
[unused26]
[unused27]
[unused28]
[unused29]
[unused30]
[unused31]
[unused32]
[unused33]
[unused34]
[unused35]
[unused36]
[unused37]
[unused38]
[unused39]
[unused40]
[unused41]
[unused42]
[unused43]
[unused44]
[unused45]
[unused46]
[unused47]
[unused48]
[unused49]
[unused50]
[unused51]
[unused52]
[unused53]
[unused54]
[unused55]
[unused56]
[unused57]
[unused58]
[unused59]
[unused60]
[unused61]
[unused62]
[unused63]
[unused64]
[unused65]
[unused66]
[unused67]
[unused68]
[unused69]
[unused70]
[unused71]
[unused72]
[unused73]
[unused74]
[unused75]
[unused76]
[unused77]
[unused78]
[unused79]
[unused80]
[unused81]
[unused82]
[unused83]
[unused84]
[unused85]
[unused86]
[unused87]
[unused88]
[unused89]
[unused90]
[unused91]
[unused92]
[unused93]
[unused94]
[unused95]
[unused96]
[unused97]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
[
\
]
^
_
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
£
¤
¥
§
©
«
®
°
±
²
³
µ
·
¹
º
»
¼
×
ß
æ
÷
ø
đ
ŋ
ɔ
ə
ɡ
ʰ
ˇ
ˈ
ˊ
ˋ
ˍ
ː
˙
˚
ˢ
α
β
γ
δ
ε
η
θ
ι
κ
λ
μ
ν
ο
π
ρ
ς
σ
τ
υ
φ
χ
ψ
ω
а
б
в
г
д
е
ж
з
и
к
л
м
н
о
п
р
с
т
у
ф
х
ц
ч
ш
ы
ь
я
і
ا
ب
ة
ت
د
ر
س
ع
ل
م
ن
ه
و
ي
۩
ก
ง
น
ม
ย
ร
อ
า
เ
๑
་
ღ
ᄀ
ᄁ
ᄂ
ᄃ
ᄅ
ᄆ
ᄇ
ᄈ
ᄉ
ᄋ
ᄌ
ᄎ
ᄏ
ᄐ
ᄑ
ᄒ
ᅡ
ᅢ
ᅣ
ᅥ
ᅦ
ᅧ
ᅨ
ᅩ
ᅪ
ᅬ
ᅭ
ᅮ
ᅯ
ᅲ
ᅳ
ᅴ
ᅵ
ᆨ
ᆫ
ᆯ
ᆷ
ᆸ
ᆺ
ᆻ
ᆼ
ᗜ
ᵃ
ᵉ
ᵍ
ᵏ
ᵐ
ᵒ
ᵘ
‖
„
†
•
‥
‧
‰
′
″
‹
›
※
‿
⁄
ⁱ
⁺
ⁿ
₁
₂
₃
₄
€
℃
№
™
ⅰ
ⅱ
ⅲ
ⅳ
ⅴ
←
↑
→
↓
↔
↗
↘
⇒
∀
−
∕
∙
√
∞
∟
∠
∣
∥
∩
∮
∶
∼
∽
≈
≒
≡
≤
≥
≦
≧
≪
≫
⊙
⋅
⋈
⋯
⌒
①
②
③
④
⑤
⑥
⑦
⑧
⑨
⑩
⑴
⑵
⑶
⑷
⑸
⒈
⒉
⒊
⒋
ⓒ
ⓔ
ⓘ
─
━
│
┃
┅
┆
┊
┌
└
├
┣
═
║
╚
╞
╠
╭
╮
╯
╰
╱
╳
▂
▃
▅
▇
█
▉
▋
▌
▍
▎
■
□
▪
▫
▬
▲
△
▶
►
▼
▽
◆
◇
○
◎
●
◕
◠
◢
◤
☀
★
☆
☕
☞
☺
☼
♀
♂
♠
♡
♣
♥
♦
♪
♫
♬
✈
✔
✕
✖
✦
✨
✪
✰
✿
❀
❤
➜
➤
⦿
、
。
〃
々
〇
〈
〉
《
》
「
」
『
』
【
】
〓
〔
〕
〖
〗
〜
〝
〞
ぁ
あ
ぃ
い
う
ぇ
え
お
か
き
く
け
こ
さ
し
す
せ
そ
た
ち
っ
つ
て
と
な
に
ぬ
ね
の
は
ひ
ふ
へ
ほ
ま
み
む
め
も
ゃ
や
ゅ
ゆ
ょ
よ
ら
り
る
れ
ろ
わ
を
ん
゜
ゝ
ァ
ア
ィ
イ
ゥ
ウ
ェ
エ
ォ
オ
カ
キ
ク
ケ
コ
サ
シ
ス
セ
ソ
タ
チ
ッ
ツ
テ
ト
ナ
ニ
ヌ
ネ
ノ
ハ
ヒ
フ
ヘ
ホ
マ
ミ
ム
メ
モ
ャ
ヤ
ュ
ユ
ョ
ヨ
ラ
リ
ル
レ
ロ
ワ
ヲ
ン
ヶ
・
ー
ヽ
ㄅ
ㄆ
ㄇ
ㄉ
ㄋ
ㄌ
ㄍ
ㄎ
ㄏ
ㄒ
ㄚ
ㄛ
ㄞ
ㄟ
ㄢ
ㄤ
ㄥ
ㄧ
ㄨ
ㆍ
㈦
㊣
㎡
㗎
一
丁
七
万
丈
三
上
下
不
与
丐
丑
专
且
丕
世
丘
丙
业
丛
东
丝
丞
丟
両
丢
两
严
並
丧
丨
个
丫
中
丰
串
临
丶
丸
丹
为
主
丼
丽
举
丿
乂
乃
久
么
义
之
乌
乍
乎
乏
乐
乒
乓
乔
乖
乗
乘
乙
乜
九
乞
也
习
乡
书
乩
买
乱
乳
乾
亀
亂
了
予
争
事
二
于
亏
云
互
五
井
亘
亙
亚
些
亜
亞
亟
亡
亢
交
亥
亦
产
亨
亩
享
京
亭
亮
亲
亳
亵
人
亿
什
仁
仃
仄
仅
仆
仇
今
介
仍
从
仏
仑
仓
仔
仕
他
仗
付
仙
仝
仞
仟
代
令
以
仨
仪
们
仮
仰
仲
件
价
任
份
仿
企
伉
伊
伍
伎
伏
伐
休
伕
众
优
伙
会
伝
伞
伟
传
伢
伤
伦
伪
伫
伯
估
伴
伶
伸
伺
似
伽
佃
但
佇
佈
位
低
住
佐
佑
体
佔
何
佗
佘
余
佚
佛
作
佝
佞
佟
你
佢
佣
佤
佥
佩
佬
佯
佰
佳
併
佶
佻
佼
使
侃
侄
來
侈
例
侍
侏
侑
侖
侗
供
依
侠
価
侣
侥
侦
侧
侨
侬
侮
侯
侵
侶
侷
便
係
促
俄
俊
俎
俏
俐
俑
俗
俘
俚
保
俞
俟
俠
信
俨
俩
俪
俬
俭
修
俯
俱
俳
俸
俺
俾
倆
倉
個
倌
倍
倏
們
倒
倔
倖
倘
候
倚
倜
借
倡
値
倦
倩
倪
倫
倬
倭
倶
债
值
倾
偃
假
偈
偉
偌
偎
偏
偕
做
停
健
側
偵
偶
偷
偻
偽
偿
傀
傅
傍
傑
傘
備
傚
傢
傣
傥
储
傩
催
傭
傲
傳
債
傷
傻
傾
僅
働
像
僑
僕
僖
僚
僥
僧
僭
僮
僱
僵
價
僻
儀
儂
億
儆
儉
儋
儒
儕
儘
償
儡
優
儲
儷
儼
儿
兀
允
元
兄
充
兆
兇
先
光
克
兌
免
児
兑
兒
兔
兖
党
兜
兢
入
內
全
兩
八
公
六
兮
兰
共
兲
关
兴
兵
其
具
典
兹
养
兼
兽
冀
内
円
冇
冈
冉
冊
册
再
冏
冒
冕
冗
写
军
农
冠
冢
冤
冥
冨
冪
冬
冯
冰
冲
决
况
冶
冷
冻
冼
冽
冾
净
凄
准
凇
凈
凉
凋
凌
凍
减
凑
凛
凜
凝
几
凡
凤
処
凪
凭
凯
凰
凱
凳
凶
凸
凹
出
击
函
凿
刀
刁
刃
分
切
刈
刊
刍
刎
刑
划
列
刘
则
刚
创
初
删
判
別
刨
利
刪
别
刮
到
制
刷
券
刹
刺
刻
刽
剁
剂
剃
則
剉
削
剋
剌
前
剎
剐
剑
剔
剖
剛
剜
剝
剣
剤
剥
剧
剩
剪
副
割
創
剷
剽
剿
劃
劇
劈
劉
劊
劍
劏
劑
力
劝
办
功
加
务
劣
动
助
努
劫
劭
励
劲
劳
労
劵
効
劾
势
勁
勃
勇
勉
勋
勐
勒
動
勖
勘
務
勛
勝
勞
募
勢
勤
勧
勳
勵
勸
勺
勻
勾
勿
匀
包
匆
匈
匍
匐
匕
化
北
匙
匝
匠
匡
匣
匪
匮
匯
匱
匹
区
医
匾
匿
區
十
千
卅
升
午
卉
半
卍
华
协
卑
卒
卓
協
单
卖
南
単
博
卜
卞
卟
占
卡
卢
卤
卦
卧
卫
卮
卯
印
危
即
却
卵
卷
卸
卻
卿
厂
厄
厅
历
厉
压
厌
厕
厘
厚
厝
原
厢
厥
厦
厨
厩
厭
厮
厲
厳
去
县
叁
参
參
又
叉
及
友
双
反
収
发
叔
取
受
变
叙
叛
叟
叠
叡
叢
口
古
句
另
叨
叩
只
叫
召
叭
叮
可
台
叱
史
右
叵
叶
号
司
叹
叻
叼
叽
吁
吃
各
吆
合
吉
吊
吋
同
名
后
吏
吐
向
吒
吓
吕
吖
吗
君
吝
吞
吟
吠
吡
否
吧
吨
吩
含
听
吭
吮
启
吱
吳
吴
吵
吶
吸
吹
吻
吼
吽
吾
呀
呂
呃
呆
呈
告
呋
呎
呐
呓
呕
呗
员
呛
呜
呢
呤
呦
周
呱
呲
味
呵
呷
呸
呻
呼
命
咀
咁
咂
咄
咆
咋
和
咎
咏
咐
咒
咔
咕
咖
咗
咘
咙
咚
咛
咣
咤
咦
咧
咨
咩
咪
咫
咬
咭
咯
咱
咲
咳
咸
咻
咽
咿
哀
品
哂
哄
哆
哇
哈
哉
哋
哌
响
哎
哏
哐
哑
哒
哔
哗
哟
員
哥
哦
哧
哨
哩
哪
哭
哮
哲
哺
哼
哽
唁
唄
唆
唇
唉
唏
唐
唑
唔
唠
唤
唧
唬
售
唯
唰
唱
唳
唷
唸
唾
啃
啄
商
啉
啊
問
啓
啕
啖
啜
啞
啟
啡
啤
啥
啦
啧
啪
啫
啬
啮
啰
啱
啲
啵
啶
啷
啸
啻
啼
啾
喀
喂
喃
善
喆
喇
喉
喊
喋
喎
喏
喔
喘
喙
喚
喜
喝
喟
喧
喪
喫
喬
單
喰
喱
喲
喳
喵
営
喷
喹
喺
喻
喽
嗅
嗆
嗇
嗎
嗑
嗒
嗓
嗔
嗖
嗚
嗜
嗝
嗟
嗡
嗣
嗤
嗦
嗨
嗪
嗬
嗯
嗰
嗲
嗳
嗶
嗷
嗽
嘀
嘅
嘆
嘈
嘉
嘌
嘍
嘎
嘔
嘖
嘗
嘘
嘚
嘛
嘜
嘞
嘟
嘢
嘣
嘤
嘧
嘩
嘭
嘮
嘯
嘰
嘱
嘲
嘴
嘶
嘸
嘹
嘻
嘿
噁
噌
噎
噓
噔
噗
噙
噜
噠
噢
噤
器
噩
噪
噬
噱
噴
噶
噸
噹
噻
噼
嚀
嚇
嚎
嚏
嚐
嚓
嚕
嚟
嚣
嚥
嚨
嚮
嚴
嚷
嚼
囂
囉
囊
囍
囑
囔
囗
囚
四
囝
回
囟
因
囡
团
団
囤
囧
囪
囫
园
困
囱
囲
図
围
囹
固
国
图
囿
圃
圄
圆
圈
國
圍
圏
園
圓
圖
團
圜
土
圣
圧
在
圩
圭
地
圳
场
圻
圾
址
坂
均
坊
坍
坎
坏
坐
坑
块
坚
坛
坝
坞
坟
坠
坡
坤
坦
坨
坪
坯
坳
坵
坷
垂
垃
垄
型
垒
垚
垛
垠
垢
垣
垦
垩
垫
垭
垮
垵
埂
埃
埋
城
埔
埕
埗
域
埠
埤
埵
執
埸
培
基
埼
堀
堂
堃
堅
堆
堇
堑
堕
堙
堡
堤
堪
堯
堰
報
場
堵
堺
堿
塊
塌
塑
塔
塗
塘
塚
塞
塢
塩
填
塬
塭
塵
塾
墀
境
墅
墉
墊
墒
墓
増
墘
墙
墜
增
墟
墨
墩
墮
墳
墻
墾
壁
壅
壆
壇
壊
壑
壓
壕
壘
壞
壟
壢
壤
壩
士
壬
壮
壯
声
売
壳
壶
壹
壺
壽
处
备
変
复
夏
夔
夕
外
夙
多
夜
够
夠
夢
夥
大
天
太
夫
夭
央
夯
失
头
夷
夸
夹
夺
夾
奂
奄
奇
奈
奉
奋
奎
奏
奐
契
奔
奕
奖
套
奘
奚
奠
奢
奥
奧
奪
奬
奮
女
奴
奶
奸
她
好
如
妃
妄
妆
妇
妈
妊
妍
妒
妓
妖
妘
妙
妝
妞
妣
妤
妥
妨
妩
妪
妮
妲
妳
妹
妻
妾
姆
姉
姊
始
姍
姐
姑
姒
姓
委
姗
姚
姜
姝
姣
姥
姦
姨
姪
姫
姬
姹
姻
姿
威
娃
娄
娅
娆
娇
娉
娑
娓
娘
娛
娜
娟
娠
娣
娥
娩
娱
娲
娴
娶
娼
婀
婁
婆
婉
婊
婕
婚
婢
婦
婧
婪
婭
婴
婵
婶
婷
婺
婿
媒
媚
媛
媞
媧
媲
媳
媽
媾
嫁
嫂
嫉
嫌
嫑
嫔
嫖
嫘
嫚
嫡
嫣
嫦
嫩
嫲
嫵
嫻
嬅
嬉
嬌
嬗
嬛
嬢
嬤
嬪
嬰
嬴
嬷
嬸
嬿
孀
孃
子
孑
孔
孕
孖
字
存
孙
孚
孛
孜
孝
孟
孢
季
孤
学
孩
孪
孫
孬
孰
孱
孳
孵
學
孺
孽
孿
宁
它
宅
宇
守
安
宋
完
宏
宓
宕
宗
官
宙
定
宛
宜
宝
实
実
宠
审
客
宣
室
宥
宦
宪
宫
宮
宰
害
宴
宵
家
宸
容
宽
宾
宿
寂
寄
寅
密
寇
富
寐
寒
寓
寛
寝
寞
察
寡
寢
寥
實
寧
寨
審
寫
寬
寮
寰
寵
寶
寸
对
寺
寻
导
対
寿
封
専
射
将
將
專
尉
尊
尋
對
導
小
少
尔
尕
尖
尘
尚
尝
尤
尧
尬
就
尴
尷
尸
尹
尺
尻
尼
尽
尾
尿
局
屁
层
屄
居
屆
屈
屉
届
屋
屌
屍
屎
屏
屐
屑
展
屜
属
屠
屡
屢
層
履
屬
屯
山
屹
屿
岀
岁
岂
岌
岐
岑
岔
岖
岗
岘
岙
岚
岛
岡
岩
岫
岬
岭
岱
岳
岷
岸
峇
峋
峒
峙
峡
峤
峥
峦
峨
峪
峭
峯
峰
峴
島
峻
峽
崁
崂
崆
崇
崎
崑
崔
崖
崗
崙
崛
崧
崩
崭
崴
崽
嵇
嵊
嵋
嵌
嵐
嵘
嵩
嵬
嵯
嶂
嶄
嶇
嶋
嶙
嶺
嶼
嶽
巅
巍
巒
巔
巖
川
州
巡
巢
工
左
巧
巨
巩
巫
差
己
已
巳
巴
巷
巻
巽
巾
巿
币
市
布
帅
帆
师
希
帐
帑
帕
帖
帘
帚
帛
帜
帝
帥
带
帧
師
席
帮
帯
帰
帳
帶
帷
常
帼
帽
幀
幂
幄
幅
幌
幔
幕
幟
幡
幢
幣
幫
干
平
年
并
幸
幹
幺
幻
幼
幽
幾
广
庁
広
庄
庆
庇
床
序
庐
库
应
底
庖
店
庙
庚
府
庞
废
庠
度
座
庫
庭
庵
庶
康
庸
庹
庾
廁
廂
廃
廈
廉
廊
廓
廖
廚
廝
廟
廠
廢
廣
廬
廳
延
廷
建
廿
开
弁
异
弃
弄
弈
弊
弋
式
弑
弒
弓
弔
引
弗
弘
弛
弟
张
弥
弦
弧
弩
弭
弯
弱
張
強
弹
强
弼
弾
彅
彆
彈
彌
彎
归
当
录
彗
彙
彝
形
彤
彥
彦
彧
彩
彪
彫
彬
彭
彰
影
彷
役
彻
彼
彿
往
征
径
待
徇
很
徉
徊
律
後
徐
徑
徒
従
徕
得
徘
徙
徜
從
徠
御
徨
復
循
徬
微
徳
徴
徵
德
徹
徼
徽
心
必
忆
忌
忍
忏
忐
忑
忒
忖
志
忘
忙
応
忠
忡
忤
忧
忪
快
忱
念
忻
忽
忿
怀
态
怂
怅
怆
怎
怏
怒
怔
怕
怖
怙
怜
思
怠
怡
急
怦
性
怨
怪
怯
怵
总
怼
恁
恃
恆
恋
恍
恐
恒
恕
恙
恚
恢
恣
恤
恥
恨
恩
恪
恫
恬
恭
息
恰
恳
恵
恶
恸
恺
恻
恼
恿
悄
悅
悉
悌
悍
悔
悖
悚
悟
悠
患
悦
您
悩
悪
悬
悯
悱
悲
悴
悵
悶
悸
悻
悼
悽
情
惆
惇
惊
惋
惑
惕
惘
惚
惜
惟
惠
惡
惦
惧
惨
惩
惫
惬
惭
惮
惯
惰
惱
想
惴
惶
惹
惺
愁
愆
愈
愉
愍
意
愕
愚
愛
愜
感
愣
愤
愧
愫
愷
愿
慄
慈
態
慌
慎
慑
慕
慘
慚
慟
慢
慣
慧
慨
慫
慮
慰
慳
慵
慶
慷
慾
憂
憊
憋
憎
憐
憑
憔
憚
憤
憧
憨
憩
憫
憬
憲
憶
憾
懂
懇
懈
應
懊
懋
懑
懒
懦
懲
懵
懶
懷
懸
懺
懼
懾
懿
戀
戈
戊
戌
戍
戎
戏
成
我
戒
戕
或
战
戚
戛
戟
戡
戦
截
戬
戮
戰
戲
戳
戴
戶
户
戸
戻
戾
房
所
扁
扇
扈
扉
手
才
扎
扑
扒
打
扔
払
托
扛
扣
扦
执
扩
扪
扫
扬
扭
扮
扯
扰
扱
扳
扶
批
扼
找
承
技
抄
抉
把
抑
抒
抓
投
抖
抗
折
抚
抛
抜
択
抟
抠
抡
抢
护
报
抨
披
抬
抱
抵
抹
押
抽
抿
拂
拄
担
拆
拇
拈
拉
拋
拌
拍
拎
拐
拒
拓
拔
拖
拗
拘
拙
拚
招
拜
拟
拡
拢
拣
拥
拦
拧
拨
择
括
拭
拮
拯
拱
拳
拴
拷
拼
拽
拾
拿
持
挂
指
挈
按
挎
挑
挖
挙
挚
挛
挝
挞
挟
挠
挡
挣
挤
挥
挨
挪
挫
振
挲
挹
挺
挽
挾
捂
捅
捆
捉
捋
捌
捍
捎
捏
捐
捕
捞
损
捡
换
捣
捧
捨
捩
据
捱
捲
捶
捷
捺
捻
掀
掂
掃
掇
授
掉
掌
掏
掐
排
掖
掘
掙
掛
掠
採
探
掣
接
控
推
掩
措
掬
掰
掲
掳
掴
掷
掸
掺
揀
揃
揄
揆
揉
揍
描
提
插
揖
揚
換
握
揣
揩
揪
揭
揮
援
揶
揸
揹
揽
搀
搁
搂
搅
損
搏
搐
搓
搔
搖
搗
搜
搞
搡
搪
搬
搭
搵
搶
携
搽
摀
摁
摄
摆
摇
摈
摊
摒
摔
摘
摞
摟
摧
摩
摯
摳
摸
摹
摺
摻
撂
撃
撅
撇
撈
撐
撑
撒
撓
撕
撚
撞
撤
撥
撩
撫
撬
播
撮
撰
撲
撵
撷
撸
撻
撼
撿
擀
擁
擂
擄
擅
擇
擊
擋
操
擎
擒
擔
擘
據
擞
擠
擡
擢
擦
擬
擰
擱
擲
擴
擷
擺
擼
擾
攀
攏
攒
攔
攘
攙
攜
攝
攞
攢
攣
攤
攥
攪
攫
攬
支
收
攸
改
攻
放
政
故
效
敌
敍
敎
敏
救
敕
敖
敗
敘
教
敛
敝
敞
敢
散
敦
敬
数
敲
整
敵
敷
數
斂
斃
文
斋
斌
斎
斐
斑
斓
斗
料
斛
斜
斟
斡
斤
斥
斧
斩
斫
斬
断
斯
新
斷
方
於
施
旁
旃
旅
旋
旌
旎
族
旖
旗
无
既
日
旦
旧
旨
早
旬
旭
旮
旱
时
旷
旺
旻
昀
昂
昆
昇
昉
昊
昌
明
昏
易
昔
昕
昙
星
映
春
昧
昨
昭
是
昱
昴
昵
昶
昼
显
晁
時
晃
晉
晋
晌
晏
晒
晓
晔
晕
晖
晗
晚
晝
晞
晟
晤
晦
晨
晩
普
景
晰
晴
晶
晷
智
晾
暂
暄
暇
暈
暉
暌
暐
暑
暖
暗
暝
暢
暧
暨
暫
暮
暱
暴
暸
暹
曄
曆
曇
曉
曖
曙
曜
曝
曠
曦
曬
曰
曲
曳
更
書
曹
曼
曾
替
最
會
月
有
朋
服
朐
朔
朕
朗
望
朝
期
朦
朧
木
未
末
本
札
朮
术
朱
朴
朵
机
朽
杀
杂
权
杆
杈
杉
李
杏
材
村
杓
杖
杜
杞
束
杠
条
来
杨
杭
杯
杰
東
杳
杵
杷
杼
松
板
极
构
枇
枉
枋
析
枕
林
枚
果
枝
枢
枣
枪
枫
枭
枯
枰
枱
枳
架
枷
枸
柄
柏
某
柑
柒
染
柔
柘
柚
柜
柞
柠
柢
查
柩
柬
柯
柱
柳
柴
柵
査
柿
栀
栃
栄
栅
标
栈
栉
栋
栎
栏
树
栓
栖
栗
校
栩
株
样
核
根
格
栽
栾
桀
桁
桂
桃
桅
框
案
桉
桌
桎
桐
桑
桓
桔
桜
桠
桡
桢
档
桥
桦
桧
桨
桩
桶
桿
梁
梅
梆
梏
梓
梗
條
梟
梢
梦
梧
梨
梭
梯
械
梳
梵
梶
检
棂
棄
棉
棋
棍
棒
棕
棗
棘
棚
棟
棠
棣
棧
森
棱
棲
棵
棹
棺
椁
椅
椋
植
椎
椒
検
椪
椭
椰
椹
椽
椿
楂
楊
楓
楔
楚
楝
楞
楠
楣
楨
楫
業
楮
極
楷
楸
楹
楼
楽
概
榄
榆
榈
榉
榔
榕
榖
榛
榜
榨
榫
榭
榮
榱
榴
榷
榻
槁
槃
構
槌
槍
槎
槐
槓
様
槛
槟
槤
槭
槲
槳
槻
槽
槿
樁
樂
樊
樑
樓
標
樞
樟
模
樣
権
横
樫
樯
樱
樵
樸
樹
樺
樽
樾
橄
橇
橋
橐
橘
橙
機
橡
橢
橫
橱
橹
橼
檀
檄
檎
檐
檔
檗
檜
檢
檬
檯
檳
檸
檻
櫃
櫚
櫛
櫥
櫸
櫻
欄
權
欒
欖
欠
次
欢
欣
欧
欲
欸
欺
欽
款
歆
歇
歉
歌
歎
歐
歓
歙
歛
歡
止
正
此
步
武
歧
歩
歪
歯
歲
歳
歴
歷
歸
歹
死
歼
殁
殃
殆
殇
殉
殊
残
殒
殓
殖
殘
殞
殡
殤
殭
殯
殲
殴
段
殷
殺
殼
殿
毀
毁
毂
毅
毆
毋
母
毎
每
毒
毓
比
毕
毗
毘
毙
毛
毡
毫
毯
毽
氈
氏
氐
民
氓
气
氖
気
氙
氛
氟
氡
氢
氣
氤
氦
氧
氨
氪
氫
氮
氯
氰
氲
水
氷
永
氹
氾
汀
汁
求
汆
汇
汉
汎
汐
汕
汗
汙
汛
汝
汞
江
池
污
汤
汨
汩
汪
汰
汲
汴
汶
汹
決
汽
汾
沁
沂
沃
沅
沈
沉
沌
沏
沐
沒
沓
沖
沙
沛
沟
没
沢
沣
沥
沦
沧
沪
沫
沭
沮
沱
河
沸
油
治
沼
沽
沾
沿
況
泄
泉
泊
泌
泓
法
泗
泛
泞
泠
泡
波
泣
泥
注
泪
泫
泮
泯
泰
泱
泳
泵
泷
泸
泻
泼
泽
泾
洁
洄
洋
洒
洗
洙
洛
洞
津
洩
洪
洮
洱
洲
洵
洶
洸
洹
活
洼
洽
派
流
浃
浄
浅
浆
浇
浊
测
济
浏
浑
浒
浓
浔
浙
浚
浜
浣
浦
浩
浪
浬
浮
浯
浴
海
浸
涂
涅
涇
消
涉
涌
涎
涓
涔
涕
涙
涛
涝
涞
涟
涠
涡
涣
涤
润
涧
涨
涩
涪
涮
涯
液
涵
涸
涼
涿
淀
淄
淅
淆
淇
淋
淌
淑
淒
淖
淘
淙
淚
淞
淡
淤
淦
淨
淩
淪
淫
淬
淮
深
淳
淵
混
淹
淺
添
淼
清
済
渉
渊
渋
渍
渎
渐
渔
渗
渙
渚
減
渝
渠
渡
渣
渤
渥
渦
温
測
渭
港
渲
渴
游
渺
渾
湃
湄
湊
湍
湖
湘
湛
湟
湧
湫
湮
湯
湳
湾
湿
満
溃
溅
溉
溏
源
準
溜
溝
溟
溢
溥
溧
溪
溫
溯
溱
溴
溶
溺
溼
滁
滂
滄
滅
滇
滋
滌
滑
滓
滔
滕
滙
滚
滝
滞
滟
满
滢
滤
滥
滦
滨
滩
滬
滯
滲
滴
滷
滸
滾
滿
漁
漂
漆
漉
漏
漓
演
漕
漠
漢
漣
漩
漪
漫
漬
漯
漱
漲
漳
漸
漾
漿
潆
潇
潋
潍
潑
潔
潘
潛
潜
潞
潟
潢
潤
潦
潧
潭
潮
潰
潴
潸
潺
潼
澀
澄
澆
澈
澍
澎
澗
澜
澡
澤
澧
澱
澳
澹
激
濁
濂
濃
濑
濒
濕
濘
濛
濟
濠
濡
濤
濫
濬
濮
濯
濱
濺
濾
瀅
瀆
瀉
瀋
瀏
瀑
瀕
瀘
瀚
瀛
瀝
瀞
瀟
瀧
瀨
瀬
瀰
瀾
灌
灏
灑
灘
灝
灞
灣
火
灬
灭
灯
灰
灵
灶
灸
灼
災
灾
灿
炀
炁
炅
炉
炊
炎
炒
炔
炕
炖
炙
炜
炫
炬
炭
炮
炯
炳
炷
炸
点
為
炼
炽
烁
烂
烃
烈
烊
烏
烘
烙
烛
烟
烤
烦
烧
烨
烩
烫
烬
热
烯
烷
烹
烽
焉
焊
焕
焖
焗
焘
焙
焚
焜
無
焦
焯
焰
焱
然
焼
煅
煉
煊
煌
煎
煒
煖
煙
煜
煞
煤
煥
煦
照
煨
煩
煮
煲
煸
煽
熄
熊
熏
熒
熔
熙
熟
熠
熨
熬
熱
熵
熹
熾
燁
燃
燄
燈
燉
燊
燎
燒
燔
燕
燙
燜
營
燥
燦
燧
燭
燮
燴
燻
燼
燿
爆
爍
爐
爛
爪
爬
爭
爰
爱
爲
爵
父
爷
爸
爹
爺
爻
爽
爾
牆
片
版
牌
牍
牒
牙
牛
牝
牟
牠
牡
牢
牦
牧
物
牯
牲
牴
牵
特
牺
牽
犀
犁
犄
犊
犍
犒
犢
犧
犬
犯
状
犷
犸
犹
狀
狂
狄
狈
狎
狐
狒
狗
狙
狞
狠
狡
狩
独
狭
狮
狰
狱
狸
狹
狼
狽
猎
猕
猖
猗
猙
猛
猜
猝
猥
猩
猪
猫
猬
献
猴
猶
猷
猾
猿
獄
獅
獎
獐
獒
獗
獠
獣
獨
獭
獰
獲
獵
獷
獸
獺
獻
獼
獾
玄
率
玉
王
玑
玖
玛
玟
玠
玥
玩
玫
玮
环
现
玲
玳
玷
玺
玻
珀
珂
珅
珈
珉
珊
珍
珏
珐
珑
珙
珞
珠
珣
珥
珩
珪
班
珮
珲
珺
現
球
琅
理
琇
琉
琊
琍
琏
琐
琛
琢
琥
琦
琨
琪
琬
琮
琰
琲
琳
琴
琵
琶
琺
琼
瑀
瑁
瑄
瑋
瑕
瑗
瑙
瑚
瑛
瑜
瑞
瑟
瑠
瑣
瑤
瑩
瑪
瑯
瑰
瑶
瑾
璀
璁
璃
璇
璉
璋
璎
璐
璜
璞
璟
璧
璨
環
璽
璿
瓊
瓏
瓒
瓜
瓢
瓣
瓤
瓦
瓮
瓯
瓴
瓶
瓷
甄
甌
甕
甘
甙
甚
甜
生
產
産
甥
甦
用
甩
甫
甬
甭
甯
田
由
甲
申
电
男
甸
町
画
甾
畀
畅
界
畏
畑
畔
留
畜
畝
畢
略
畦
番
畫
異
畲
畳
畴
當
畸
畹
畿
疆
疇
疊
疏
疑
疔
疖
疗
疙
疚
疝
疟
疡
疣
疤
疥
疫
疮
疯
疱
疲
疳
疵
疸
疹
疼
疽
疾
痂
病
症
痈
痉
痊
痍
痒
痔
痕
痘
痙
痛
痞
痠
痢
痣
痤
痧
痨
痪
痫
痰
痱
痴
痹
痺
痼
痿
瘀
瘁
瘋
瘍
瘓
瘘
瘙
瘟
瘠
瘡
瘢
瘤
瘦
瘧
瘩
瘪
瘫
瘴
瘸
瘾
療
癇
癌
癒
癖
癜
癞
癡
癢
癣
癥
癫
癬
癮
癱
癲
癸
発
登
發
白
百
皂
的
皆
皇
皈
皋
皎
皑
皓
皖
皙
皚
皮
皰
皱
皴
皺
皿
盂
盃
盅
盆
盈
益
盎
盏
盐
监
盒
盔
盖
盗
盘
盛
盜
盞
盟
盡
監
盤
盥
盧
盪
目
盯
盱
盲
直
相
盹
盼
盾
省
眈
眉
看
県
眙
眞
真
眠
眦
眨
眩
眯
眶
眷
眸
眺
眼
眾
着
睁
睇
睏
睐
睑
睛
睜
睞
睡
睢
督
睥
睦
睨
睪
睫
睬
睹
睽
睾
睿
瞄
瞅
瞇
瞋
瞌
瞎
瞑
瞒
瞓
瞞
瞟
瞠
瞥
瞧
瞩
瞪
瞬
瞭
瞰
瞳
瞻
瞼
瞿
矇
矍
矗
矚
矛
矜
矢
矣
知
矩
矫
短
矮
矯
石
矶
矽
矾
矿
码
砂
砌
砍
砒
研
砖
砗
砚
砝
砣
砥
砧
砭
砰
砲
破
砷
砸
砺
砼
砾
础
硅
硐
硒
硕
硝
硫
硬
确
硯
硼
碁
碇
碉
碌
碍
碎
碑
碓
碗
碘
碚
碛
碟
碣
碧
碩
碰
碱
碳
碴
確
碼
碾
磁
磅
磊
磋
磐
磕
磚
磡
磨
磬
磯
磲
磷
磺
礁
礎
礙
礡
礦
礪
礫
礴
示
礼
社
祀
祁
祂
祇
祈
祉
祎
祐
祕
祖
祗
祚
祛
祜
祝
神
祟
祠
祢
祥
票
祭
祯
祷
祸
祺
祿
禀
禁
禄
禅
禍
禎
福
禛
禦
禧
禪
禮
禱
禹
禺
离
禽
禾
禿
秀
私
秃
秆
秉
秋
种
科
秒
秘
租
秣
秤
秦
秧
秩
秭
积
称
秸
移
秽
稀
稅
程
稍
税
稔
稗
稚
稜
稞
稟
稠
稣
種
稱
稲
稳
稷
稹
稻
稼
稽
稿
穀
穂
穆
穌
積
穎
穗
穢
穩
穫
穴
究
穷
穹
空
穿
突
窃
窄
窈
窍
窑
窒
窓
窕
窖
窗
窘
窜
窝
窟
窠
窥
窦
窨
窩
窪
窮
窯
窺
窿
竄
竅
竇
竊
立
竖
站
竜
竞
竟
章
竣
童
竭
端
競
竹
竺
竽
竿
笃
笆
笈
笋
笏
笑
笔
笙
笛
笞
笠
符
笨
第
笹
笺
笼
筆
等
筊
筋
筍
筏
筐
筑
筒
答
策
筛
筝
筠
筱
筲
筵
筷
筹
签
简
箇
箋
箍
箏
箐
箔
箕
算
箝
管
箩
箫
箭
箱
箴
箸
節
篁
範
篆
篇
築
篑
篓
篙
篝
篠
篡
篤
篩
篪
篮
篱
篷
簇
簌
簍
簡
簦
簧
簪
簫
簷
簸
簽
簾
簿
籁
籃
籌
籍
籐
籟
籠
籤
籬
籮
籲
米
类
籼
籽
粄
粉
粑
粒
粕
粗
粘
粟
粤
粥
粧
粪
粮
粱
粲
粳
粵
粹
粼
粽
精
粿
糅
糊
糍
糕
糖
糗
糙
糜
糞
糟
糠
糧
糬
糯
糰
糸
系
糾
紀
紂
約
紅
紉
紊
紋
納
紐
紓
純
紗
紘
紙
級
紛
紜
素
紡
索
紧
紫
紮
累
細
紳
紹
紺
終
絃
組
絆
経
結
絕
絞
絡
絢
給
絨
絮
統
絲
絳
絵
絶
絹
綁
綏
綑
經
継
続
綜
綠
綢
綦
綫
綬
維
綱
網
綴
綵
綸
綺
綻
綽
綾
綿
緊
緋
総
緑
緒
緘
線
緝
緞
締
緣
編
緩
緬
緯
練
緹
緻
縁
縄
縈
縛
縝
縣
縫
縮
縱
縴
縷
總
績
繁
繃
繆
繇
繋
織
繕
繚
繞
繡
繩
繪
繫
繭
繳
繹
繼
繽
纂
續
纍
纏
纓
纔
纖
纜
纠
红
纣
纤
约
级
纨
纪
纫
纬
纭
纯
纰
纱
纲
纳
纵
纶
纷
纸
纹
纺
纽
纾
线
绀
练
组
绅
细
织
终
绊
绍
绎
经
绑
绒
结
绔
绕
绘
给
绚
绛
络
绝
绞
统
绡
绢
绣
绥
绦
继
绩
绪
绫
续
绮
绯
绰
绳
维
绵
绶
绷
绸
绻
综
绽
绾
绿
缀
缄
缅
缆
缇
缈
缉
缎
缓
缔
缕
编
缘
缙
缚
缜
缝
缠
缢
缤
缥
缨
缩
缪
缭
缮
缰
缱
缴
缸
缺
缽
罂
罄
罌
罐
网
罔
罕
罗
罚
罡
罢
罩
罪
置
罰
署
罵
罷
罹
羁
羅
羈
羊
羌
美
羔
羚
羞
羟
羡
羣
群
羥
羧
羨
義
羯
羲
羸
羹
羽
羿
翁
翅
翊
翌
翎
習
翔
翘
翟
翠
翡
翦
翩
翰
翱
翳
翹
翻
翼
耀
老
考
耄
者
耆
耋
而
耍
耐
耒
耕
耗
耘
耙
耦
耨
耳
耶
耷
耸
耻
耽
耿
聂
聆
聊
聋
职
聒
联
聖
聘
聚
聞
聪
聯
聰
聲
聳
聴
聶
職
聽
聾
聿
肃
肄
肅
肆
肇
肉
肋
肌
肏
肓
肖
肘
肚
肛
肝
肠
股
肢
肤
肥
肩
肪
肮
肯
肱
育
肴
肺
肽
肾
肿
胀
胁
胃
胄
胆
背
胍
胎
胖
胚
胛
胜
胝
胞
胡
胤
胥
胧
胫
胭
胯
胰
胱
胳
胴
胶
胸
胺
能
脂
脅
脆
脇
脈
脉
脊
脍
脏
脐
脑
脓
脖
脘
脚
脛
脣
脩
脫
脯
脱
脲
脳
脸
脹
脾
腆
腈
腊
腋
腌
腎
腐
腑
腓
腔
腕
腥
腦
腩
腫
腭
腮
腰
腱
腳
腴
腸
腹
腺
腻
腼
腾
腿
膀
膈
膊
膏
膑
膘
膚
膛
膜
膝
膠
膦
膨
膩
膳
膺
膻
膽
膾
膿
臀
臂
臃
臆
臉
臊
臍
臓
臘
臟
臣
臥
臧
臨
自
臬
臭
至
致
臺
臻
臼
臾
舀
舂
舅
舆
與
興
舉
舊
舌
舍
舎
舐
舒
舔
舖
舗
舛
舜
舞
舟
航
舫
般
舰
舱
舵
舶
舷
舸
船
舺
舾
艇
艋
艘
艙
艦
艮
良
艰
艱
色
艳
艷
艹
艺
艾
节
芃
芈
芊
芋
芍
芎
芒
芙
芜
芝
芡
芥
芦
芩
芪
芫
芬
芭
芮
芯
花
芳
芷
芸
芹
芻
芽
芾
苁
苄
苇
苋
苍
苏
苑
苒
苓
苔
苕
苗
苛
苜
苞
苟
苡
苣
若
苦
苫
苯
英
苷
苹
苻
茁
茂
范
茄
茅
茉
茎
茏
茗
茜
茧
茨
茫
茬
茭
茯
茱
茲
茴
茵
茶
茸
茹
茼
荀
荃
荆
草
荊
荏
荐
荒
荔
荖
荘
荚
荞
荟
荠
荡
荣
荤
荥
荧
荨
荪
荫
药
荳
荷
荸
荻
荼
荽
莅
莆
莉
莊
莎
莒
莓
莖
莘
莞
莠
莢
莧
莪
莫
莱
莲
莴
获
莹
莺
莽
莿
菀
菁
菅
菇
菈
菊
菌
菏
菓
菖
菘
菜
菟
菠
菡
菩
華
菱
菲
菸
菽
萁
萃
萄
萊
萋
萌
萍
萎
萘
萝
萤
营
萦
萧
萨
萩
萬
萱
萵
萸
萼
落
葆
葉
著
葚
葛
葡
董
葦
葩
葫
葬
葭
葯
葱
葳
葵
葷
葺
蒂
蒋
蒐
蒔
蒙
蒜
蒞
蒟
蒡
蒨
蒲
蒸
蒹
蒻
蒼
蒿
蓁
蓄
蓆
蓉
蓋
蓑
蓓
蓖
蓝
蓟
蓦
蓬
蓮
蓼
蓿
蔑
蔓
蔔
蔗
蔘
蔚
蔡
蔣
蔥
蔫
蔬
蔭
蔵
蔷
蔺
蔻
蔼
蔽
蕁
蕃
蕈
蕉
蕊
蕎
蕙
蕤
蕨
蕩
蕪
蕭
蕲
蕴
蕻
蕾
薄
薅
薇
薈
薊
薏
薑
薔
薙
薛
薦
薨
薩
薪
薬
薯
薰
薹
藉
藍
藏
藐
藓
藕
藜
藝
藤
藥
藩
藹
藻
藿
蘆
蘇
蘊
蘋
蘑
蘚
蘭
蘸
蘼
蘿
虎
虏
虐
虑
虔
處
虚
虛
虜
虞
號
虢
虧
虫
虬
虱
虹
虻
虽
虾
蚀
蚁
蚂
蚊
蚌
蚓
蚕
蚜
蚝
蚣
蚤
蚩
蚪
蚯
蚱
蚵
蛀
蛆
蛇
蛊
蛋
蛎
蛐
蛔
蛙
蛛
蛟
蛤
蛭
蛮
蛰
蛳
蛹
蛻
蛾
蜀
蜂
蜃
蜆
蜇
蜈
蜊
蜍
蜒
蜓
蜕
蜗
蜘
蜚
蜜
蜡
蜢
蜥
蜱
蜴
蜷
蜻
蜿
蝇
蝈
蝉
蝌
蝎
蝕
蝗
蝙
蝟
蝠
蝦
蝨
蝴
蝶
蝸
蝼
螂
螃
融
螞
螢
螨
螯
螳
螺
蟀
蟄
蟆
蟋
蟎
蟑
蟒
蟠
蟬
蟲
蟹
蟻
蟾
蠅
蠍
蠔
蠕
蠛
蠟
蠡
蠢
蠣
蠱
蠶
蠹
蠻
血
衄
衅
衆
行
衍
術
衔
街
衙
衛
衝
衞
衡
衢
衣
补
表
衩
衫
衬
衮
衰
衲
衷
衹
衾
衿
袁
袂
袄
袅
袈
袋
袍
袒
袖
袜
袞
袤
袪
被
袭
袱
裁
裂
装
裆
裊
裏
裔
裕
裘
裙
補
裝
裟
裡
裤
裨
裱
裳
裴
裸
裹
製
裾
褂
複
褐
褒
褓
褔
褚
褥
褪
褫
褲
褶
褻
襁
襄
襟
襠
襪
襬
襯
襲
西
要
覃
覆
覇
見
規
覓
視
覚
覦
覧
親
覬
観
覷
覺
覽
觀
见
观
规
觅
视
览
觉
觊
觎
觐
觑
角
觞
解
觥
触
觸
言
訂
計
訊
討
訓
訕
訖
託
記
訛
訝
訟
訣
訥
訪
設
許
訳
訴
訶
診
註
証
詆
詐
詔
評
詛
詞
詠
詡
詢
詣
試
詩
詫
詬
詭
詮
詰
話
該
詳
詹
詼
誅
誇
誉
誌
認
誓
誕
誘
語
誠
誡
誣
誤
誥
誦
誨
說
説
読
誰
課
誹
誼
調
諄
談
請
諏
諒
論
諗
諜
諡
諦
諧
諫
諭
諮
諱
諳
諷
諸
諺
諾
謀
謁
謂
謄
謊
謎
謐
謔
謗
謙
講
謝
謠
謨
謬
謹
謾
譁
證
譎
譏
識
譙
譚
譜
警
譬
譯
議
譲
譴
護
譽
讀
變
讓
讚
讞
计
订
认
讥
讧
讨
让
讪
讫
训
议
讯
记
讲
讳
讴
讶
讷
许
讹
论
讼
讽
设
访
诀
证
诃
评
诅
识
诈
诉
诊
诋
词
诏
译
试
诗
诘
诙
诚
诛
话
诞
诟
诠
诡
询
诣
诤
该
详
诧
诩
诫
诬
语
误
诰
诱
诲
说
诵
诶
请
诸
诺
读
诽
课
诿
谀
谁
调
谄
谅
谆
谈
谊
谋
谌
谍
谎
谏
谐
谑
谒
谓
谔
谕
谗
谘
谙
谚
谛
谜
谟
谢
谣
谤
谥
谦
谧
谨
谩
谪
谬
谭
谯
谱
谲
谴
谶
谷
豁
豆
豇
豈
豉
豊
豌
豎
豐
豔
豚
象
豢
豪
豫
豬
豹
豺
貂
貅
貌
貓
貔
貘
貝
貞
負
財
貢
貧
貨
販
貪
貫
責
貯
貰
貳
貴
貶
買
貸
費
貼
貽
貿
賀
賁
賂
賃
賄
資
賈
賊
賑
賓
賜
賞
賠
賡
賢
賣
賤
賦
質
賬
賭
賴
賺
購
賽
贅
贈
贊
贍
贏
贓
贖
贛
贝
贞
负
贡
财
责
贤
败
账
货
质
贩
贪
贫
贬
购
贮
贯
贰
贱
贲
贴
贵
贷
贸
费
贺
贻
贼
贾
贿
赁
赂
赃
资
赅
赈
赊
赋
赌
赎
赏
赐
赓
赔
赖
赘
赚
赛
赝
赞
赠
赡
赢
赣
赤
赦
赧
赫
赭
走
赳
赴
赵
赶
起
趁
超
越
趋
趕
趙
趟
趣
趨
足
趴
趵
趸
趺
趾
跃
跄
跆
跋
跌
跎
跑
跖
跚
跛
距
跟
跡
跤
跨
跩
跪
路
跳
践
跷
跹
跺
跻
踉
踊
踌
踏
踐
踝
踞
踟
踢
踩
踪
踮
踱
踴
踵
踹
蹂
蹄
蹇
蹈
蹉
蹊
蹋
蹑
蹒
蹙
蹟
蹣
蹤
蹦
蹩
蹬
蹭
蹲
蹴
蹶
蹺
蹼
蹿
躁
躇
躉
躊
躋
躍
躏
躪
身
躬
躯
躲
躺
軀
車
軋
軌
軍
軒
軟
転
軸
軼
軽
軾
較
載
輒
輓
輔
輕
輛
輝
輟
輩
輪
輯
輸
輻
輾
輿
轄
轅
轆
轉
轍
轎
轟
车
轧
轨
轩
转
轭
轮
软
轰
轲
轴
轶
轻
轼
载
轿
较
辄
辅
辆
辇
辈
辉
辊
辍
辐
辑
输
辕
辖
辗
辘
辙
辛
辜
辞
辟
辣
辦
辨
辩
辫
辭
辮
辯
辰
辱
農
边
辺
辻
込
辽
达
迁
迂
迄
迅
过
迈
迎
运
近
返
还
这
进
远
违
连
迟
迢
迤
迥
迦
迩
迪
迫
迭
述
迴
迷
迸
迹
迺
追
退
送
适
逃
逅
逆
选
逊
逍
透
逐
递
途
逕
逗
這
通
逛
逝
逞
速
造
逢
連
逮
週
進
逵
逶
逸
逻
逼
逾
遁
遂
遅
遇
遊
運
遍
過
遏
遐
遑
遒
道
達
違
遗
遙
遛
遜
遞
遠
遢
遣
遥
遨
適
遭
遮
遲
遴
遵
遶
遷
選
遺
遼
遽
避
邀
邁
邂
邃
還
邇
邈
邊
邋
邏
邑
邓
邕
邛
邝
邢
那
邦
邨
邪
邬
邮
邯
邰
邱
邳
邵
邸
邹
邺
邻
郁
郅
郊
郎
郑
郜
郝
郡
郢
郤
郦
郧
部
郫
郭
郴
郵
郷
郸
都
鄂
鄉
鄒
鄔
鄙
鄞
鄢
鄧
鄭
鄰
鄱
鄲
鄺
酉
酊
酋
酌
配
酐
酒
酗
酚
酝
酢
酣
酥
酩
酪
酬
酮
酯
酰
酱
酵
酶
酷
酸
酿
醃
醇
醉
醋
醍
醐
醒
醚
醛
醜
醞
醣
醪
醫
醬
醮
醯
醴
醺
釀
釁
采
釉
释
釋
里
重
野
量
釐
金
釗
釘
釜
針
釣
釦
釧
釵
鈀
鈉
鈍
鈎
鈔
鈕
鈞
鈣
鈦
鈪
鈴
鈺
鈾
鉀
鉄
鉅
鉉
鉑
鉗
鉚
鉛
鉤
鉴
鉻
銀
銃
銅
銑
銓
銖
銘
銜
銬
銭
銮
銳
銷
銹
鋁
鋅
鋒
鋤
鋪
鋰
鋸
鋼
錄
錐
錘
錚
錠
錢
錦
錨
錫
錮
錯
録
錳
錶
鍊
鍋
鍍
鍛
鍥
鍰
鍵
鍺
鍾
鎂
鎊
鎌
鎏
鎔
鎖
鎗
鎚
鎧
鎬
鎮
鎳
鏈
鏖
鏗
鏘
鏞
鏟
鏡
鏢
鏤
鏽
鐘
鐮
鐲
鐳
鐵
鐸
鐺
鑄
鑊
鑑
鑒
鑣
鑫
鑰
鑲
鑼
鑽
鑾
鑿
针
钉
钊
钎
钏
钒
钓
钗
钙
钛
钜
钝
钞
钟
钠
钡
钢
钣
钤
钥
钦
钧
钨
钩
钮
钯
钰
钱
钳
钴
钵
钺
钻
钼
钾
钿
铀
铁
铂
铃
铄
铅
铆
铉
铎
铐
铛
铜
铝
铠
铡
铢
铣
铤
铨
铩
铬
铭
铮
铰
铲
铵
银
铸
铺
链
铿
销
锁
锂
锄
锅
锆
锈
锉
锋
锌
锏
锐
锑
错
锚
锟
锡
锢
锣
锤
锥
锦
锭
键
锯
锰
锲
锵
锹
锺
锻
镀
镁
镂
镇
镉
镌
镍
镐
镑
镕
镖
镗
镛
镜
镣
镭
镯
镰
镳
镶
長
长
門
閃
閉
開
閎
閏
閑
閒
間
閔
閘
閡
関
閣
閥
閨
閩
閱
閲
閹
閻
閾
闆
闇
闊
闌
闍
闔
闕
闖
闘
關
闡
闢
门
闪
闫
闭
问
闯
闰
闲
间
闵
闷
闸
闹
闺
闻
闽
闾
阀
阁
阂
阅
阆
阇
阈
阉
阎
阐
阑
阔
阕
阖
阙
阚
阜
队
阡
阪
阮
阱
防
阳
阴
阵
阶
阻
阿
陀
陂
附
际
陆
陇
陈
陋
陌
降
限
陕
陛
陝
陞
陟
陡
院
陣
除
陨
险
陪
陰
陲
陳
陵
陶
陷
陸
険
陽
隅
隆
隈
隊
隋
隍
階
随
隐
隔
隕
隘
隙
際
障
隠
隣
隧
隨
險
隱
隴
隶
隸
隻
隼
隽
难
雀
雁
雄
雅
集
雇
雉
雋
雌
雍
雎
雏
雑
雒
雕
雖
雙
雛
雜
雞
離
難
雨
雪
雯
雰
雲
雳
零
雷
雹
電
雾
需
霁
霄
霆
震
霈
霉
霊
霍
霎
霏
霑
霓
霖
霜
霞
霧
霭
霰
露
霸
霹
霽
霾
靂
靄
靈
青
靓
靖
静
靚
靛
靜
非
靠
靡
面
靥
靦
革
靳
靴
靶
靼
鞅
鞋
鞍
鞏
鞑
鞘
鞠
鞣
鞦
鞭
韆
韋
韌
韓
韜
韦
韧
韩
韬
韭
音
韵
韶
韻
響
頁
頂
頃
項
順
須
頌
預
頑
頒
頓
頗
領
頜
頡
頤
頫
頭
頰
頷
頸
頹
頻
頼
顆
題
額
顎
顏
顔
願
顛
類
顧
顫
顯
顱
顴
页
顶
顷
项
顺
须
顼
顽
顾
顿
颁
颂
预
颅
领
颇
颈
颉
颊
颌
颍
颐
频
颓
颔
颖
颗
题
颚
颛
颜
额
颞
颠
颡
颢
颤
颦
颧
風
颯
颱
颳
颶
颼
飄
飆
风
飒
飓
飕
飘
飙
飚
飛
飞
食
飢
飨
飩
飪
飯
飲
飼
飽
飾
餃
餅
餉
養
餌
餐
餒
餓
餘
餚
餛
餞
餡
館
餮
餵
餾
饅
饈
饋
饌
饍
饑
饒
饕
饗
饞
饥
饨
饪
饬
饭
饮
饯
饰
饱
饲
饴
饵
饶
饷
饺
饼
饽
饿
馀
馁
馄
馅
馆
馈
馋
馍
馏
馒
馔
首
馗
香
馥
馨
馬
馭
馮
馳
馴
駁
駄
駅
駆
駐
駒
駕
駛
駝
駭
駱
駿
騁
騎
騏
験
騙
騨
騰
騷
驀
驅
驊
驍
驒
驕
驗
驚
驛
驟
驢
驥
马
驭
驮
驯
驰
驱
驳
驴
驶
驷
驸
驹
驻
驼
驾
驿
骁
骂
骄
骅
骆
骇
骈
骊
骋
验
骏
骐
骑
骗
骚
骛
骜
骞
骠
骡
骤
骥
骧
骨
骯
骰
骶
骷
骸
骼
髂
髅
髋
髏
髒
髓
體
髖
高
髦
髪
髮
髯
髻
鬃
鬆
鬍
鬓
鬚
鬟
鬢
鬣
鬥
鬧
鬱
鬼
魁
魂
魄
魅
魇
魍
魏
魔
魘
魚
魯
魷
鮑
鮨
鮪
鮭
鮮
鯉
鯊
鯖
鯛
鯨
鯰
鯽
鰍
鰓
鰭
鰲
鰻
鰾
鱈
鱉
鱔
鱗
鱷
鱸
鱼
鱿
鲁
鲈
鲍
鲑
鲛
鲜
鲟
鲢
鲤
鲨
鲫
鲱
鲲
鲶
鲷
鲸
鳃
鳄
鳅
鳌
鳍
鳕
鳖
鳗
鳝
鳞
鳥
鳩
鳳
鳴
鳶
鴉
鴕
鴛
鴦
鴨
鴻
鴿
鵑
鵜
鵝
鵡
鵬
鵰
鵲
鶘
鶩
鶯
鶴
鷗
鷲
鷹
鷺
鸚
鸞
鸟
鸠
鸡
鸢
鸣
鸥
鸦
鸨
鸪
鸭
鸯
鸳
鸵
鸽
鸾
鸿
鹂
鹃
鹄
鹅
鹈
鹉
鹊
鹌
鹏
鹑
鹕
鹘
鹜
鹞
鹤
鹦
鹧
鹫
鹭
鹰
鹳
鹵
鹹
鹼
鹽
鹿
麂
麋
麒
麓
麗
麝
麟
麥
麦
麩
麴
麵
麸
麺
麻
麼
麽
麾
黃
黄
黍
黎
黏
黑
黒
黔
默
黛
黜
黝
點
黠
黨
黯
黴
鼋
鼎
鼐
鼓
鼠
鼬
鼹
鼻
鼾
齁
齊
齋
齐
齒
齡
齢
齣
齦
齿
龄
龅
龈
龊
龋
龌
龍
龐
龔
龕
龙
龚
龛
龜
龟
︰
︱
︶
︿
﹁
﹂
﹍
﹏
﹐
﹑
﹒
﹔
﹕
﹖
﹗
﹙
﹚
﹝
﹞
﹡
﹣
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
。
「
」
、
・
ッ
ー
イ
ク
シ
ス
ト
ノ
フ
ラ
ル
ン
゙
゚
 ̄
¥
👍
🔥
😂
😎
...
yam
10
2017
12
11
2016
20
30
15
06
lofter
##s
2015
by
16
14
18
13
24
17
2014
21
##0
22
19
25
23
com
100
00
05
2013
##a
03
09
08
28
##2
50
01
04
##1
27
02
2012
##3
26
##e
07
##8
##5
##6
##4
##9
##7
29
2011
40
##t
2010
##o
##d
##i
2009
##n
app
www
the
##m
31
##c
##l
##y
##r
##g
2008
60
http
200
qq
##p
80
##f
google
pixnet
90
cookies
tripadvisor
500
##er
##k
35
##h
facebook
2007
2000
70
##b
of
##x
##u
45
300
iphone
32
1000
2006
48
ip
36
in
38
3d
##w
##ing
55
ctrip
##on
##v
33
##の
to
34
400
id
2005
it
37
windows
llc
top
99
42
39
000
led
at
##an
41
51
52
46
49
43
53
44
##z
android
58
and
59
2004
56
vr
##か
5000
2003
47
blogthis
twitter
54
##le
150
ok
2018
57
75
cn
no
ios
##in
##mm
##00
800
on
te
3000
65
2001
360
95
ig
lv
120
##ng
##を
##us
##に
pc
てす
──
600
##te
85
2002
88
##ed
html
ncc
wifi
email
64
blog
is
##10
##て
mail
online
##al
dvd
##ic
studio
##は
##℃
##ia
##と
line
vip
72
##q
98
##ce
##en
for
##is
##ra
##es
##j
usb
net
cp
1999
asia
4g
##cm
diy
new
3c
##お
ta
66
language
vs
apple
tw
86
web
##ne
ipad
62
you
##re
101
68
##tion
ps
de
bt
pony
atm
##2017
1998
67
##ch
ceo
##or
go
##na
av
pro
cafe
96
pinterest
97
63
pixstyleme3c
##ta
more
said
##2016
1997
mp3
700
##ll
nba
jun
##20
92
tv
1995
pm
61
76
nbsp
250
##ie
linux
##ma
cd
110
hd
##17
78
##ion
77
6000
am
##th
##st
94
##se
##et
69
180
gdp
my
105
81
abc
89
flash
79
one
93
1990
1996
##ck
gps
##も
##ly
web885
106
2020
91
##ge
4000
1500
xd
boss
isbn
1994
org
##ry
me
love
##11
0fork
73
##12
3g
##ter
##ar
71
82
##la
hotel
130
1970
pk
83
87
140
ie
##os
##30
##el
74
##50
seo
cpu
##ml
p2p
84
may
##る
sun
tue
internet
cc
posted
youtube
##at
##ン
##man
ii
##ル
##15
abs
nt
pdf
yahoo
ago
1980
##it
news
mac
104
##てす
##me
##り
java
1992
spa
##de
##nt
hk
all
plus
la
1993
##mb
##16
##ve
west
##da
160
air
##い
##ps
から
##to
1989
logo
htc
php
https
fi
momo
##son
sat
##ke
##80
ebd
suv
wi
day
apk
##88
##um
mv
galaxy
wiki
or
brake
##ス
1200
する
this
1991
mon
##こ
❤2017
po
##ない
javascript
life
home
june
##ss
system
900
##ー
##0
pp
1988
world
fb
4k
br
##as
ic
ai
leonardo
safari
##60
live
free
xx
wed
win7
kiehl
##co
lg
o2o
##go
us
235
1949
mm
しい
vfm
kanye
##90
##2015
##id
jr
##ey
123
rss
##sa
##ro
##am
##no
thu
fri
350
##sh
##ki
103
comments
name
##のて
##pe
##ine
max
1987
8000
uber
##mi
##ton
wordpress
office
1986
1985
##ment
107
bd
win10
##ld
##li
gmail
bb
dior
##rs
##ri
##rd
##ます
up
cad
##®
dr
して
read
##21
をお
##io
##99
url
1984
pvc
paypal
show
policy
##40
##ty
##18
with
##★
##01
txt
102
##ba
dna
from
post
mini
ar
taiwan
john
##ga
privacy
agoda
##13
##ny
word
##24
##22
##by
##ur
##hz
1982
##ang
265
cookie
netscape
108
##ka
##~
##ad
house
share
note
ibm
code
hello
nike
sim
survey
##016
1979
1950
wikia
##32
##017
5g
cbc
##tor
##kg
1983
##rt
##14
campaign
store
2500
os
##ct
##ts
##°
170
api
##ns
365
excel
##な
##ao
##ら
##し
~~
##nd
university
163
には
518
##70
##ya
##il
##25
pierre
ipo
0020
897
##23
hotels
##ian
のお
125
years
6606
##ers
##26
high
##day
time
##ay
bug
##line
##く
##す
##be
xp
talk2yam
yamservice
10000
coco
##dy
sony
##ies
1978
microsoft
david
people
##ha
1960
instagram
intel
その
##ot
iso
1981
##va
115
##mo
##land
xxx
man
co
ltxsw
##ation
baby
220
##pa
##ol
1945
7000
tag
450
##ue
msn
##31
oppo
##ト
##ca
control
##om
st
chrome
##ure
##ん
be
##き
lol
##19
した
##bo
240
lady
##100
##way
##から
4600
##ko
##do
##un
4s
corporation
168
##ni
herme
##28
cp
978
##up
##06
ui
##ds
ppt
admin
three
します
bbc
re
128
##48
ca
##015
##35
hp
##ee
tpp
##た
##ive
××
root
##cc
##ました
##ble
##ity
adobe
park
114
et
oled
city
##ex
##ler
##ap
china
##book
20000
view
##ice
global
##km
your
hong
##mg
out
##ms
ng
ebay
##29
menu
ubuntu
##cy
rom
##view
open
ktv
do
server
##lo
if
english
##ね
##5
##oo
1600
##02
step1
kong
club
135
july
inc
1976
mr
hi
##net
touch
##ls
##ii
michael
lcd
##05
##33
phone
james
step2
1300
ios9
##box
dc
##2
##ley
samsung
111
280
pokemon
css
##ent
##les
いいえ
##1
s8
atom
play
bmw
##said
sa
etf
ctrl
♥yoyo♥
##55
2025
##2014
##66
adidas
amazon
1958
##ber
##ner
visa
##77
##der
1800
connectivity
##hi
firefox
109
118
hr
so
style
mark
pop
ol
skip
1975
as
##27
##ir
##61
190
mba
##う
##ai
le
##ver
1900
cafe2017
lte
super
113
129
##ron
amd
like
##☆
are
##ster
we
##sk
paul
data
international
##ft
longchamp
ssd
good
##ート
##ti
reply
##my
↓↓↓
apr
star
##ker
source
136
js
112
get
force
photo
##one
126
##2013
##ow
link
bbs
1972
goods
##lin
python
119
##ip
game
##ics
##ません
blue
##●
520
##45
page
itunes
##03
1955
260
1968
gt
gif
618
##ff
##47
group
くたさい
about
bar
ganji
##nce
music
lee
not
1977
1971
1973
##per
an
faq
comment
##って
days
##ock
116
##bs
1974
1969
v1
player
1956
xbox
sql
fm
f1
139
##ah
210
##lv
##mp
##000
melody
1957
##3
550
17life
199
1966
xml
market
##au
##71
999
##04
what
gl
##95
##age
tips
##68
book
##ting
mysql
can
1959
230
##ung
wonderland
watch
10℃
##ction
9000
mar
mobile
1946
1962
article
##db
part
▲top
party
って
1967
1964
1948
##07
##ore
##op
この
dj
##78
##38
010
main
225
1965
##ong
art
320
ad
134
020
##73
117
pm2
japan
228
##08
ts
1963
##ica
der
sm
##36
2019
##wa
ct
##7
##や
##64
1937
homemesh
search
##85
##れは
##tv
##di
macbook
##9
##くたさい
service
##♥
type
った
750
##ier
##si
##75
##います
##ok
best
##ット
goris
lock
##った
cf
3m
big
##ut
ftp
carol
##vi
10
1961
happy
sd
##ac
122
anti
pe
cnn
iii
1920
138
##ラ
1940
esp
jan
tags
##98
##51
august
vol
##86
154
##™
##fs
##れ
##sion
design
ac
##ム
press
jordan
ppp
that
key
check
##6
##tt
##㎡
1080p
##lt
power
##42
1952
##bc
vivi
##ック
he
133
121
jpg
##rry
201
175
3500
1947
nb
##ted
##rn
しています
1954
usd
##t00
master
##ンク
001
model
##58
al
##09
1953
##34
ram
goo
ても
##ui
127
1930
red
##ary
rpg
item
##pm
##41
270
##za
project
##2012
hot
td
blogabstract
##ger
##62
650
##44
gr2
##します
##m
black
electronic
nfc
year
asus
また
html5
cindy
##hd
m3
132
esc
##od
booking
##53
fed
tvb
##81
##ina
mit
165
##いる
chan
192
distribution
next
になる
peter
bios
steam
cm
1941
にも
pk10
##ix
##65
##91
dec
nasa
##ana
icecat
00z
b1
will
##46
li
se
##ji
##み
##ard
oct
##ain
jp
##ze
##bi
cio
##56
smart
h5
##39
##port
curve
vpn
##nm
##dia
utc
##あり
12345678910
##52
rmvb
chanel
a4
miss
##and
##im
media
who
##63
she
girl
5s
124
vera
##して
class
vivo
king
##フ
##ei
national
ab
1951
5cm
888
145
ipod
ap
1100
5mm
211
ms
2756
##69
mp4
msci
##po
##89
131
mg
index
380
##bit
##out
##zz
##97
##67
158
apec
##8
photoshop
opec
¥799
ては
##96
##tes
##ast
2g
○○
##ール
¥2899
##ling
##よ
##ory
1938
##ical
kitty
content
##43
step3
##cn
win8
155
vc
1400
iphone7
robert
##した
tcl
137
beauty
##87
en
dollars
##ys
##oc
step
pay
yy
a1
##2011
##lly
##ks
##♪
1939
188
download
1944
sep
exe
ph
います
school
gb
center
pr
street
##board
uv
##37
##lan
winrar
##que
##ua
##com
1942
1936
480
gpu
##4
ettoday
fu
tom
##54
##ren
##via
149
##72
b2b
144
##79
##tch
rose
arm
mb
##49
##ial
##nn
nvidia
step4
mvp
00㎡
york
156
##イ
how
cpi
591
2765
gov
kg
joe
##xx
mandy
pa
##ser
copyright
fashion
1935
don
##け
ecu
##ist
##art
erp
wap
have
##lm
talk
##ek
##ning
##if
ch
##ite
video
1943
cs
san
iot
look
##84
##2010
##ku
october
##ux
trump
##hs
##ide
box
141
first
##ins
april
##ight
##83
185
angel
protected
aa
151
162
x1
m2
##fe
##×
##ho
size
143
min
ofo
fun
gomaji
ex
hdmi
food
dns
march
chris
kevin
##のか
##lla
##pp
##ec
ag
ems
6s
720p
##rm
##ham
off
##92
asp
team
fandom
ed
299
▌♥
##ell
info
されています
##82
sina
4066
161
##able
##ctor
330
399
315
dll
rights
ltd
idc
jul
3kg
1927
142
ma
surface
##76
##ク
~~~
304
mall
eps
146
green
##59
map
space
donald
v2
sodu
##light
1931
148
1700
まて
310
reserved
htm
##han
##57
2d
178
mod
##ise
##tions
152
ti
##shi
doc
1933
icp
055
wang
##ram
shopping
aug
##pi
##well
now
wam
b2
からお
##hu
236
1928
##gb
266
f2
##93
153
mix
##ef
##uan
bwl
##plus
##res
core
##ess
tea
5℃
hktvmall
nhk
##ate
list
##ese
301
feb
4m
inn
ての
nov
159
12345
daniel
##ci
pass
##bet
##nk
coffee
202
ssl
airbnb
##ute
fbi
woshipm
skype
ea
cg
sp
##fc
##www
yes
edge
alt
007
##94
fpga
##ght
##gs
iso9001
さい
##ile
##wood
##uo
image
lin
icon
american
##em
1932
set
says
##king
##tive
blogger
##74
なと
256
147
##ox
##zy
##red
##ium
##lf
nokia
claire
##リ
##ding
november
lohas
##500
##tic
##マ
##cs
##ある
##che
##ire
##gy
##ult
db
january
win
##カ
166
road
ptt
##ま
##つ
198
##fa
##mer
anna
pchome
はい
udn
ef
420
##time
##tte
2030
##ア
g20
white
かかります
1929
308
garden
eleven
di
##おります
chen
309b
777
172
young
cosplay
ちてない
4500
bat
##123
##tra
##ては
kindle
npc
steve
etc
##ern
##|
call
xperia
ces
travel
sk
s7
##ous
1934
##int
みいたたけます
183
edu
file
cho
qr
##car
##our
186
##ant
##d
eric
1914
rends
##jo
##する
mastercard
##2000
kb
##min
290
##ino
vista
##ris
##ud
jack
2400
##set
169
pos
1912
##her
##ou
taipei
しく
205
beta
##ませんか
232
##fi
express
255
body
##ill
aphojoy
user
december
meiki
##ick
tweet
richard
##av
##ᆫ
iphone6
##dd
ちてすか
views
##mark
321
pd
##00
times
##▲
level
##ash
10g
point
5l
##ome
208
koreanmall
##ak
george
q2
206
wma
tcp
##200
スタッフ
full
mlb
##lle
##watch
tm
run
179
911
smith
business
##und
1919
color
##tal
222
171
##less
moon
4399
##rl
update
pcb
shop
499
157
little
なし
end
##mhz
van
dsp
easy
660
##house
##key
history
##o
oh
##001
##hy
##web
oem
let
was
##2009
##gg
review
##wan
182
##°c
203
uc
title
##val
united
233
2021
##ons
doi
trivago
overdope
sbs
##ance
##ち
grand
special
573032185
imf
216
wx17house
##so
##ーム
audi
##he
london
william
##rp
##ake
science
beach
cfa
amp
ps4
880
##800
##link
##hp
crm
ferragamo
bell
make
##eng
195
under
zh
photos
2300
##style
##ント
via
176
da
##gi
company
i7
##ray
thomas
370
ufo
i5
##max
plc
ben
back
research
8g
173
mike
##pc
##ッフ
september
189
##ace
vps
february
167
pantos
wp
lisa
1921
★★
jquery
night
long
offer
##berg
##news
1911
##いて
ray
fks
wto
せます
over
164
340
##all
##rus
1924
##888
##works
blogtitle
loftpermalink
##→
187
martin
test
ling
km
##め
15000
fda
v3
##ja
##ロ
wedding
かある
outlet
family
##ea
をこ
##top
story
##ness
salvatore
##lu
204
swift
215
room
している
oracle
##ul
1925
sam
b2c
week
pi
rock
##のは
##a
##けと
##ean
##300
##gle
cctv
after
chinese
##back
powered
x2
##tan
1918
##nes
##イン
canon
only
181
##zi
##las
say
##oe
184
##sd
221
##bot
##world
##zo
sky
made
top100
just
1926
pmi
802
234
gap
##vr
177
les
174
▲topoct
ball
vogue
vi
ing
ofweek
cos
##list
##ort
▲topmay
##なら
##lon
として
last
##tc
##of
##bus
##gen
real
eva
##コ
a3
nas
##lie
##ria
##coin
##bt
▲topapr
his
212
cat
nata
vive
health
⋯⋯
drive
sir
▲topmar
du
cup
##カー
##ook
##よう
##sy
alex
msg
tour
しました
3ce
##word
193
ebooks
r8
block
318
##より
2200
nice
pvp
207
months
1905
rewards
##ther
1917
0800
##xi
##チ
##sc
micro
850
gg
blogfp
op
1922
daily
m1
264
true
##bb
ml
##tar
##のお
##ky
anthony
196
253
##yo
state
218
##ara
##aa
##rc
##tz
##ston
より
gear
##eo
##ade
ge
see
1923
##win
##ura
ss
heart
##den
##ita
down
##sm
el
png
2100
610
rakuten
whatsapp
bay
dream
add
##use
680
311
pad
gucci
mpv
##ode
##fo
island
▲topjun
##▼
223
jason
214
chicago
##❤
しの
##hone
io
##れる
##ことか
sogo
be2
##ology
990
cloud
vcd
##con
2~3
##ford
##joy
##kb
##こさいます
##rade
but
##ach
docker
##ful
rfid
ul
##ase
hit
ford
##star
580
##○
11
a2
sdk
reading
edited
##are
cmos
##mc
238
siri
light
##ella
##ため
bloomberg
##read
pizza
##ison
jimmy
##vm
college
node
journal
ba
18k
##play
245
##cer
20
magic
##yu
191
jump
288
tt
##ings
asr
##lia
3200
step5
network
##cd
mc
いします
1234
pixstyleme
273
##600
2800
money
★★★★★
1280
12
430
bl
みの
act
##tus
tokyo
##rial
##life
emba
##ae
saas
tcs
##rk
##wang
summer
##sp
ko
##ving
390
premium
##その
netflix
##ヒ
uk
mt
##lton
right
frank
two
209
える
##ple
##cal
021
##んな
##sen
##ville
hold
nexus
dd
##ius
てお
##mah
##なく
tila
zero
820
ce
##tin
resort
##ws
charles
old
p10
5d
report
##360
##ru
##には
bus
vans
lt
##est
pv
##レ
links
rebecca
##ツ
##dm
azure
##365
きな
limited
bit
4gb
##mon
1910
moto
##eam
213
1913
var
eos
なとの
226
blogspot
された
699
e3
dos
dm
fc
##ments
##ik
##kw
boy
##bin
##ata
960
er
##せ
219
##vin
##tu
##ula
194
##∥
station
##ろ
##ature
835
files
zara
hdr
top10
nature
950
magazine
s6
marriott
##シ
avira
case
##っと
tab
##ran
tony
##home
oculus
im
##ral
jean
saint
cry
307
rosie
##force
##ini
ice
##bert
のある
##nder
##mber
pet
2600
##◆
plurk
▲topdec
##sis
00kg
▲topnov
720
##ence
tim
##ω
##nc
##ても
##name
log
ips
great
ikea
malaysia
unix
##イト
3600
##ncy
##nie
12000
akb48
##ye
##oid
404
##chi
##いた
oa
xuehai
##1000
##orm
##rf
275
さん
##ware
##リー
980
ho
##pro
text
##era
560
bob
227
##ub
##2008
8891
scp
avi
##zen
2022
mi
wu
museum
qvod
apache
lake
jcb
▲topaug
★★★
ni
##hr
hill
302
ne
weibo
490
ruby
##ーシ
##ヶ
##row
4d
▲topjul
iv
##ish
github
306
mate
312
##スト
##lot
##ane
andrew
のハイト
##tina
t1
rf
ed2k
##vel
##900
way
final
りの
ns
5a
705
197
##メ
sweet
bytes
##ene
▲topjan
231
##cker
##2007
##px
100g
topapp
229
helpapp
rs
low
14k
g4g
care
630
ldquo
あり
##fork
leave
rm
edition
##gan
##zon
##qq
▲topsep
##google
##ism
gold
224
explorer
##zer
toyota
category
select
visual
##labels
restaurant
##md
posts
s1
##ico
もっと
angelababy
123456
217
sports
s3
mbc
1915
してくたさい
shell
x86
candy
##new
kbs
face
xl
470
##here
4a
swissinfo
v8
▲topfeb
dram
##ual
##vice
3a
##wer
sport
q1
ios10
public
int
card
##c
ep
au
rt
##れた
1080
bill
##mll
kim
30
460
wan
##uk
##ミ
x3
298
0t
scott
##ming
239
e5
##3d
h7n9
worldcat
brown
##あります
##vo
##led
##580
##ax
249
410
##ert
paris
##~6
polo
925
##lr
599
##ナ
capital
##hing
bank
cv
1g
##chat
##s
##たい
adc
##ule
2m
##e
digital
hotmail
268
##pad
870
bbq
quot
##ring
before
wali
##まて
mcu
2k
2b
という
costco
316
north
333
switch
##city
##p
philips
##mann
management
panasonic
##cl
##vd
##ping
##rge
alice
##lk
##ましょう
css3
##ney
vision
alpha
##ular
##400
##tter
lz
にお
##ありません
mode
gre
1916
pci
##tm
237
1~2
##yan
##そ
について
##let
##キ
work
war
coach
ah
mary
##ᅵ
huang
##pt
a8
pt
follow
##berry
1895
##ew
a5
ghost
##ション
##wn
##og
south
##code
girls
##rid
action
villa
git
r11
table
games
##cket
error
##anonymoussaid
##ag
here
##ame
##gc
qa
##■
##lis
gmp
##gin
vmalife
##cher
yu
wedding
##tis
demo
dragon
530
soho
social
bye
##rant
river
orz
acer
325
##↑
##ース
##ats
261
del
##ven
440
ups
##ように
##ター
305
value
macd
yougou
##dn
661
##ano
ll
##urt
##rent
continue
script
##wen
##ect
paper
263
319
shift
##chel
##フト
##cat
258
x5
fox
243
##さん
car
aaa
##blog
loading
##yn
##tp
kuso
799
si
sns
イカせるテンマ
ヒンクテンマ3
rmb
vdc
forest
central
prime
help
ultra
##rmb
##ような
241
square
688
##しい
のないフロクに
##field
##reen
##ors
##ju
c1
start
510
##air
##map
cdn
##wo
cba
stephen
m8
100km
##get
opera
##base
##ood
vsa
com™
##aw
##ail
251
なのて
count
t2
##ᅡ
##een
2700
hop
##gp
vsc
tree
##eg
##ose
816
285
##ories
##shop
alphago
v4
1909
simon
##ᆼ
fluke62max
zip
スホンサー
##sta
louis
cr
bas
##~10
bc
##yer
hadoop
##ube
##wi
1906
0755
hola
##low
place
centre
5v
d3
##fer
252
##750
##media
281
540
0l
exchange
262
series
##ハー
##san
eb
##bank
##k
q3
##nge
##mail
take
##lp
259
1888
client
east
cache
event
vincent
##ールを
きを
##nse
sui
855
adchoice
##и
##stry
##なたの
246
##zone
ga
apps
sea
##ab
248
cisco
##タ
##rner
kymco
##care
dha
##pu
##yi
minkoff
royal
p1
への
annie
269
collection
kpi
playstation
257
になります
866
bh
##bar
queen
505
radio
1904
andy
armani
##xy
manager
iherb
##ery
##share
spring
raid
johnson
1908
##ob
volvo
hall
##ball
v6
our
taylor
##hk
bi
242
##cp
kate
bo
water
technology
##rie
サイトは
277
##ona
##sl
hpv
303
gtx
hip
rdquo
jayz
stone
##lex
##rum
namespace
##やり
620
##ale
##atic
des
##erson
##ql
##ves
##type
enter
##この
##てきます
d2
##168
##mix
##bian
との
a9
jj
ky
##lc
access
movie
##hc
リストに
tower
##ration
##mit
ます
##nch
ua
tel
prefix
##o2
1907
##point
1901
ott
~10
##http
##ury
baidu
##ink
member
##logy
bigbang
nownews
##js
##shot
##tb
##こと
247
eba
##tics
##lus
ける
v5
spark
##ama
there
##ions
god
##lls
##down
hiv
##ress
burberry
day2
##kv
◆◆
jeff
related
film
edit
joseph
283
##ark
cx
32gb
order
g9
30000
##ans
##tty
s5
##bee
かあります
thread
xr
buy
sh
005
land
spotify
mx
##ari
276
##verse
×email
sf
why
##ことて
244
7headlines
nego
sunny
dom
exo
401
666
positioning
fit
rgb
##tton
278
kiss
alexa
adam
lp
みリストを
##g
mp
##ties
##llow
amy
##du
np
002
institute
271
##rth
##lar
2345
590
##des
sidebar
15
imax
site
##cky
##kit
##ime
##009
season
323
##fun
##ンター
##ひ
gogoro
a7
pu
lily
fire
twd600
##ッセーシを
いて
##vis
30ml
##cture
##をお
information
##オ
close
friday
##くれる
yi
nick
てすか
##tta
##tel
6500
##lock
cbd
economy
254
かお
267
tinker
double
375
8gb
voice
##app
oops
channel
today
985
##right
raw
xyz
##+
jim
edm
##cent
7500
supreme
814
ds
##its
##asia
dropbox
##てすか
##tti
books
272
100ml
##tle
##ller
##ken
##more
##boy
sex
309
##dom
t3
##ider
##なります
##unch
1903
810
feel
5500
##かった
##put
により
s2
mo
##gh
men
ka
amoled
div
##tr
##n1
port
howard
##tags
ken
dnf
##nus
adsense
##а
ide
##へ
buff
thunder
##town
##ique
has
##body
auto
pin
##erry
tee
てした
295
number
##the
##013
object
psp
cool
udnbkk
16gb
##mic
miui
##tro
most
r2
##alk
##nity
1880
±0
##いました
428
s4
law
version
##oa
n1
sgs
docomo
##tf
##ack
henry
fc2
##ded
##sco
##014
##rite
286
0mm
linkedin
##ada
##now
wii
##ndy
ucbug
##◎
sputniknews
legalminer
##ika
##xp
2gb
##bu
q10
oo
b6
come
##rman
cheese
ming
maker
##gm
nikon
##fig
ppi
kelly
##ります
jchere
てきます
ted
md
003
fgo
tech
##tto
dan
soc
##gl
##len
hair
earth
640
521
img
##pper
##a1
##てきる
##ロク
acca
##ition
##ference
suite
##ig
outlook
##mond
##cation
398
##pr
279
101vip
358
##999
282
64gb
3800
345
airport
##over
284
##おり
jones
##ith
lab
##su
##いるのて
co2
town
piece
##llo
no1
vmware
24h
##qi
focus
reader
##admin
##ora
tb
false
##log
1898
know
lan
838
##ces
f4
##ume
motel
stop
##oper
na
flickr
netcomponents
##af
##─
pose
williams
local
##ound
##cg
##site
##iko
いお
274
5m
gsm
con
##ath
1902
friends
##hip
cell
317
##rey
780
cream
##cks
012
##dp
facebooktwitterpinterestgoogle
sso
324
shtml
song
swiss
##mw
##キンク
lumia
xdd
string
tiffany
522
marc
られた
insee
russell
sc
dell
##ations
ok
camera
289
##vs
##flow
##late
classic
287
##nter
stay
g1
mtv
512
##ever
##lab
##nger
qe
sata
ryan
d1
50ml
cms
##cing
su
292
3300
editor
296
##nap
security
sunday
association
##ens
##700
##bra
acg
##かり
sofascore
とは
mkv
##ign
jonathan
gary
build
labels
##oto
tesla
moba
qi
gohappy
general
ajax
1024
##かる
サイト
society
##test
##urs
wps
fedora
##ich
mozilla
328
##480
##dr
usa
urn
##lina
##r
grace
##die
##try
##ader
1250
##なり
elle
570
##chen
##ᆯ
price
##ten
uhz
##ough
eq
##hen
states
push
session
balance
wow
506
##cus
##py
when
##ward
##ep
34e
wong
library
prada
##サイト
##cle
running
##ree
313
ck
date
q4
##ctive
##ool
##>
mk
##ira
##163
388
die
secret
rq
dota
buffet
は1ヶ
e6
##ez
pan
368
ha
##card
##cha
2a
##さ
alan
day3
eye
f3
##end
france
keep
adi
rna
tvbs
##ala
solo
nova
##え
##tail
##ょう
support
##ries
##なる
##ved
base
copy
iis
fps
##ways
hero
hgih
profile
fish
mu
ssh
entertainment
chang
##wd
click
cake
##ond
pre
##tom
kic
pixel
##ov
##fl
product
6a
##pd
dear
##gate
es
yumi
audio
##²
##sky
echo
bin
where
##ture
329
##ape
find
sap
isis
##なと
nand
##101
##load
##ream
band
a6
525
never
##post
festival
50cm
##we
555
guide
314
zenfone
##ike
335
gd
forum
jessica
strong
alexander
##ould
software
allen
##ious
program
360°
else
lohasthree
##gar
することかてきます
please
##れます
rc
##ggle
##ric
bim
50000
##own
eclipse
355
brian
3ds
##side
061
361
##other
##ける
##tech
##ator
485
engine
##ged
##t
plaza
##fit
cia
ngo
westbrook
shi
tbs
50mm
##みませんか
sci
291
reuters
##ily
contextlink
##hn
af
##cil
bridge
very
##cel
1890
cambridge
##ize
15g
##aid
##data
790
frm
##head
award
butler
##sun
meta
##mar
america
ps3
puma
pmid
##すか
lc
670
kitchen
##lic
オーフン5
きなしソフトサーヒス
そして
day1
future
★★★★
##text
##page
##rris
pm1
##ket
fans
##っています
1001
christian
bot
kids
trackback
##hai
c3
display
##hl
n2
1896
idea
さんも
##sent
airmail
##ug
##men
pwm
けます
028
##lution
369
852
awards
schemas
354
asics
wikipedia
font
##tional
##vy
c2
293
##れている
##dget
##ein
っている
contact
pepper
スキル
339
##~5
294
##uel
##ument
730
##hang
みてす
q5
##sue
rain
##ndi
wei
swatch
##cept
わせ
331
popular
##ste
##tag
p2
501
trc
1899
##west
##live
justin
honda
ping
messenger
##rap
v9
543
##とは
unity
appqq
はすへて
025
leo
##tone
##テ
##ass
uniqlo
##010
502
her
jane
memory
moneydj
##tical
human
12306
していると
##m2
coc
miacare
##mn
tmt
##core
vim
kk
##may
fan
target
use
too
338
435
2050
867
737
fast
##2c
services
##ope
omega
energy
##わ
pinkoi
1a
##なから
##rain
jackson
##ement
##シャンルの
374
366
そんな
p9
rd
##ᆨ
1111
##tier
##vic
zone
##│
385
690
dl
isofix
cpa
m4
322
kimi
めて
davis
##lay
lulu
##uck
050
weeks
qs
##hop
920
##n
ae
##ear
~5
eia
405
##fly
korea
jpeg
boost
##ship
small
##リア
1860
eur
297
425
valley
##iel
simple
##ude
rn
k2
##ena
されます
non
patrick
しているから
##ナー
feed
5757
30g
process
well
qqmei
##thing
they
aws
lu
pink
##ters
##kin
または
board
##vertisement
wine
##ien
unicode
##dge
r1
359
##tant
いを
##twitter
##3c
cool1
される
##れて
##l
isp
##012
standard
45㎡2
402
##150
matt
##fu
326
##iner
googlemsn
pixnetfacebookyahoo
##ラン
x7
886
##uce
メーカー
sao
##ev
##きました
##file
9678
403
xddd
shirt
6l
##rio
##hat
3mm
givenchy
ya
bang
##lio
monday
crystal
ロクイン
##abc
336
head
890
ubuntuforumwikilinuxpastechat
##vc
##~20
##rity
cnc
7866
ipv6
null
1897
##ost
yang
imsean
tiger
##fet
##ンス
352
##=
dji
327
ji
maria
##come
##んて
foundation
3100
##beth
##なった
1m
601
active
##aft
##don
3p
sr
349
emma
##khz
living
415
353
1889
341
709
457
sas
x6
##face
pptv
x4
##mate
han
sophie
##jing
337
fifa
##mand
other
sale
inwedding
##gn
てきちゃいます
##mmy
##pmlast
bad
nana
nbc
してみてくたさいね
なとはお
##wu
##かあります
##あ
note7
single
##340
せからこ
してくたさい♪この
しにはとんとんワークケートを
するとあなたにもっとマッチした
ならワークケートへ
もみつかっちゃうかも
ワークケートの
##bel
window
##dio
##ht
union
age
382
14
##ivity
##y
コメント
domain
neo
##isa
##lter
5k
f5
steven
##cts
powerpoint
tft
self
g2
ft
##テル
zol
##act
mwc
381
343
もう
nbapop
408
てある
eds
ace
##room
previous
author
tomtom
il
##ets
hu
financial
☆☆☆
っています
bp
5t
chi
1gb
##hg
fairmont
cross
008
gay
h2
function
##けて
356
also
1b
625
##ータ
##raph
1894
3~5
##ils
i3
334
avenue
##host
による
##bon
##tsu
message
navigation
50g
fintech
h6
##ことを
8cm
##ject
##vas
##firm
credit
##wf
xxxx
form
##nor
##space
huawei
plan
json
sbl
##dc
machine
921
392
wish
##120
##sol
windows7
edward
##ために
development
washington
##nsis
lo
818
##sio
##ym
##bor
planet
##~8
##wt
ieee
gpa
##めて
camp
ann
gm
##tw
##oka
connect
##rss
##work
##atus
wall
chicken
soul
2mm
##times
fa
##ather
##cord
009
##eep
hitachi
gui
harry
##pan
e1
disney
##press
##ーション
wind
386
frigidaire
##tl
liu
hsu
332
basic
von
ev
いた
てきる
スホンサーサイト
learning
##ull
expedia
archives
change
##wei
santa
cut
ins
6gb
turbo
brand
cf1
508
004
return
747
##rip
h1
##nis
##をこ
128gb
##にお
3t
application
しており
emc
rx
##oon
384
quick
412
15058
wilson
wing
chapter
##bug
beyond
##cms
##dar
##oh
zoom
e2
trip
sb
##nba
rcep
342
aspx
ci
080
gc
gnu
める
##count
advanced
dance
dv
##url
##ging
367
8591
am09
shadow
battle
346
##i
##cia
##という
emily
##のてす
##tation
host
ff
techorz
sars
##mini
##mporary
##ering
nc
4200
798
##next
cma
##mbps
##gas
##ift
##dot
##ィ
455
##~17
amana
##りの
426
##ros
ir
00㎡1
##eet
##ible
##↓
710
ˋ▽ˊ
##aka
dcs
iq
##v
l1
##lor
maggie
##011
##iu
588
##~1
830
##gt
1tb
articles
create
##burg
##iki
database
fantasy
##rex
##cam
dlc
dean
##you
hard
path
gaming
victoria
maps
cb
##lee
##itor
overchicstoretvhome
systems
##xt
416
p3
sarah
760
##nan
407
486
x9
install
second
626
##ann
##ph
##rcle
##nic
860
##nar
ec
##とう
768
metro
chocolate
##rian
~4
##table
##しています
skin
##sn
395
mountain
##0mm
inparadise
6m
7x24
ib
4800
##jia
eeworld
creative
g5
g3
357
parker
ecfa
village
からの
18000
sylvia
サーヒス
hbl
##ques
##onsored
##x2
##きます
##v4
##tein
ie6
383
##stack
389
ver
##ads
##baby
sound
bbe
##110
##lone
##uid
ads
022
gundam
351
thinkpad
006
scrum
match
##ave
mems
##470
##oy
##なりました
##talk
glass
lamigo
span
##eme
job
##a5
jay
wade
kde
498
##lace
ocean
tvg
##covery
##r3
##ners
##rea
junior
think
##aine
cover
##ision
##sia
↓↓
##bow
msi
413
458
406
##love
711
801
soft
z2
##pl
456
1840
mobil
mind
##uy
427
nginx
##oi
めた
##rr
6221
##mple
##sson
##ーシてす
371
##nts
91tv
comhd
crv3000
##uard
1868
397
deep
lost
field
gallery
##bia
rate
spf
redis
traction
930
icloud
011
なら
fe
jose
372
##tory
into
sohu
fx
899
379
kicstart2
##hia
すく
##~3
##sit
ra
24
##walk
##xure
500g
##pact
pacific
xa
natural
carlo
##250
##walker
1850
##can
cto
gigi
516
##サー
pen
##hoo
ob
matlab
##b
##yy
13913459
##iti
mango
##bbs
sense
c5
oxford
##ニア
walker
jennifer
##ola
course
##bre
701
##pus
##rder
lucky
075
##ぁ
ivy
なお
##nia
sotheby
side
##ugh
joy
##orage
##ush
##bat
##dt
364
r9
##2d
##gio
511
country
wear
##lax
##~7
##moon
393
seven
study
411
348
lonzo
8k
##ェ
evolution
##イフ
##kk
gs
kd
##レス
arduino
344
b12
##lux
arpg
##rdon
cook
##x5
dark
five
##als
##ida
とても
sign
362
##ちの
something
20mm
##nda
387
##posted
fresh
tf
1870
422
cam
##mine
##skip
##form
##ssion
education
394
##tee
dyson
stage
##jie
want
##night
epson
pack
あります
##ppy
テリヘル
##█
wd
##eh
##rence
left
##lvin
golden
mhz
discovery
##trix
##n2
loft
##uch
##dra
##sse
speed
~1
1mdb
sorry
welcome
##urn
wave
gaga
##lmer
teddy
##160
トラックハック
せよ
611
##f2016
378
rp
##sha
rar
##あなたに
##きた
840
holiday
##ュー
373
074
##vg
##nos
##rail
gartner
gi
6p
##dium
kit
488
b3
eco
##ろう
20g
sean
##stone
autocad
nu
##np
f16
write
029
m5
##ias
images
atp
##dk
fsm
504
1350
ve
52kb
##xxx
##のに
##cake
414
unit
lim
ru
1v
##ification
published
angela
16g
analytics
ak
##q
##nel
gmt
##icon
again
##₂
##bby
ios11
445
かこさいます
waze
いてす
##ハ
9985
##ust
##ティー
framework
##007
iptv
delete
52sykb
cl
wwdc
027
30cm
##fw
##ての
1389
##xon
brandt
##ses
##dragon
tc
vetements
anne
monte
modern
official
##へて
##ere
##nne
##oud
もちろん
50
etnews
##a2
##graphy
421
863
##ちゃん
444
##rtex
##てお
l2
##gma
mount
ccd
たと
archive
morning
tan
ddos
e7
##ホ
day4
##ウ
gis
453
its
495
factory
bruce
pg
##ito
ってくたさい
guest
cdma
##lling
536
n3
しかし
3~4
mega
eyes
ro
13
women
dac
church
##jun
singapore
##facebook
6991
starbucks
##tos
##stin
##shine
zen
##mu
tina
20℃
1893
##たけて
503
465
request
##gence
qt
##っ
1886
347
363
q7
##zzi
diary
##tore
409
##ead
468
cst
##osa
canada
agent
va
##jiang
##ちは
##ーク
##lam
sg
##nix
##sday
##よって
g6
##master
bing
##zl
charlie
16
8mm
nb40
##ーン
thai
##ルフ
ln284ct
##itz
##2f
bonnie
##food
##lent
originals
##stro
##lts
418
∟∣
##bscribe
children
ntd
yesstyle
##かも
hmv
##tment
d5
2cm
arts
sms
##pn
##я
##いい
topios9
539
lifestyle
virtual
##ague
xz
##deo
muji
024
unt
##nnis
##ᅩ
faq1
1884
396
##ette
fly
64㎡
はしめまして
441
curry
##pop
のこ
release
##←
##◆◆
##cast
073
ありな
500ml
##ews
5c
##stle
ios7
##ima
787
dog
lenovo
##r4
roger
013
cbs
vornado
100m
417
##desk
##クok
##ald
1867
9595
2900
##van
oil
##x
some
break
common
##jy
##lines
g7
twice
419
ella
nano
belle
にこ
##mes
##self
##note
jb
##ことかてきます
benz
##との
##ova
451
save
##wing
##ますのて
kai
りは
##hua
##rect
rainer
##unge
448
##0m
adsl
##かな
guestname
##uma
##kins
##zu
tokichoi
##price
county
##med
##mus
rmk
391
address
vm
えて
openload
##group
##hin
##iginal
amg
urban
##oz
jobs
emi
##public
beautiful
##sch
album
##dden
##bell
jerry
works
hostel
miller
##drive
##rmin
##10
376
boot
828
##370
##fx
##cm~
1885
##nome
##ctionary
##oman
##lish
##cr
##hm
433
##how
432
francis
xi
c919
b5
evernote
##uc
vga
##3000
coupe
##urg
##cca
##uality
019
6g
れる
multi
##また
##ett
em
hey
##ani
##tax
##rma
inside
than
740
leonnhurt
##jin
ict
れた
bird
notes
200mm
くの
##dical
##lli
result
442
iu
ee
438
smap
gopro
##last
yin
pure
998
32g
けた
5kg
##dan
##rame
mama
##oot
bean
marketing
##hur
2l
bella
sync
xuite
##ground
515
discuz
##getrelax
##ince
##bay
##5s
cj
##イス
gmat
apt
##pass
jing
##rix
c4
rich
##とても
niusnews
##ello
bag
770
##eting
##mobile
18
culture
015
##のてすか
377
1020
area
##ience
616
details
gp
universal
silver
dit
はお
private
ddd
u11
kanshu
##ified
fung
##nny
dx
##520
tai
475
023
##fr
##lean
3s
##pin
429
##rin
25000
ly
rick
##bility
usb3
banner
##baru
##gion
metal
dt
vdf
1871
karl
qualcomm
bear
1010
oldid
ian
jo
##tors
population
##ernel
1882
mmorpg
##mv
##bike
603
##©
ww
friend
##ager
exhibition
##del
##pods
fpx
structure
##free
##tings
kl
##rley
##copyright
##mma
california
3400
orange
yoga
4l
canmake
honey
##anda
##コメント
595
nikkie
##ルハイト
dhl
publishing
##mall
##gnet
20cm
513
##クセス
##┅
e88
970
##dog
fishbase
##!
##"
###
##$
##%
##&
##'
##(
##)
##*
##+
##,
##-
##.
##/
##:
##;
##<
##=
##>
##?
##@
##[
##\
##]
##^
##_
##{
##|
##}
##~
##£
##¤
##¥
##§
##«
##±
##³
##µ
##·
##¹
##º
##»
##¼
##ß
##æ
##÷
##ø
##đ
##ŋ
##ɔ
##ə
##ɡ
##ʰ
##ˇ
##ˈ
##ˊ
##ˋ
##ˍ
##ː
##˙
##˚
##ˢ
##α
##β
##γ
##δ
##ε
##η
##θ
##ι
##κ
##λ
##μ
##ν
##ο
##π
##ρ
##ς
##σ
##τ
##υ
##φ
##χ
##ψ
##б
##в
##г
##д
##е
##ж
##з
##к
##л
##м
##н
##о
##п
##р
##с
##т
##у
##ф
##х
##ц
##ч
##ш
##ы
##ь
##і
##ا
##ب
##ة
##ت
##د
##ر
##س
##ع
##ل
##م
##ن
##ه
##و
##ي
##۩
##ก
##ง
##น
##ม
##ย
##ร
##อ
##า
##เ
##๑
##་
##ღ
##ᄀ
##ᄁ
##ᄂ
##ᄃ
##ᄅ
##ᄆ
##ᄇ
##ᄈ
##ᄉ
##ᄋ
##ᄌ
##ᄎ
##ᄏ
##ᄐ
##ᄑ
##ᄒ
##ᅢ
##ᅣ
##ᅥ
##ᅦ
##ᅧ
##ᅨ
##ᅪ
##ᅬ
##ᅭ
##ᅮ
##ᅯ
##ᅲ
##ᅳ
##ᅴ
##ᆷ
##ᆸ
##ᆺ
##ᆻ
##ᗜ
##ᵃ
##ᵉ
##ᵍ
##ᵏ
##ᵐ
##ᵒ
##ᵘ
##‖
##„
##†
##•
##‥
##‧
##
##‰
##′
##″
##‹
##›
##※
##‿
##⁄
##ⁱ
##⁺
##ⁿ
##₁
##₃
##₄
##€
##№
##ⅰ
##ⅱ
##ⅲ
##ⅳ
##ⅴ
##↔
##↗
##↘
##⇒
##∀
##−
##∕
##∙
##√
##∞
##∟
##∠
##∣
##∩
##∮
##∶
##∼
##∽
##≈
##≒
##≡
##≤
##≥
##≦
##≧
##≪
##≫
##⊙
##⋅
##⋈
##⋯
##⌒
##①
##②
##③
##④
##⑤
##⑥
##⑦
##⑧
##⑨
##⑩
##⑴
##⑵
##⑶
##⑷
##⑸
##⒈
##⒉
##⒊
##⒋
##ⓒ
##ⓔ
##ⓘ
##━
##┃
##┆
##┊
##┌
##└
##├
##┣
##═
##║
##╚
##╞
##╠
##╭
##╮
##╯
##╰
##╱
##╳
##▂
##▃
##▅
##▇
##▉
##▋
##▌
##▍
##▎
##□
##▪
##▫
##▬
##△
##▶
##►
##▽
##◇
##◕
##◠
##◢
##◤
##☀
##☕
##☞
##☺
##☼
##♀
##♂
##♠
##♡
##♣
##♦
##♫
##♬
##✈
##✔
##✕
##✖
##✦
##✨
##✪
##✰
##✿
##❀
##➜
##➤
##⦿
##、
##。
##〃
##々
##〇
##〈
##〉
##《
##》
##「
##」
##『
##』
##【
##】
##〓
##〔
##〕
##〖
##〗
##〜
##〝
##〞
##ぃ
##ぇ
##ぬ
##ふ
##ほ
##む
##ゃ
##ゅ
##ゆ
##ょ
##゜
##ゝ
##ァ
##ゥ
##エ
##ォ
##ケ
##サ
##セ
##ソ
##ッ
##ニ
##ヌ
##ネ
##ノ
##ヘ
##モ
##ャ
##ヤ
##ュ
##ユ
##ョ
##ヨ
##ワ
##ヲ
##・
##ヽ
##ㄅ
##ㄆ
##ㄇ
##ㄉ
##ㄋ
##ㄌ
##ㄍ
##ㄎ
##ㄏ
##ㄒ
##ㄚ
##ㄛ
##ㄞ
##ㄟ
##ㄢ
##ㄤ
##ㄥ
##ㄧ
##ㄨ
##ㆍ
##㈦
##㊣
##㗎
##一
##丁
##七
##万
##丈
##三
##上
##下
##不
##与
##丐
##丑
##专
##且
##丕
##世
##丘
##丙
##业
##丛
##东
##丝
##丞
##丟
##両
##丢
##两
##严
##並
##丧
##丨
##个
##丫
##中
##丰
##串
##临
##丶
##丸
##丹
##为
##主
##丼
##丽
##举
##丿
##乂
##乃
##久
##么
##义
##之
##乌
##乍
##乎
##乏
##乐
##乒
##乓
##乔
##乖
##乗
##乘
##乙
##乜
##九
##乞
##也
##习
##乡
##书
##乩
##买
##乱
##乳
##乾
##亀
##亂
##了
##予
##争
##事
##二
##于
##亏
##云
##互
##五
##井
##亘
##亙
##亚
##些
##亜
##亞
##亟
##亡
##亢
##交
##亥
##亦
##产
##亨
##亩
##享
##京
##亭
##亮
##亲
##亳
##亵
##人
##亿
##什
##仁
##仃
##仄
##仅
##仆
##仇
##今
##介
##仍
##从
##仏
##仑
##仓
##仔
##仕
##他
##仗
##付
##仙
##仝
##仞
##仟
##代
##令
##以
##仨
##仪
##们
##仮
##仰
##仲
##件
##价
##任
##份
##仿
##企
##伉
##伊
##伍
##伎
##伏
##伐
##休
##伕
##众
##优
##伙
##会
##伝
##伞
##伟
##传
##伢
##伤
##伦
##伪
##伫
##伯
##估
##伴
##伶
##伸
##伺
##似
##伽
##佃
##但
##佇
##佈
##位
##低
##住
##佐
##佑
##体
##佔
##何
##佗
##佘
##余
##佚
##佛
##作
##佝
##佞
##佟
##你
##佢
##佣
##佤
##佥
##佩
##佬
##佯
##佰
##佳
##併
##佶
##佻
##佼
##使
##侃
##侄
##來
##侈
##例
##侍
##侏
##侑
##侖
##侗
##供
##依
##侠
##価
##侣
##侥
##侦
##侧
##侨
##侬
##侮
##侯
##侵
##侶
##侷
##便
##係
##促
##俄
##俊
##俎
##俏
##俐
##俑
##俗
##俘
##俚
##保
##俞
##俟
##俠
##信
##俨
##俩
##俪
##俬
##俭
##修
##俯
##俱
##俳
##俸
##俺
##俾
##倆
##倉
##個
##倌
##倍
##倏
##們
##倒
##倔
##倖
##倘
##候
##倚
##倜
##借
##倡
##値
##倦
##倩
##倪
##倫
##倬
##倭
##倶
##债
##值
##倾
##偃
##假
##偈
##偉
##偌
##偎
##偏
##偕
##做
##停
##健
##側
##偵
##偶
##偷
##偻
##偽
##偿
##傀
##傅
##傍
##傑
##傘
##備
##傚
##傢
##傣
##傥
##储
##傩
##催
##傭
##傲
##傳
##債
##傷
##傻
##傾
##僅
##働
##像
##僑
##僕
##僖
##僚
##僥
##僧
##僭
##僮
##僱
##僵
##價
##僻
##儀
##儂
##億
##儆
##儉
##儋
##儒
##儕
##儘
##償
##儡
##優
##儲
##儷
##儼
##儿
##兀
##允
##元
##兄
##充
##兆
##兇
##先
##光
##克
##兌
##免
##児
##兑
##兒
##兔
##兖
##党
##兜
##兢
##入
##內
##全
##兩
##八
##公
##六
##兮
##兰
##共
##兲
##关
##兴
##兵
##其
##具
##典
##兹
##养
##兼
##兽
##冀
##内
##円
##冇
##冈
##冉
##冊
##册
##再
##冏
##冒
##冕
##冗
##写
##军
##农
##冠
##冢
##冤
##冥
##冨
##冪
##冬
##冯
##冰
##冲
##决
##况
##冶
##冷
##冻
##冼
##冽
##冾
##净
##凄
##准
##凇
##凈
##凉
##凋
##凌
##凍
##减
##凑
##凛
##凜
##凝
##几
##凡
##凤
##処
##凪
##凭
##凯
##凰
##凱
##凳
##凶
##凸
##凹
##出
##击
##函
##凿
##刀
##刁
##刃
##分
##切
##刈
##刊
##刍
##刎
##刑
##划
##列
##刘
##则
##刚
##创
##初
##删
##判
##別
##刨
##利
##刪
##别
##刮
##到
##制
##刷
##券
##刹
##刺
##刻
##刽
##剁
##剂
##剃
##則
##剉
##削
##剋
##剌
##前
##剎
##剐
##剑
##剔
##剖
##剛
##剜
##剝
##剣
##剤
##剥
##剧
##剩
##剪
##副
##割
##創
##剷
##剽
##剿
##劃
##劇
##劈
##劉
##劊
##劍
##劏
##劑
##力
##劝
##办
##功
##加
##务
##劣
##动
##助
##努
##劫
##劭
##励
##劲
##劳
##労
##劵
##効
##劾
##势
##勁
##勃
##勇
##勉
##勋
##勐
##勒
##動
##勖
##勘
##務
##勛
##勝
##勞
##募
##勢
##勤
##勧
##勳
##勵
##勸
##勺
##勻
##勾
##勿
##匀
##包
##匆
##匈
##匍
##匐
##匕
##化
##北
##匙
##匝
##匠
##匡
##匣
##匪
##匮
##匯
##匱
##匹
##区
##医
##匾
##匿
##區
##十
##千
##卅
##升
##午
##卉
##半
##卍
##华
##协
##卑
##卒
##卓
##協
##单
##卖
##南
##単
##博
##卜
##卞
##卟
##占
##卡
##卢
##卤
##卦
##卧
##卫
##卮
##卯
##印
##危
##即
##却
##卵
##卷
##卸
##卻
##卿
##厂
##厄
##厅
##历
##厉
##压
##厌
##厕
##厘
##厚
##厝
##原
##厢
##厥
##厦
##厨
##厩
##厭
##厮
##厲
##厳
##去
##县
##叁
##参
##參
##又
##叉
##及
##友
##双
##反
##収
##发
##叔
##取
##受
##变
##叙
##叛
##叟
##叠
##叡
##叢
##口
##古
##句
##另
##叨
##叩
##只
##叫
##召
##叭
##叮
##可
##台
##叱
##史
##右
##叵
##叶
##号
##司
##叹
##叻
##叼
##叽
##吁
##吃
##各
##吆
##合
##吉
##吊
##吋
##同
##名
##后
##吏
##吐
##向
##吒
##吓
##吕
##吖
##吗
##君
##吝
##吞
##吟
##吠
##吡
##否
##吧
##吨
##吩
##含
##听
##吭
##吮
##启
##吱
##吳
##吴
##吵
##吶
##吸
##吹
##吻
##吼
##吽
##吾
##呀
##呂
##呃
##呆
##呈
##告
##呋
##呎
##呐
##呓
##呕
##呗
##员
##呛
##呜
##呢
##呤
##呦
##周
##呱
##呲
##味
##呵
##呷
##呸
##呻
##呼
##命
##咀
##咁
##咂
##咄
##咆
##咋
##和
##咎
##咏
##咐
##咒
##咔
##咕
##咖
##咗
##咘
##咙
##咚
##咛
##咣
##咤
##咦
##咧
##咨
##咩
##咪
##咫
##咬
##咭
##咯
##咱
##咲
##咳
##咸
##咻
##咽
##咿
##哀
##品
##哂
##哄
##哆
##哇
##哈
##哉
##哋
##哌
##响
##哎
##哏
##哐
##哑
##哒
##哔
##哗
##哟
##員
##哥
##哦
##哧
##哨
##哩
##哪
##哭
##哮
##哲
##哺
##哼
##哽
##唁
##唄
##唆
##唇
##唉
##唏
##唐
##唑
##唔
##唠
##唤
##唧
##唬
##售
##唯
##唰
##唱
##唳
##唷
##唸
##唾
##啃
##啄
##商
##啉
##啊
##問
##啓
##啕
##啖
##啜
##啞
##啟
##啡
##啤
##啥
##啦
##啧
##啪
##啫
##啬
##啮
##啰
##啱
##啲
##啵
##啶
##啷
##啸
##啻
##啼
##啾
##喀
##喂
##喃
##善
##喆
##喇
##喉
##喊
##喋
##喎
##喏
##喔
##喘
##喙
##喚
##喜
##喝
##喟
##喧
##喪
##喫
##喬
##單
##喰
##喱
##喲
##喳
##喵
##営
##喷
##喹
##喺
##喻
##喽
##嗅
##嗆
##嗇
##嗎
##嗑
##嗒
##嗓
##嗔
##嗖
##嗚
##嗜
##嗝
##嗟
##嗡
##嗣
##嗤
##嗦
##嗨
##嗪
##嗬
##嗯
##嗰
##嗲
##嗳
##嗶
##嗷
##嗽
##嘀
##嘅
##嘆
##嘈
##嘉
##嘌
##嘍
##嘎
##嘔
##嘖
##嘗
##嘘
##嘚
##嘛
##嘜
##嘞
##嘟
##嘢
##嘣
##嘤
##嘧
##嘩
##嘭
##嘮
##嘯
##嘰
##嘱
##嘲
##嘴
##嘶
##嘸
##嘹
##嘻
##嘿
##噁
##噌
##噎
##噓
##噔
##噗
##噙
##噜
##噠
##噢
##噤
##器
##噩
##噪
##噬
##噱
##噴
##噶
##噸
##噹
##噻
##噼
##嚀
##嚇
##嚎
##嚏
##嚐
##嚓
##嚕
##嚟
##嚣
##嚥
##嚨
##嚮
##嚴
##嚷
##嚼
##囂
##囉
##囊
##囍
##囑
##囔
##囗
##囚
##四
##囝
##回
##囟
##因
##囡
##团
##団
##囤
##囧
##囪
##囫
##园
##困
##囱
##囲
##図
##围
##囹
##固
##国
##图
##囿
##圃
##圄
##圆
##圈
##國
##圍
##圏
##園
##圓
##圖
##團
##圜
##土
##圣
##圧
##在
##圩
##圭
##地
##圳
##场
##圻
##圾
##址
##坂
##均
##坊
##坍
##坎
##坏
##坐
##坑
##块
##坚
##坛
##坝
##坞
##坟
##坠
##坡
##坤
##坦
##坨
##坪
##坯
##坳
##坵
##坷
##垂
##垃
##垄
##型
##垒
##垚
##垛
##垠
##垢
##垣
##垦
##垩
##垫
##垭
##垮
##垵
##埂
##埃
##埋
##城
##埔
##埕
##埗
##域
##埠
##埤
##埵
##執
##埸
##培
##基
##埼
##堀
##堂
##堃
##堅
##堆
##堇
##堑
##堕
##堙
##堡
##堤
##堪
##堯
##堰
##報
##場
##堵
##堺
##堿
##塊
##塌
##塑
##塔
##塗
##塘
##塚
##塞
##塢
##塩
##填
##塬
##塭
##塵
##塾
##墀
##境
##墅
##墉
##墊
##墒
##墓
##増
##墘
##墙
##墜
##增
##墟
##墨
##墩
##墮
##墳
##墻
##墾
##壁
##壅
##壆
##壇
##壊
##壑
##壓
##壕
##壘
##壞
##壟
##壢
##壤
##壩
##士
##壬
##壮
##壯
##声
##売
##壳
##壶
##壹
##壺
##壽
##处
##备
##変
##复
##夏
##夔
##夕
##外
##夙
##多
##夜
##够
##夠
##夢
##夥
##大
##天
##太
##夫
##夭
##央
##夯
##失
##头
##夷
##夸
##夹
##夺
##夾
##奂
##奄
##奇
##奈
##奉
##奋
##奎
##奏
##奐
##契
##奔
##奕
##奖
##套
##奘
##奚
##奠
##奢
##奥
##奧
##奪
##奬
##奮
##女
##奴
##奶
##奸
##她
##好
##如
##妃
##妄
##妆
##妇
##妈
##妊
##妍
##妒
##妓
##妖
##妘
##妙
##妝
##妞
##妣
##妤
##妥
##妨
##妩
##妪
##妮
##妲
##妳
##妹
##妻
##妾
##姆
##姉
##姊
##始
##姍
##姐
##姑
##姒
##姓
##委
##姗
##姚
##姜
##姝
##姣
##姥
##姦
##姨
##姪
##姫
##姬
##姹
##姻
##姿
##威
##娃
##娄
##娅
##娆
##娇
##娉
##娑
##娓
##娘
##娛
##娜
##娟
##娠
##娣
##娥
##娩
##娱
##娲
##娴
##娶
##娼
##婀
##婁
##婆
##婉
##婊
##婕
##婚
##婢
##婦
##婧
##婪
##婭
##婴
##婵
##婶
##婷
##婺
##婿
##媒
##媚
##媛
##媞
##媧
##媲
##媳
##媽
##媾
##嫁
##嫂
##嫉
##嫌
##嫑
##嫔
##嫖
##嫘
##嫚
##嫡
##嫣
##嫦
##嫩
##嫲
##嫵
##嫻
##嬅
##嬉
##嬌
##嬗
##嬛
##嬢
##嬤
##嬪
##嬰
##嬴
##嬷
##嬸
##嬿
##孀
##孃
##子
##孑
##孔
##孕
##孖
##字
##存
##孙
##孚
##孛
##孜
##孝
##孟
##孢
##季
##孤
##学
##孩
##孪
##孫
##孬
##孰
##孱
##孳
##孵
##學
##孺
##孽
##孿
##宁
##它
##宅
##宇
##守
##安
##宋
##完
##宏
##宓
##宕
##宗
##官
##宙
##定
##宛
##宜
##宝
##实
##実
##宠
##审
##客
##宣
##室
##宥
##宦
##宪
##宫
##宮
##宰
##害
##宴
##宵
##家
##宸
##容
##宽
##宾
##宿
##寂
##寄
##寅
##密
##寇
##富
##寐
##寒
##寓
##寛
##寝
##寞
##察
##寡
##寢
##寥
##實
##寧
##寨
##審
##寫
##寬
##寮
##寰
##寵
##寶
##寸
##对
##寺
##寻
##导
##対
##寿
##封
##専
##射
##将
##將
##專
##尉
##尊
##尋
##對
##導
##小
##少
##尔
##尕
##尖
##尘
##尚
##尝
##尤
##尧
##尬
##就
##尴
##尷
##尸
##尹
##尺
##尻
##尼
##尽
##尾
##尿
##局
##屁
##层
##屄
##居
##屆
##屈
##屉
##届
##屋
##屌
##屍
##屎
##屏
##屐
##屑
##展
##屜
##属
##屠
##屡
##屢
##層
##履
##屬
##屯
##山
##屹
##屿
##岀
##岁
##岂
##岌
##岐
##岑
##岔
##岖
##岗
##岘
##岙
##岚
##岛
##岡
##岩
##岫
##岬
##岭
##岱
##岳
##岷
##岸
##峇
##峋
##峒
##峙
##峡
##峤
##峥
##峦
##峨
##峪
##峭
##峯
##峰
##峴
##島
##峻
##峽
##崁
##崂
##崆
##崇
##崎
##崑
##崔
##崖
##崗
##崙
##崛
##崧
##崩
##崭
##崴
##崽
##嵇
##嵊
##嵋
##嵌
##嵐
##嵘
##嵩
##嵬
##嵯
##嶂
##嶄
##嶇
##嶋
##嶙
##嶺
##嶼
##嶽
##巅
##巍
##巒
##巔
##巖
##川
##州
##巡
##巢
##工
##左
##巧
##巨
##巩
##巫
##差
##己
##已
##巳
##巴
##巷
##巻
##巽
##巾
##巿
##币
##市
##布
##帅
##帆
##师
##希
##帐
##帑
##帕
##帖
##帘
##帚
##帛
##帜
##帝
##帥
##带
##帧
##師
##席
##帮
##帯
##帰
##帳
##帶
##帷
##常
##帼
##帽
##幀
##幂
##幄
##幅
##幌
##幔
##幕
##幟
##幡
##幢
##幣
##幫
##干
##平
##年
##并
##幸
##幹
##幺
##幻
##幼
##幽
##幾
##广
##庁
##広
##庄
##庆
##庇
##床
##序
##庐
##库
##应
##底
##庖
##店
##庙
##庚
##府
##庞
##废
##庠
##度
##座
##庫
##庭
##庵
##庶
##康
##庸
##庹
##庾
##廁
##廂
##廃
##廈
##廉
##廊
##廓
##廖
##廚
##廝
##廟
##廠
##廢
##廣
##廬
##廳
##延
##廷
##建
##廿
##开
##弁
##异
##弃
##弄
##弈
##弊
##弋
##式
##弑
##弒
##弓
##弔
##引
##弗
##弘
##弛
##弟
##张
##弥
##弦
##弧
##弩
##弭
##弯
##弱
##張
##強
##弹
##强
##弼
##弾
##彅
##彆
##彈
##彌
##彎
##归
##当
##录
##彗
##彙
##彝
##形
##彤
##彥
##彦
##彧
##彩
##彪
##彫
##彬
##彭
##彰
##影
##彷
##役
##彻
##彼
##彿
##往
##征
##径
##待
##徇
##很
##徉
##徊
##律
##後
##徐
##徑
##徒
##従
##徕
##得
##徘
##徙
##徜
##從
##徠
##御
##徨
##復
##循
##徬
##微
##徳
##徴
##徵
##德
##徹
##徼
##徽
##心
##必
##忆
##忌
##忍
##忏
##忐
##忑
##忒
##忖
##志
##忘
##忙
##応
##忠
##忡
##忤
##忧
##忪
##快
##忱
##念
##忻
##忽
##忿
##怀
##态
##怂
##怅
##怆
##怎
##怏
##怒
##怔
##怕
##怖
##怙
##怜
##思
##怠
##怡
##急
##怦
##性
##怨
##怪
##怯
##怵
##总
##怼
##恁
##恃
##恆
##恋
##恍
##恐
##恒
##恕
##恙
##恚
##恢
##恣
##恤
##恥
##恨
##恩
##恪
##恫
##恬
##恭
##息
##恰
##恳
##恵
##恶
##恸
##恺
##恻
##恼
##恿
##悄
##悅
##悉
##悌
##悍
##悔
##悖
##悚
##悟
##悠
##患
##悦
##您
##悩
##悪
##悬
##悯
##悱
##悲
##悴
##悵
##悶
##悸
##悻
##悼
##悽
##情
##惆
##惇
##惊
##惋
##惑
##惕
##惘
##惚
##惜
##惟
##惠
##惡
##惦
##惧
##惨
##惩
##惫
##惬
##惭
##惮
##惯
##惰
##惱
##想
##惴
##惶
##惹
##惺
##愁
##愆
##愈
##愉
##愍
##意
##愕
##愚
##愛
##愜
##感
##愣
##愤
##愧
##愫
##愷
##愿
##慄
##慈
##態
##慌
##慎
##慑
##慕
##慘
##慚
##慟
##慢
##慣
##慧
##慨
##慫
##慮
##慰
##慳
##慵
##慶
##慷
##慾
##憂
##憊
##憋
##憎
##憐
##憑
##憔
##憚
##憤
##憧
##憨
##憩
##憫
##憬
##憲
##憶
##憾
##懂
##懇
##懈
##應
##懊
##懋
##懑
##懒
##懦
##懲
##懵
##懶
##懷
##懸
##懺
##懼
##懾
##懿
##戀
##戈
##戊
##戌
##戍
##戎
##戏
##成
##我
##戒
##戕
##或
##战
##戚
##戛
##戟
##戡
##戦
##截
##戬
##戮
##戰
##戲
##戳
##戴
##戶
##户
##戸
##戻
##戾
##房
##所
##扁
##扇
##扈
##扉
##手
##才
##扎
##扑
##扒
##打
##扔
##払
##托
##扛
##扣
##扦
##执
##扩
##扪
##扫
##扬
##扭
##扮
##扯
##扰
##扱
##扳
##扶
##批
##扼
##找
##承
##技
##抄
##抉
##把
##抑
##抒
##抓
##投
##抖
##抗
##折
##抚
##抛
##抜
##択
##抟
##抠
##抡
##抢
##护
##报
##抨
##披
##抬
##抱
##抵
##抹
##押
##抽
##抿
##拂
##拄
##担
##拆
##拇
##拈
##拉
##拋
##拌
##拍
##拎
##拐
##拒
##拓
##拔
##拖
##拗
##拘
##拙
##拚
##招
##拜
##拟
##拡
##拢
##拣
##拥
##拦
##拧
##拨
##择
##括
##拭
##拮
##拯
##拱
##拳
##拴
##拷
##拼
##拽
##拾
##拿
##持
##挂
##指
##挈
##按
##挎
##挑
##挖
##挙
##挚
##挛
##挝
##挞
##挟
##挠
##挡
##挣
##挤
##挥
##挨
##挪
##挫
##振
##挲
##挹
##挺
##挽
##挾
##捂
##捅
##捆
##捉
##捋
##捌
##捍
##捎
##捏
##捐
##捕
##捞
##损
##捡
##换
##捣
##捧
##捨
##捩
##据
##捱
##捲
##捶
##捷
##捺
##捻
##掀
##掂
##掃
##掇
##授
##掉
##掌
##掏
##掐
##排
##掖
##掘
##掙
##掛
##掠
##採
##探
##掣
##接
##控
##推
##掩
##措
##掬
##掰
##掲
##掳
##掴
##掷
##掸
##掺
##揀
##揃
##揄
##揆
##揉
##揍
##描
##提
##插
##揖
##揚
##換
##握
##揣
##揩
##揪
##揭
##揮
##援
##揶
##揸
##揹
##揽
##搀
##搁
##搂
##搅
##損
##搏
##搐
##搓
##搔
##搖
##搗
##搜
##搞
##搡
##搪
##搬
##搭
##搵
##搶
##携
##搽
##摀
##摁
##摄
##摆
##摇
##摈
##摊
##摒
##摔
##摘
##摞
##摟
##摧
##摩
##摯
##摳
##摸
##摹
##摺
##摻
##撂
##撃
##撅
##撇
##撈
##撐
##撑
##撒
##撓
##撕
##撚
##撞
##撤
##撥
##撩
##撫
##撬
##播
##撮
##撰
##撲
##撵
##撷
##撸
##撻
##撼
##撿
##擀
##擁
##擂
##擄
##擅
##擇
##擊
##擋
##操
##擎
##擒
##擔
##擘
##據
##擞
##擠
##擡
##擢
##擦
##擬
##擰
##擱
##擲
##擴
##擷
##擺
##擼
##擾
##攀
##攏
##攒
##攔
##攘
##攙
##攜
##攝
##攞
##攢
##攣
##攤
##攥
##攪
##攫
##攬
##支
##收
##攸
##改
##攻
##放
##政
##故
##效
##敌
##敍
##敎
##敏
##救
##敕
##敖
##敗
##敘
##教
##敛
##敝
##敞
##敢
##散
##敦
##敬
##数
##敲
##整
##敵
##敷
##數
##斂
##斃
##文
##斋
##斌
##斎
##斐
##斑
##斓
##斗
##料
##斛
##斜
##斟
##斡
##斤
##斥
##斧
##斩
##斫
##斬
##断
##斯
##新
##斷
##方
##於
##施
##旁
##旃
##旅
##旋
##旌
##旎
##族
##旖
##旗
##无
##既
##日
##旦
##旧
##旨
##早
##旬
##旭
##旮
##旱
##时
##旷
##旺
##旻
##昀
##昂
##昆
##昇
##昉
##昊
##昌
##明
##昏
##易
##昔
##昕
##昙
##星
##映
##春
##昧
##昨
##昭
##是
##昱
##昴
##昵
##昶
##昼
##显
##晁
##時
##晃
##晉
##晋
##晌
##晏
##晒
##晓
##晔
##晕
##晖
##晗
##晚
##晝
##晞
##晟
##晤
##晦
##晨
##晩
##普
##景
##晰
##晴
##晶
##晷
##智
##晾
##暂
##暄
##暇
##暈
##暉
##暌
##暐
##暑
##暖
##暗
##暝
##暢
##暧
##暨
##暫
##暮
##暱
##暴
##暸
##暹
##曄
##曆
##曇
##曉
##曖
##曙
##曜
##曝
##曠
##曦
##曬
##曰
##曲
##曳
##更
##書
##曹
##曼
##曾
##替
##最
##會
##月
##有
##朋
##服
##朐
##朔
##朕
##朗
##望
##朝
##期
##朦
##朧
##木
##未
##末
##本
##札
##朮
##术
##朱
##朴
##朵
##机
##朽
##杀
##杂
##权
##杆
##杈
##杉
##李
##杏
##材
##村
##杓
##杖
##杜
##杞
##束
##杠
##条
##来
##杨
##杭
##杯
##杰
##東
##杳
##杵
##杷
##杼
##松
##板
##极
##构
##枇
##枉
##枋
##析
##枕
##林
##枚
##果
##枝
##枢
##枣
##枪
##枫
##枭
##枯
##枰
##枱
##枳
##架
##枷
##枸
##柄
##柏
##某
##柑
##柒
##染
##柔
##柘
##柚
##柜
##柞
##柠
##柢
##查
##柩
##柬
##柯
##柱
##柳
##柴
##柵
##査
##柿
##栀
##栃
##栄
##栅
##标
##栈
##栉
##栋
##栎
##栏
##树
##栓
##栖
##栗
##校
##栩
##株
##样
##核
##根
##格
##栽
##栾
##桀
##桁
##桂
##桃
##桅
##框
##案
##桉
##桌
##桎
##桐
##桑
##桓
##桔
##桜
##桠
##桡
##桢
##档
##桥
##桦
##桧
##桨
##桩
##桶
##桿
##梁
##梅
##梆
##梏
##梓
##梗
##條
##梟
##梢
##梦
##梧
##梨
##梭
##梯
##械
##梳
##梵
##梶
##检
##棂
##棄
##棉
##棋
##棍
##棒
##棕
##棗
##棘
##棚
##棟
##棠
##棣
##棧
##森
##棱
##棲
##棵
##棹
##棺
##椁
##椅
##椋
##植
##椎
##椒
##検
##椪
##椭
##椰
##椹
##椽
##椿
##楂
##楊
##楓
##楔
##楚
##楝
##楞
##楠
##楣
##楨
##楫
##業
##楮
##極
##楷
##楸
##楹
##楼
##楽
##概
##榄
##榆
##榈
##榉
##榔
##榕
##榖
##榛
##榜
##榨
##榫
##榭
##榮
##榱
##榴
##榷
##榻
##槁
##槃
##構
##槌
##槍
##槎
##槐
##槓
##様
##槛
##槟
##槤
##槭
##槲
##槳
##槻
##槽
##槿
##樁
##樂
##樊
##樑
##樓
##標
##樞
##樟
##模
##樣
##権
##横
##樫
##樯
##樱
##樵
##樸
##樹
##樺
##樽
##樾
##橄
##橇
##橋
##橐
##橘
##橙
##機
##橡
##橢
##橫
##橱
##橹
##橼
##檀
##檄
##檎
##檐
##檔
##檗
##檜
##檢
##檬
##檯
##檳
##檸
##檻
##櫃
##櫚
##櫛
##櫥
##櫸
##櫻
##欄
##權
##欒
##欖
##欠
##次
##欢
##欣
##欧
##欲
##欸
##欺
##欽
##款
##歆
##歇
##歉
##歌
##歎
##歐
##歓
##歙
##歛
##歡
##止
##正
##此
##步
##武
##歧
##歩
##歪
##歯
##歲
##歳
##歴
##歷
##歸
##歹
##死
##歼
##殁
##殃
##殆
##殇
##殉
##殊
##残
##殒
##殓
##殖
##殘
##殞
##殡
##殤
##殭
##殯
##殲
##殴
##段
##殷
##殺
##殼
##殿
##毀
##毁
##毂
##毅
##毆
##毋
##母
##毎
##每
##毒
##毓
##比
##毕
##毗
##毘
##毙
##毛
##毡
##毫
##毯
##毽
##氈
##氏
##氐
##民
##氓
##气
##氖
##気
##氙
##氛
##氟
##氡
##氢
##氣
##氤
##氦
##氧
##氨
##氪
##氫
##氮
##氯
##氰
##氲
##水
##氷
##永
##氹
##氾
##汀
##汁
##求
##汆
##汇
##汉
##汎
##汐
##汕
##汗
##汙
##汛
##汝
##汞
##江
##池
##污
##汤
##汨
##汩
##汪
##汰
##汲
##汴
##汶
##汹
##決
##汽
##汾
##沁
##沂
##沃
##沅
##沈
##沉
##沌
##沏
##沐
##沒
##沓
##沖
##沙
##沛
##沟
##没
##沢
##沣
##沥
##沦
##沧
##沪
##沫
##沭
##沮
##沱
##河
##沸
##油
##治
##沼
##沽
##沾
##沿
##況
##泄
##泉
##泊
##泌
##泓
##法
##泗
##泛
##泞
##泠
##泡
##波
##泣
##泥
##注
##泪
##泫
##泮
##泯
##泰
##泱
##泳
##泵
##泷
##泸
##泻
##泼
##泽
##泾
##洁
##洄
##洋
##洒
##洗
##洙
##洛
##洞
##津
##洩
##洪
##洮
##洱
##洲
##洵
##洶
##洸
##洹
##活
##洼
##洽
##派
##流
##浃
##浄
##浅
##浆
##浇
##浊
##测
##济
##浏
##浑
##浒
##浓
##浔
##浙
##浚
##浜
##浣
##浦
##浩
##浪
##浬
##浮
##浯
##浴
##海
##浸
##涂
##涅
##涇
##消
##涉
##涌
##涎
##涓
##涔
##涕
##涙
##涛
##涝
##涞
##涟
##涠
##涡
##涣
##涤
##润
##涧
##涨
##涩
##涪
##涮
##涯
##液
##涵
##涸
##涼
##涿
##淀
##淄
##淅
##淆
##淇
##淋
##淌
##淑
##淒
##淖
##淘
##淙
##淚
##淞
##淡
##淤
##淦
##淨
##淩
##淪
##淫
##淬
##淮
##深
##淳
##淵
##混
##淹
##淺
##添
##淼
##清
##済
##渉
##渊
##渋
##渍
##渎
##渐
##渔
##渗
##渙
##渚
##減
##渝
##渠
##渡
##渣
##渤
##渥
##渦
##温
##測
##渭
##港
##渲
##渴
##游
##渺
##渾
##湃
##湄
##湊
##湍
##湖
##湘
##湛
##湟
##湧
##湫
##湮
##湯
##湳
##湾
##湿
##満
##溃
##溅
##溉
##溏
##源
##準
##溜
##溝
##溟
##溢
##溥
##溧
##溪
##溫
##溯
##溱
##溴
##溶
##溺
##溼
##滁
##滂
##滄
##滅
##滇
##滋
##滌
##滑
##滓
##滔
##滕
##滙
##滚
##滝
##滞
##滟
##满
##滢
##滤
##滥
##滦
##滨
##滩
##滬
##滯
##滲
##滴
##滷
##滸
##滾
##滿
##漁
##漂
##漆
##漉
##漏
##漓
##演
##漕
##漠
##漢
##漣
##漩
##漪
##漫
##漬
##漯
##漱
##漲
##漳
##漸
##漾
##漿
##潆
##潇
##潋
##潍
##潑
##潔
##潘
##潛
##潜
##潞
##潟
##潢
##潤
##潦
##潧
##潭
##潮
##潰
##潴
##潸
##潺
##潼
##澀
##澄
##澆
##澈
##澍
##澎
##澗
##澜
##澡
##澤
##澧
##澱
##澳
##澹
##激
##濁
##濂
##濃
##濑
##濒
##濕
##濘
##濛
##濟
##濠
##濡
##濤
##濫
##濬
##濮
##濯
##濱
##濺
##濾
##瀅
##瀆
##瀉
##瀋
##瀏
##瀑
##瀕
##瀘
##瀚
##瀛
##瀝
##瀞
##瀟
##瀧
##瀨
##瀬
##瀰
##瀾
##灌
##灏
##灑
##灘
##灝
##灞
##灣
##火
##灬
##灭
##灯
##灰
##灵
##灶
##灸
##灼
##災
##灾
##灿
##炀
##炁
##炅
##炉
##炊
##炎
##炒
##炔
##炕
##炖
##炙
##炜
##炫
##炬
##炭
##炮
##炯
##炳
##炷
##炸
##点
##為
##炼
##炽
##烁
##烂
##烃
##烈
##烊
##烏
##烘
##烙
##烛
##烟
##烤
##烦
##烧
##烨
##烩
##烫
##烬
##热
##烯
##烷
##烹
##烽
##焉
##焊
##焕
##焖
##焗
##焘
##焙
##焚
##焜
##無
##焦
##焯
##焰
##焱
##然
##焼
##煅
##煉
##煊
##煌
##煎
##煒
##煖
##煙
##煜
##煞
##煤
##煥
##煦
##照
##煨
##煩
##煮
##煲
##煸
##煽
##熄
##熊
##熏
##熒
##熔
##熙
##熟
##熠
##熨
##熬
##熱
##熵
##熹
##熾
##燁
##燃
##燄
##燈
##燉
##燊
##燎
##燒
##燔
##燕
##燙
##燜
##營
##燥
##燦
##燧
##燭
##燮
##燴
##燻
##燼
##燿
##爆
##爍
##爐
##爛
##爪
##爬
##爭
##爰
##爱
##爲
##爵
##父
##爷
##爸
##爹
##爺
##爻
##爽
##爾
##牆
##片
##版
##牌
##牍
##牒
##牙
##牛
##牝
##牟
##牠
##牡
##牢
##牦
##牧
##物
##牯
##牲
##牴
##牵
##特
##牺
##牽
##犀
##犁
##犄
##犊
##犍
##犒
##犢
##犧
##犬
##犯
##状
##犷
##犸
##犹
##狀
##狂
##狄
##狈
##狎
##狐
##狒
##狗
##狙
##狞
##狠
##狡
##狩
##独
##狭
##狮
##狰
##狱
##狸
##狹
##狼
##狽
##猎
##猕
##猖
##猗
##猙
##猛
##猜
##猝
##猥
##猩
##猪
##猫
##猬
##献
##猴
##猶
##猷
##猾
##猿
##獄
##獅
##獎
##獐
##獒
##獗
##獠
##獣
##獨
##獭
##獰
##獲
##獵
##獷
##獸
##獺
##獻
##獼
##獾
##玄
##率
##玉
##王
##玑
##玖
##玛
##玟
##玠
##玥
##玩
##玫
##玮
##环
##现
##玲
##玳
##玷
##玺
##玻
##珀
##珂
##珅
##珈
##珉
##珊
##珍
##珏
##珐
##珑
##珙
##珞
##珠
##珣
##珥
##珩
##珪
##班
##珮
##珲
##珺
##現
##球
##琅
##理
##琇
##琉
##琊
##琍
##琏
##琐
##琛
##琢
##琥
##琦
##琨
##琪
##琬
##琮
##琰
##琲
##琳
##琴
##琵
##琶
##琺
##琼
##瑀
##瑁
##瑄
##瑋
##瑕
##瑗
##瑙
##瑚
##瑛
##瑜
##瑞
##瑟
##瑠
##瑣
##瑤
##瑩
##瑪
##瑯
##瑰
##瑶
##瑾
##璀
##璁
##璃
##璇
##璉
##璋
##璎
##璐
##璜
##璞
##璟
##璧
##璨
##環
##璽
##璿
##瓊
##瓏
##瓒
##瓜
##瓢
##瓣
##瓤
##瓦
##瓮
##瓯
##瓴
##瓶
##瓷
##甄
##甌
##甕
##甘
##甙
##甚
##甜
##生
##產
##産
##甥
##甦
##用
##甩
##甫
##甬
##甭
##甯
##田
##由
##甲
##申
##电
##男
##甸
##町
##画
##甾
##畀
##畅
##界
##畏
##畑
##畔
##留
##畜
##畝
##畢
##略
##畦
##番
##畫
##異
##畲
##畳
##畴
##當
##畸
##畹
##畿
##疆
##疇
##疊
##疏
##疑
##疔
##疖
##疗
##疙
##疚
##疝
##疟
##疡
##疣
##疤
##疥
##疫
##疮
##疯
##疱
##疲
##疳
##疵
##疸
##疹
##疼
##疽
##疾
##痂
##病
##症
##痈
##痉
##痊
##痍
##痒
##痔
##痕
##痘
##痙
##痛
##痞
##痠
##痢
##痣
##痤
##痧
##痨
##痪
##痫
##痰
##痱
##痴
##痹
##痺
##痼
##痿
##瘀
##瘁
##瘋
##瘍
##瘓
##瘘
##瘙
##瘟
##瘠
##瘡
##瘢
##瘤
##瘦
##瘧
##瘩
##瘪
##瘫
##瘴
##瘸
##瘾
##療
##癇
##癌
##癒
##癖
##癜
##癞
##癡
##癢
##癣
##癥
##癫
##癬
##癮
##癱
##癲
##癸
##発
##登
##發
##白
##百
##皂
##的
##皆
##皇
##皈
##皋
##皎
##皑
##皓
##皖
##皙
##皚
##皮
##皰
##皱
##皴
##皺
##皿
##盂
##盃
##盅
##盆
##盈
##益
##盎
##盏
##盐
##监
##盒
##盔
##盖
##盗
##盘
##盛
##盜
##盞
##盟
##盡
##監
##盤
##盥
##盧
##盪
##目
##盯
##盱
##盲
##直
##相
##盹
##盼
##盾
##省
##眈
##眉
##看
##県
##眙
##眞
##真
##眠
##眦
##眨
##眩
##眯
##眶
##眷
##眸
##眺
##眼
##眾
##着
##睁
##睇
##睏
##睐
##睑
##睛
##睜
##睞
##睡
##睢
##督
##睥
##睦
##睨
##睪
##睫
##睬
##睹
##睽
##睾
##睿
##瞄
##瞅
##瞇
##瞋
##瞌
##瞎
##瞑
##瞒
##瞓
##瞞
##瞟
##瞠
##瞥
##瞧
##瞩
##瞪
##瞬
##瞭
##瞰
##瞳
##瞻
##瞼
##瞿
##矇
##矍
##矗
##矚
##矛
##矜
##矢
##矣
##知
##矩
##矫
##短
##矮
##矯
##石
##矶
##矽
##矾
##矿
##码
##砂
##砌
##砍
##砒
##研
##砖
##砗
##砚
##砝
##砣
##砥
##砧
##砭
##砰
##砲
##破
##砷
##砸
##砺
##砼
##砾
##础
##硅
##硐
##硒
##硕
##硝
##硫
##硬
##确
##硯
##硼
##碁
##碇
##碉
##碌
##碍
##碎
##碑
##碓
##碗
##碘
##碚
##碛
##碟
##碣
##碧
##碩
##碰
##碱
##碳
##碴
##確
##碼
##碾
##磁
##磅
##磊
##磋
##磐
##磕
##磚
##磡
##磨
##磬
##磯
##磲
##磷
##磺
##礁
##礎
##礙
##礡
##礦
##礪
##礫
##礴
##示
##礼
##社
##祀
##祁
##祂
##祇
##祈
##祉
##祎
##祐
##祕
##祖
##祗
##祚
##祛
##祜
##祝
##神
##祟
##祠
##祢
##祥
##票
##祭
##祯
##祷
##祸
##祺
##祿
##禀
##禁
##禄
##禅
##禍
##禎
##福
##禛
##禦
##禧
##禪
##禮
##禱
##禹
##禺
##离
##禽
##禾
##禿
##秀
##私
##秃
##秆
##秉
##秋
##种
##科
##秒
##秘
##租
##秣
##秤
##秦
##秧
##秩
##秭
##积
##称
##秸
##移
##秽
##稀
##稅
##程
##稍
##税
##稔
##稗
##稚
##稜
##稞
##稟
##稠
##稣
##種
##稱
##稲
##稳
##稷
##稹
##稻
##稼
##稽
##稿
##穀
##穂
##穆
##穌
##積
##穎
##穗
##穢
##穩
##穫
##穴
##究
##穷
##穹
##空
##穿
##突
##窃
##窄
##窈
##窍
##窑
##窒
##窓
##窕
##窖
##窗
##窘
##窜
##窝
##窟
##窠
##窥
##窦
##窨
##窩
##窪
##窮
##窯
##窺
##窿
##竄
##竅
##竇
##竊
##立
##竖
##站
##竜
##竞
##竟
##章
##竣
##童
##竭
##端
##競
##竹
##竺
##竽
##竿
##笃
##笆
##笈
##笋
##笏
##笑
##笔
##笙
##笛
##笞
##笠
##符
##笨
##第
##笹
##笺
##笼
##筆
##等
##筊
##筋
##筍
##筏
##筐
##筑
##筒
##答
##策
##筛
##筝
##筠
##筱
##筲
##筵
##筷
##筹
##签
##简
##箇
##箋
##箍
##箏
##箐
##箔
##箕
##算
##箝
##管
##箩
##箫
##箭
##箱
##箴
##箸
##節
##篁
##範
##篆
##篇
##築
##篑
##篓
##篙
##篝
##篠
##篡
##篤
##篩
##篪
##篮
##篱
##篷
##簇
##簌
##簍
##簡
##簦
##簧
##簪
##簫
##簷
##簸
##簽
##簾
##簿
##籁
##籃
##籌
##籍
##籐
##籟
##籠
##籤
##籬
##籮
##籲
##米
##类
##籼
##籽
##粄
##粉
##粑
##粒
##粕
##粗
##粘
##粟
##粤
##粥
##粧
##粪
##粮
##粱
##粲
##粳
##粵
##粹
##粼
##粽
##精
##粿
##糅
##糊
##糍
##糕
##糖
##糗
##糙
##糜
##糞
##糟
##糠
##糧
##糬
##糯
##糰
##糸
##系
##糾
##紀
##紂
##約
##紅
##紉
##紊
##紋
##納
##紐
##紓
##純
##紗
##紘
##紙
##級
##紛
##紜
##素
##紡
##索
##紧
##紫
##紮
##累
##細
##紳
##紹
##紺
##終
##絃
##組
##絆
##経
##結
##絕
##絞
##絡
##絢
##給
##絨
##絮
##統
##絲
##絳
##絵
##絶
##絹
##綁
##綏
##綑
##經
##継
##続
##綜
##綠
##綢
##綦
##綫
##綬
##維
##綱
##網
##綴
##綵
##綸
##綺
##綻
##綽
##綾
##綿
##緊
##緋
##総
##緑
##緒
##緘
##線
##緝
##緞
##締
##緣
##編
##緩
##緬
##緯
##練
##緹
##緻
##縁
##縄
##縈
##縛
##縝
##縣
##縫
##縮
##縱
##縴
##縷
##總
##績
##繁
##繃
##繆
##繇
##繋
##織
##繕
##繚
##繞
##繡
##繩
##繪
##繫
##繭
##繳
##繹
##繼
##繽
##纂
##續
##纍
##纏
##纓
##纔
##纖
##纜
##纠
##红
##纣
##纤
##约
##级
##纨
##纪
##纫
##纬
##纭
##纯
##纰
##纱
##纲
##纳
##纵
##纶
##纷
##纸
##纹
##纺
##纽
##纾
##线
##绀
##练
##组
##绅
##细
##织
##终
##绊
##绍
##绎
##经
##绑
##绒
##结
##绔
##绕
##绘
##给
##绚
##绛
##络
##绝
##绞
##统
##绡
##绢
##绣
##绥
##绦
##继
##绩
##绪
##绫
##续
##绮
##绯
##绰
##绳
##维
##绵
##绶
##绷
##绸
##绻
##综
##绽
##绾
##绿
##缀
##缄
##缅
##缆
##缇
##缈
##缉
##缎
##缓
##缔
##缕
##编
##缘
##缙
##缚
##缜
##缝
##缠
##缢
##缤
##缥
##缨
##缩
##缪
##缭
##缮
##缰
##缱
##缴
##缸
##缺
##缽
##罂
##罄
##罌
##罐
##网
##罔
##罕
##罗
##罚
##罡
##罢
##罩
##罪
##置
##罰
##署
##罵
##罷
##罹
##羁
##羅
##羈
##羊
##羌
##美
##羔
##羚
##羞
##羟
##羡
##羣
##群
##羥
##羧
##羨
##義
##羯
##羲
##羸
##羹
##羽
##羿
##翁
##翅
##翊
##翌
##翎
##習
##翔
##翘
##翟
##翠
##翡
##翦
##翩
##翰
##翱
##翳
##翹
##翻
##翼
##耀
##老
##考
##耄
##者
##耆
##耋
##而
##耍
##耐
##耒
##耕
##耗
##耘
##耙
##耦
##耨
##耳
##耶
##耷
##耸
##耻
##耽
##耿
##聂
##聆
##聊
##聋
##职
##聒
##联
##聖
##聘
##聚
##聞
##聪
##聯
##聰
##聲
##聳
##聴
##聶
##職
##聽
##聾
##聿
##肃
##肄
##肅
##肆
##肇
##肉
##肋
##肌
##肏
##肓
##肖
##肘
##肚
##肛
##肝
##肠
##股
##肢
##肤
##肥
##肩
##肪
##肮
##肯
##肱
##育
##肴
##肺
##肽
##肾
##肿
##胀
##胁
##胃
##胄
##胆
##背
##胍
##胎
##胖
##胚
##胛
##胜
##胝
##胞
##胡
##胤
##胥
##胧
##胫
##胭
##胯
##胰
##胱
##胳
##胴
##胶
##胸
##胺
##能
##脂
##脅
##脆
##脇
##脈
##脉
##脊
##脍
##脏
##脐
##脑
##脓
##脖
##脘
##脚
##脛
##脣
##脩
##脫
##脯
##脱
##脲
##脳
##脸
##脹
##脾
##腆
##腈
##腊
##腋
##腌
##腎
##腐
##腑
##腓
##腔
##腕
##腥
##腦
##腩
##腫
##腭
##腮
##腰
##腱
##腳
##腴
##腸
##腹
##腺
##腻
##腼
##腾
##腿
##膀
##膈
##膊
##膏
##膑
##膘
##膚
##膛
##膜
##膝
##膠
##膦
##膨
##膩
##膳
##膺
##膻
##膽
##膾
##膿
##臀
##臂
##臃
##臆
##臉
##臊
##臍
##臓
##臘
##臟
##臣
##臥
##臧
##臨
##自
##臬
##臭
##至
##致
##臺
##臻
##臼
##臾
##舀
##舂
##舅
##舆
##與
##興
##舉
##舊
##舌
##舍
##舎
##舐
##舒
##舔
##舖
##舗
##舛
##舜
##舞
##舟
##航
##舫
##般
##舰
##舱
##舵
##舶
##舷
##舸
##船
##舺
##舾
##艇
##艋
##艘
##艙
##艦
##艮
##良
##艰
##艱
##色
##艳
##艷
##艹
##艺
##艾
##节
##芃
##芈
##芊
##芋
##芍
##芎
##芒
##芙
##芜
##芝
##芡
##芥
##芦
##芩
##芪
##芫
##芬
##芭
##芮
##芯
##花
##芳
##芷
##芸
##芹
##芻
##芽
##芾
##苁
##苄
##苇
##苋
##苍
##苏
##苑
##苒
##苓
##苔
##苕
##苗
##苛
##苜
##苞
##苟
##苡
##苣
##若
##苦
##苫
##苯
##英
##苷
##苹
##苻
##茁
##茂
##范
##茄
##茅
##茉
##茎
##茏
##茗
##茜
##茧
##茨
##茫
##茬
##茭
##茯
##茱
##茲
##茴
##茵
##茶
##茸
##茹
##茼
##荀
##荃
##荆
##草
##荊
##荏
##荐
##荒
##荔
##荖
##荘
##荚
##荞
##荟
##荠
##荡
##荣
##荤
##荥
##荧
##荨
##荪
##荫
##药
##荳
##荷
##荸
##荻
##荼
##荽
##莅
##莆
##莉
##莊
##莎
##莒
##莓
##莖
##莘
##莞
##莠
##莢
##莧
##莪
##莫
##莱
##莲
##莴
##获
##莹
##莺
##莽
##莿
##菀
##菁
##菅
##菇
##菈
##菊
##菌
##菏
##菓
##菖
##菘
##菜
##菟
##菠
##菡
##菩
##華
##菱
##菲
##菸
##菽
##萁
##萃
##萄
##萊
##萋
##萌
##萍
##萎
##萘
##萝
##萤
##营
##萦
##萧
##萨
##萩
##萬
##萱
##萵
##萸
##萼
##落
##葆
##葉
##著
##葚
##葛
##葡
##董
##葦
##葩
##葫
##葬
##葭
##葯
##葱
##葳
##葵
##葷
##葺
##蒂
##蒋
##蒐
##蒔
##蒙
##蒜
##蒞
##蒟
##蒡
##蒨
##蒲
##蒸
##蒹
##蒻
##蒼
##蒿
##蓁
##蓄
##蓆
##蓉
##蓋
##蓑
##蓓
##蓖
##蓝
##蓟
##蓦
##蓬
##蓮
##蓼
##蓿
##蔑
##蔓
##蔔
##蔗
##蔘
##蔚
##蔡
##蔣
##蔥
##蔫
##蔬
##蔭
##蔵
##蔷
##蔺
##蔻
##蔼
##蔽
##蕁
##蕃
##蕈
##蕉
##蕊
##蕎
##蕙
##蕤
##蕨
##蕩
##蕪
##蕭
##蕲
##蕴
##蕻
##蕾
##薄
##薅
##薇
##薈
##薊
##薏
##薑
##薔
##薙
##薛
##薦
##薨
##薩
##薪
##薬
##薯
##薰
##薹
##藉
##藍
##藏
##藐
##藓
##藕
##藜
##藝
##藤
##藥
##藩
##藹
##藻
##藿
##蘆
##蘇
##蘊
##蘋
##蘑
##蘚
##蘭
##蘸
##蘼
##蘿
##虎
##虏
##虐
##虑
##虔
##處
##虚
##虛
##虜
##虞
##號
##虢
##虧
##虫
##虬
##虱
##虹
##虻
##虽
##虾
##蚀
##蚁
##蚂
##蚊
##蚌
##蚓
##蚕
##蚜
##蚝
##蚣
##蚤
##蚩
##蚪
##蚯
##蚱
##蚵
##蛀
##蛆
##蛇
##蛊
##蛋
##蛎
##蛐
##蛔
##蛙
##蛛
##蛟
##蛤
##蛭
##蛮
##蛰
##蛳
##蛹
##蛻
##蛾
##蜀
##蜂
##蜃
##蜆
##蜇
##蜈
##蜊
##蜍
##蜒
##蜓
##蜕
##蜗
##蜘
##蜚
##蜜
##蜡
##蜢
##蜥
##蜱
##蜴
##蜷
##蜻
##蜿
##蝇
##蝈
##蝉
##蝌
##蝎
##蝕
##蝗
##蝙
##蝟
##蝠
##蝦
##蝨
##蝴
##蝶
##蝸
##蝼
##螂
##螃
##融
##螞
##螢
##螨
##螯
##螳
##螺
##蟀
##蟄
##蟆
##蟋
##蟎
##蟑
##蟒
##蟠
##蟬
##蟲
##蟹
##蟻
##蟾
##蠅
##蠍
##蠔
##蠕
##蠛
##蠟
##蠡
##蠢
##蠣
##蠱
##蠶
##蠹
##蠻
##血
##衄
##衅
##衆
##行
##衍
##術
##衔
##街
##衙
##衛
##衝
##衞
##衡
##衢
##衣
##补
##表
##衩
##衫
##衬
##衮
##衰
##衲
##衷
##衹
##衾
##衿
##袁
##袂
##袄
##袅
##袈
##袋
##袍
##袒
##袖
##袜
##袞
##袤
##袪
##被
##袭
##袱
##裁
##裂
##装
##裆
##裊
##裏
##裔
##裕
##裘
##裙
##補
##裝
##裟
##裡
##裤
##裨
##裱
##裳
##裴
##裸
##裹
##製
##裾
##褂
##複
##褐
##褒
##褓
##褔
##褚
##褥
##褪
##褫
##褲
##褶
##褻
##襁
##襄
##襟
##襠
##襪
##襬
##襯
##襲
##西
##要
##覃
##覆
##覇
##見
##規
##覓
##視
##覚
##覦
##覧
##親
##覬
##観
##覷
##覺
##覽
##觀
##见
##观
##规
##觅
##视
##览
##觉
##觊
##觎
##觐
##觑
##角
##觞
##解
##觥
##触
##觸
##言
##訂
##計
##訊
##討
##訓
##訕
##訖
##託
##記
##訛
##訝
##訟
##訣
##訥
##訪
##設
##許
##訳
##訴
##訶
##診
##註
##証
##詆
##詐
##詔
##評
##詛
##詞
##詠
##詡
##詢
##詣
##試
##詩
##詫
##詬
##詭
##詮
##詰
##話
##該
##詳
##詹
##詼
##誅
##誇
##誉
##誌
##認
##誓
##誕
##誘
##語
##誠
##誡
##誣
##誤
##誥
##誦
##誨
##說
##説
##読
##誰
##課
##誹
##誼
##調
##諄
##談
##請
##諏
##諒
##論
##諗
##諜
##諡
##諦
##諧
##諫
##諭
##諮
##諱
##諳
##諷
##諸
##諺
##諾
##謀
##謁
##謂
##謄
##謊
##謎
##謐
##謔
##謗
##謙
##講
##謝
##謠
##謨
##謬
##謹
##謾
##譁
##證
##譎
##譏
##識
##譙
##譚
##譜
##警
##譬
##譯
##議
##譲
##譴
##護
##譽
##讀
##變
##讓
##讚
##讞
##计
##订
##认
##讥
##讧
##讨
##让
##讪
##讫
##训
##议
##讯
##记
##讲
##讳
##讴
##讶
##讷
##许
##讹
##论
##讼
##讽
##设
##访
##诀
##证
##诃
##评
##诅
##识
##诈
##诉
##诊
##诋
##词
##诏
##译
##试
##诗
##诘
##诙
##诚
##诛
##话
##诞
##诟
##诠
##诡
##询
##诣
##诤
##该
##详
##诧
##诩
##诫
##诬
##语
##误
##诰
##诱
##诲
##说
##诵
##诶
##请
##诸
##诺
##读
##诽
##课
##诿
##谀
##谁
##调
##谄
##谅
##谆
##谈
##谊
##谋
##谌
##谍
##谎
##谏
##谐
##谑
##谒
##谓
##谔
##谕
##谗
##谘
##谙
##谚
##谛
##谜
##谟
##谢
##谣
##谤
##谥
##谦
##谧
##谨
##谩
##谪
##谬
##谭
##谯
##谱
##谲
##谴
##谶
##谷
##豁
##豆
##豇
##豈
##豉
##豊
##豌
##豎
##豐
##豔
##豚
##象
##豢
##豪
##豫
##豬
##豹
##豺
##貂
##貅
##貌
##貓
##貔
##貘
##貝
##貞
##負
##財
##貢
##貧
##貨
##販
##貪
##貫
##責
##貯
##貰
##貳
##貴
##貶
##買
##貸
##費
##貼
##貽
##貿
##賀
##賁
##賂
##賃
##賄
##資
##賈
##賊
##賑
##賓
##賜
##賞
##賠
##賡
##賢
##賣
##賤
##賦
##質
##賬
##賭
##賴
##賺
##購
##賽
##贅
##贈
##贊
##贍
##贏
##贓
##贖
##贛
##贝
##贞
##负
##贡
##财
##责
##贤
##败
##账
##货
##质
##贩
##贪
##贫
##贬
##购
##贮
##贯
##贰
##贱
##贲
##贴
##贵
##贷
##贸
##费
##贺
##贻
##贼
##贾
##贿
##赁
##赂
##赃
##资
##赅
##赈
##赊
##赋
##赌
##赎
##赏
##赐
##赓
##赔
##赖
##赘
##赚
##赛
##赝
##赞
##赠
##赡
##赢
##赣
##赤
##赦
##赧
##赫
##赭
##走
##赳
##赴
##赵
##赶
##起
##趁
##超
##越
##趋
##趕
##趙
##趟
##趣
##趨
##足
##趴
##趵
##趸
##趺
##趾
##跃
##跄
##跆
##跋
##跌
##跎
##跑
##跖
##跚
##跛
##距
##跟
##跡
##跤
##跨
##跩
##跪
##路
##跳
##践
##跷
##跹
##跺
##跻
##踉
##踊
##踌
##踏
##踐
##踝
##踞
##踟
##踢
##踩
##踪
##踮
##踱
##踴
##踵
##踹
##蹂
##蹄
##蹇
##蹈
##蹉
##蹊
##蹋
##蹑
##蹒
##蹙
##蹟
##蹣
##蹤
##蹦
##蹩
##蹬
##蹭
##蹲
##蹴
##蹶
##蹺
##蹼
##蹿
##躁
##躇
##躉
##躊
##躋
##躍
##躏
##躪
##身
##躬
##躯
##躲
##躺
##軀
##車
##軋
##軌
##軍
##軒
##軟
##転
##軸
##軼
##軽
##軾
##較
##載
##輒
##輓
##輔
##輕
##輛
##輝
##輟
##輩
##輪
##輯
##輸
##輻
##輾
##輿
##轄
##轅
##轆
##轉
##轍
##轎
##轟
##车
##轧
##轨
##轩
##转
##轭
##轮
##软
##轰
##轲
##轴
##轶
##轻
##轼
##载
##轿
##较
##辄
##辅
##辆
##辇
##辈
##辉
##辊
##辍
##辐
##辑
##输
##辕
##辖
##辗
##辘
##辙
##辛
##辜
##辞
##辟
##辣
##辦
##辨
##辩
##辫
##辭
##辮
##辯
##辰
##辱
##農
##边
##辺
##辻
##込
##辽
##达
##迁
##迂
##迄
##迅
##过
##迈
##迎
##运
##近
##返
##还
##这
##进
##远
##违
##连
##迟
##迢
##迤
##迥
##迦
##迩
##迪
##迫
##迭
##述
##迴
##迷
##迸
##迹
##迺
##追
##退
##送
##适
##逃
##逅
##逆
##选
##逊
##逍
##透
##逐
##递
##途
##逕
##逗
##這
##通
##逛
##逝
##逞
##速
##造
##逢
##連
##逮
##週
##進
##逵
##逶
##逸
##逻
##逼
##逾
##遁
##遂
##遅
##遇
##遊
##運
##遍
##過
##遏
##遐
##遑
##遒
##道
##達
##違
##遗
##遙
##遛
##遜
##遞
##遠
##遢
##遣
##遥
##遨
##適
##遭
##遮
##遲
##遴
##遵
##遶
##遷
##選
##遺
##遼
##遽
##避
##邀
##邁
##邂
##邃
##還
##邇
##邈
##邊
##邋
##邏
##邑
##邓
##邕
##邛
##邝
##邢
##那
##邦
##邨
##邪
##邬
##邮
##邯
##邰
##邱
##邳
##邵
##邸
##邹
##邺
##邻
##郁
##郅
##郊
##郎
##郑
##郜
##郝
##郡
##郢
##郤
##郦
##郧
##部
##郫
##郭
##郴
##郵
##郷
##郸
##都
##鄂
##鄉
##鄒
##鄔
##鄙
##鄞
##鄢
##鄧
##鄭
##鄰
##鄱
##鄲
##鄺
##酉
##酊
##酋
##酌
##配
##酐
##酒
##酗
##酚
##酝
##酢
##酣
##酥
##酩
##酪
##酬
##酮
##酯
##酰
##酱
##酵
##酶
##酷
##酸
##酿
##醃
##醇
##醉
##醋
##醍
##醐
##醒
##醚
##醛
##醜
##醞
##醣
##醪
##醫
##醬
##醮
##醯
##醴
##醺
##釀
##釁
##采
##釉
##释
##釋
##里
##重
##野
##量
##釐
##金
##釗
##釘
##釜
##針
##釣
##釦
##釧
##釵
##鈀
##鈉
##鈍
##鈎
##鈔
##鈕
##鈞
##鈣
##鈦
##鈪
##鈴
##鈺
##鈾
##鉀
##鉄
##鉅
##鉉
##鉑
##鉗
##鉚
##鉛
##鉤
##鉴
##鉻
##銀
##銃
##銅
##銑
##銓
##銖
##銘
##銜
##銬
##銭
##銮
##銳
##銷
##銹
##鋁
##鋅
##鋒
##鋤
##鋪
##鋰
##鋸
##鋼
##錄
##錐
##錘
##錚
##錠
##錢
##錦
##錨
##錫
##錮
##錯
##録
##錳
##錶
##鍊
##鍋
##鍍
##鍛
##鍥
##鍰
##鍵
##鍺
##鍾
##鎂
##鎊
##鎌
##鎏
##鎔
##鎖
##鎗
##鎚
##鎧
##鎬
##鎮
##鎳
##鏈
##鏖
##鏗
##鏘
##鏞
##鏟
##鏡
##鏢
##鏤
##鏽
##鐘
##鐮
##鐲
##鐳
##鐵
##鐸
##鐺
##鑄
##鑊
##鑑
##鑒
##鑣
##鑫
##鑰
##鑲
##鑼
##鑽
##鑾
##鑿
##针
##钉
##钊
##钎
##钏
##钒
##钓
##钗
##钙
##钛
##钜
##钝
##钞
##钟
##钠
##钡
##钢
##钣
##钤
##钥
##钦
##钧
##钨
##钩
##钮
##钯
##钰
##钱
##钳
##钴
##钵
##钺
##钻
##钼
##钾
##钿
##铀
##铁
##铂
##铃
##铄
##铅
##铆
##铉
##铎
##铐
##铛
##铜
##铝
##铠
##铡
##铢
##铣
##铤
##铨
##铩
##铬
##铭
##铮
##铰
##铲
##铵
##银
##铸
##铺
##链
##铿
##销
##锁
##锂
##锄
##锅
##锆
##锈
##锉
##锋
##锌
##锏
##锐
##锑
##错
##锚
##锟
##锡
##锢
##锣
##锤
##锥
##锦
##锭
##键
##锯
##锰
##锲
##锵
##锹
##锺
##锻
##镀
##镁
##镂
##镇
##镉
##镌
##镍
##镐
##镑
##镕
##镖
##镗
##镛
##镜
##镣
##镭
##镯
##镰
##镳
##镶
##長
##长
##門
##閃
##閉
##開
##閎
##閏
##閑
##閒
##間
##閔
##閘
##閡
##関
##閣
##閥
##閨
##閩
##閱
##閲
##閹
##閻
##閾
##闆
##闇
##闊
##闌
##闍
##闔
##闕
##闖
##闘
##關
##闡
##闢
##门
##闪
##闫
##闭
##问
##闯
##闰
##闲
##间
##闵
##闷
##闸
##闹
##闺
##闻
##闽
##闾
##阀
##阁
##阂
##阅
##阆
##阇
##阈
##阉
##阎
##阐
##阑
##阔
##阕
##阖
##阙
##阚
##阜
##队
##阡
##阪
##阮
##阱
##防
##阳
##阴
##阵
##阶
##阻
##阿
##陀
##陂
##附
##际
##陆
##陇
##陈
##陋
##陌
##降
##限
##陕
##陛
##陝
##陞
##陟
##陡
##院
##陣
##除
##陨
##险
##陪
##陰
##陲
##陳
##陵
##陶
##陷
##陸
##険
##陽
##隅
##隆
##隈
##隊
##隋
##隍
##階
##随
##隐
##隔
##隕
##隘
##隙
##際
##障
##隠
##隣
##隧
##隨
##險
##隱
##隴
##隶
##隸
##隻
##隼
##隽
##难
##雀
##雁
##雄
##雅
##集
##雇
##雉
##雋
##雌
##雍
##雎
##雏
##雑
##雒
##雕
##雖
##雙
##雛
##雜
##雞
##離
##難
##雨
##雪
##雯
##雰
##雲
##雳
##零
##雷
##雹
##電
##雾
##需
##霁
##霄
##霆
##震
##霈
##霉
##霊
##霍
##霎
##霏
##霑
##霓
##霖
##霜
##霞
##霧
##霭
##霰
##露
##霸
##霹
##霽
##霾
##靂
##靄
##靈
##青
##靓
##靖
##静
##靚
##靛
##靜
##非
##靠
##靡
##面
##靥
##靦
##革
##靳
##靴
##靶
##靼
##鞅
##鞋
##鞍
##鞏
##鞑
##鞘
##鞠
##鞣
##鞦
##鞭
##韆
##韋
##韌
##韓
##韜
##韦
##韧
##韩
##韬
##韭
##音
##韵
##韶
##韻
##響
##頁
##頂
##頃
##項
##順
##須
##頌
##預
##頑
##頒
##頓
##頗
##領
##頜
##頡
##頤
##頫
##頭
##頰
##頷
##頸
##頹
##頻
##頼
##顆
##題
##額
##顎
##顏
##顔
##願
##顛
##類
##顧
##顫
##顯
##顱
##顴
##页
##顶
##顷
##项
##顺
##须
##顼
##顽
##顾
##顿
##颁
##颂
##预
##颅
##领
##颇
##颈
##颉
##颊
##颌
##颍
##颐
##频
##颓
##颔
##颖
##颗
##题
##颚
##颛
##颜
##额
##颞
##颠
##颡
##颢
##颤
##颦
##颧
##風
##颯
##颱
##颳
##颶
##颼
##飄
##飆
##风
##飒
##飓
##飕
##飘
##飙
##飚
##飛
##飞
##食
##飢
##飨
##飩
##飪
##飯
##飲
##飼
##飽
##飾
##餃
##餅
##餉
##養
##餌
##餐
##餒
##餓
##餘
##餚
##餛
##餞
##餡
##館
##餮
##餵
##餾
##饅
##饈
##饋
##饌
##饍
##饑
##饒
##饕
##饗
##饞
##饥
##饨
##饪
##饬
##饭
##饮
##饯
##饰
##饱
##饲
##饴
##饵
##饶
##饷
##饺
##饼
##饽
##饿
##馀
##馁
##馄
##馅
##馆
##馈
##馋
##馍
##馏
##馒
##馔
##首
##馗
##香
##馥
##馨
##馬
##馭
##馮
##馳
##馴
##駁
##駄
##駅
##駆
##駐
##駒
##駕
##駛
##駝
##駭
##駱
##駿
##騁
##騎
##騏
##験
##騙
##騨
##騰
##騷
##驀
##驅
##驊
##驍
##驒
##驕
##驗
##驚
##驛
##驟
##驢
##驥
##马
##驭
##驮
##驯
##驰
##驱
##驳
##驴
##驶
##驷
##驸
##驹
##驻
##驼
##驾
##驿
##骁
##骂
##骄
##骅
##骆
##骇
##骈
##骊
##骋
##验
##骏
##骐
##骑
##骗
##骚
##骛
##骜
##骞
##骠
##骡
##骤
##骥
##骧
##骨
##骯
##骰
##骶
##骷
##骸
##骼
##髂
##髅
##髋
##髏
##髒
##髓
##體
##髖
##高
##髦
##髪
##髮
##髯
##髻
##鬃
##鬆
##鬍
##鬓
##鬚
##鬟
##鬢
##鬣
##鬥
##鬧
##鬱
##鬼
##魁
##魂
##魄
##魅
##魇
##魍
##魏
##魔
##魘
##魚
##魯
##魷
##鮑
##鮨
##鮪
##鮭
##鮮
##鯉
##鯊
##鯖
##鯛
##鯨
##鯰
##鯽
##鰍
##鰓
##鰭
##鰲
##鰻
##鰾
##鱈
##鱉
##鱔
##鱗
##鱷
##鱸
##鱼
##鱿
##鲁
##鲈
##鲍
##鲑
##鲛
##鲜
##鲟
##鲢
##鲤
##鲨
##鲫
##鲱
##鲲
##鲶
##鲷
##鲸
##鳃
##鳄
##鳅
##鳌
##鳍
##鳕
##鳖
##鳗
##鳝
##鳞
##鳥
##鳩
##鳳
##鳴
##鳶
##鴉
##鴕
##鴛
##鴦
##鴨
##鴻
##鴿
##鵑
##鵜
##鵝
##鵡
##鵬
##鵰
##鵲
##鶘
##鶩
##鶯
##鶴
##鷗
##鷲
##鷹
##鷺
##鸚
##鸞
##鸟
##鸠
##鸡
##鸢
##鸣
##鸥
##鸦
##鸨
##鸪
##鸭
##鸯
##鸳
##鸵
##鸽
##鸾
##鸿
##鹂
##鹃
##鹄
##鹅
##鹈
##鹉
##鹊
##鹌
##鹏
##鹑
##鹕
##鹘
##鹜
##鹞
##鹤
##鹦
##鹧
##鹫
##鹭
##鹰
##鹳
##鹵
##鹹
##鹼
##鹽
##鹿
##麂
##麋
##麒
##麓
##麗
##麝
##麟
##麥
##麦
##麩
##麴
##麵
##麸
##麺
##麻
##麼
##麽
##麾
##黃
##黄
##黍
##黎
##黏
##黑
##黒
##黔
##默
##黛
##黜
##黝
##點
##黠
##黨
##黯
##黴
##鼋
##鼎
##鼐
##鼓
##鼠
##鼬
##鼹
##鼻
##鼾
##齁
##齊
##齋
##齐
##齒
##齡
##齢
##齣
##齦
##齿
##龄
##龅
##龈
##龊
##龋
##龌
##龍
##龐
##龔
##龕
##龙
##龚
##龛
##龜
##龟
##︰
##︱
##︶
##︿
##﹁
##﹂
##﹍
##﹏
##﹐
##﹑
##﹒
##﹔
##﹕
##﹖
##﹗
##﹙
##﹚
##﹝
##﹞
##﹡
##﹣
##!
##"
###
##$
##%
##&
##'
##(
##)
##*
##,
##-
##.
##/
##:
##;
##<
##?
##@
##[
##\
##]
##^
##_
##`
##f
##h
##j
##u
##w
##z
##{
##}
##。
##「
##」
##、
##・
##ッ
##ー
##イ
##ク
##シ
##ス
##ト
##ノ
##フ
##ラ
##ル
##ン
##゙
##゚
## ̄
##¥
##👍
##🔥
##😂
##😎
================================================
FILE: args.py
================================================
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
file_path = os.path.dirname(__file__)
#模型目录
model_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
#config文件
config_name = os.path.join(file_path, 'albert_config/albert_config_tiny.json')
#ckpt文件名称
ckpt_name = os.path.join(model_dir, 'model.ckpt')
#输出文件目录
output_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
#vocab文件目录
vocab_file = os.path.join(file_path, 'albert_config/vocab.txt')
#数据目录
data_dir = os.path.join(file_path, 'data/')
num_train_epochs = 10
batch_size = 128
learning_rate = 0.00005
# gpu使用率
gpu_memory_fraction = 0.8
# 默认取倒数第二层的输出值作为句向量
layer_indexes = [-2]
# 序列的最大程度,单文本建议把该值调小
max_seq_len = 128
# graph名字
graph_file = os.path.join(file_path, 'albert_lcqmc_checkpoints/graph')
================================================
FILE: bert_utils.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import copy
import json
import math
import re
import six
import tensorflow as tf
def get_shape_list(tensor, expected_rank=None, name=None):
"""Returns a list of the shape of tensor, preferring static dimensions.
Args:
tensor: A tf.Tensor object to find the shape of.
expected_rank: (optional) int. The expected rank of `tensor`. If this is
specified and the `tensor` has a different rank, and exception will be
thrown.
name: Optional name of the tensor for the error message.
Returns:
A list of dimensions of the shape of tensor. All static dimensions will
be returned as python integers, and dynamic dimensions will be returned
as tf.Tensor scalars.
"""
if name is None:
name = tensor.name
if expected_rank is not None:
assert_rank(tensor, expected_rank, name)
shape = tensor.shape.as_list()
non_static_indexes = []
for (index, dim) in enumerate(shape):
if dim is None:
non_static_indexes.append(index)
if not non_static_indexes:
return shape
dyn_shape = tf.shape(tensor)
for index in non_static_indexes:
shape[index] = dyn_shape[index]
return shape
def reshape_to_matrix(input_tensor):
"""Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
ndims = input_tensor.shape.ndims
if ndims < 2:
raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
(input_tensor.shape))
if ndims == 2:
return input_tensor
width = input_tensor.shape[-1]
output_tensor = tf.reshape(input_tensor, [-1, width])
return output_tensor
def reshape_from_matrix(output_tensor, orig_shape_list):
"""Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
if len(orig_shape_list) == 2:
return output_tensor
output_shape = get_shape_list(output_tensor)
orig_dims = orig_shape_list[0:-1]
width = output_shape[-1]
return tf.reshape(output_tensor, orig_dims + [width])
def assert_rank(tensor, expected_rank, name=None):
"""Raises an exception if the tensor rank is not of the expected rank.
Args:
tensor: A tf.Tensor to check the rank of.
expected_rank: Python integer or list of integers, expected rank.
name: Optional name of the tensor for the error message.
Raises:
ValueError: If the expected shape doesn't match the actual shape.
"""
if name is None:
name = tensor.name
expected_rank_dict = {}
if isinstance(expected_rank, six.integer_types):
expected_rank_dict[expected_rank] = True
else:
for x in expected_rank:
expected_rank_dict[x] = True
actual_rank = tensor.shape.ndims
if actual_rank not in expected_rank_dict:
scope_name = tf.get_variable_scope().name
raise ValueError(
"For the tensor `%s` in scope `%s`, the actual rank "
"`%d` (shape = %s) is not equal to the expected rank `%s`" %
(name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
def gather_indexes(sequence_tensor, positions):
"""Gathers the vectors at the specific positions over a minibatch."""
sequence_shape = get_shape_list(sequence_tensor, expected_rank=3)
batch_size = sequence_shape[0]
seq_length = sequence_shape[1]
width = sequence_shape[2]
flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
flat_positions = tf.reshape(positions + flat_offsets, [-1])
flat_sequence_tensor = tf.reshape(sequence_tensor,
[batch_size * seq_length, width])
output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
return output_tensor
# add sequence mask for:
# 1. random shuffle lm modeling---xlnet with random shuffled input
# 2. left2right and right2left language modeling
# 3. conditional generation
def generate_seq2seq_mask(attention_mask, mask_sequence, seq_type, **kargs):
if seq_type == 'seq2seq':
if mask_sequence is not None:
seq_shape = get_shape_list(mask_sequence, expected_rank=2)
seq_len = seq_shape[1]
ones = tf.ones((1, seq_len, seq_len))
a_mask = tf.matrix_band_part(ones, -1, 0)
s_ex12 = tf.expand_dims(tf.expand_dims(mask_sequence, 1), 2)
s_ex13 = tf.expand_dims(tf.expand_dims(mask_sequence, 1), 3)
a_mask = (1 - s_ex13) * (1 - s_ex12) + s_ex13 * a_mask
# generate mask of batch x seq_len x seq_len
a_mask = tf.reshape(a_mask, (-1, seq_len, seq_len))
out_mask = attention_mask * a_mask
else:
ones = tf.ones_like(attention_mask[:1])
mask = (tf.matrix_band_part(ones, -1, 0))
out_mask = attention_mask * mask
else:
out_mask = attention_mask
return out_mask
================================================
FILE: classifier_utils.py
================================================
# -*- coding: utf-8 -*-
# @Author: bo.shi
# @Date: 2019-12-01 22:28:41
# @Last Modified by: bo.shi
# @Last Modified time: 2019-12-02 18:36:50
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utility functions for GLUE classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import csv
import os
import six
import tensorflow as tf
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
battches could cause silent errors.
"""
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, delimiter="\t", quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
@classmethod
def _read_txt(cls, input_file):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = f.readlines()
lines = []
for line in reader:
lines.append(line.strip().split("_!_"))
return lines
@classmethod
def _read_json(cls, input_file):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = f.readlines()
lines = []
for line in reader:
lines.append(json.loads(line.strip()))
return lines
class XnliProcessor(DataProcessor):
"""Processor for the XNLI data set."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def _create_examples(self, lines, set_type):
"""See base class."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(line['premise'])
text_b = convert_to_unicode(line['hypo'])
label = convert_to_unicode(line['label']) if set_type != 'test' else 'contradiction'
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
# class TnewsProcessor(DataProcessor):
# """Processor for the MRPC data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "toutiao_category_train.txt")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "toutiao_category_dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "toutiao_category_test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# labels = []
# for i in range(17):
# if i == 5 or i == 11:
# continue
# labels.append(str(100 + i))
# return labels
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# text_a = convert_to_unicode(line[3])
# text_b = None
# label = convert_to_unicode(line[1])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
class TnewsProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
labels = []
for i in range(17):
if i == 5 or i == 11:
continue
labels.append(str(100 + i))
return labels
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(line['sentence'])
text_b = None
label = convert_to_unicode(line['label']) if set_type != 'test' else "100"
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
# class iFLYTEKDataProcessor(DataProcessor):
# """Processor for the iFLYTEKData data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "train.txt")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# labels = []
# for i in range(119):
# labels.append(str(i))
# return labels
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# text_a = convert_to_unicode(line[1])
# text_b = None
# label = convert_to_unicode(line[0])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
class iFLYTEKDataProcessor(DataProcessor):
"""Processor for the iFLYTEKData data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
labels = []
for i in range(119):
labels.append(str(i))
return labels
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(line['sentence'])
text_b = None
label = convert_to_unicode(line['label']) if set_type != 'test' else "0"
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class AFQMCProcessor(DataProcessor):
"""Processor for the internal data set. sentence pair classification"""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(line['sentence1'])
text_b = convert_to_unicode(line['sentence2'])
label = convert_to_unicode(line['label']) if set_type != 'test' else '0'
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class CMNLIProcessor(DataProcessor):
"""Processor for the CMNLI data set."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples_json(os.path.join(data_dir, "train.json"), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples_json(os.path.join(data_dir, "dev.json"), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples_json(os.path.join(data_dir, "test.json"), "test")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples_json(self, file_name, set_type):
"""Creates examples for the training and dev sets."""
examples = []
lines = tf.gfile.Open(file_name, "r")
index = 0
for line in lines:
line_obj = json.loads(line)
index = index + 1
guid = "%s-%s" % (set_type, index)
text_a = convert_to_unicode(line_obj["sentence1"])
text_b = convert_to_unicode(line_obj["sentence2"])
label = convert_to_unicode(line_obj["label"]) if set_type != 'test' else 'neutral'
if label != "-":
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class CslProcessor(DataProcessor):
"""Processor for the CSL data set."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(" ".join(line['keyword']))
text_b = convert_to_unicode(line['abst'])
label = convert_to_unicode(line['label']) if set_type != 'test' else '0'
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
# class InewsProcessor(DataProcessor):
# """Processor for the MRPC data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "train.txt")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# labels = ["0", "1", "2"]
# return labels
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# text_a = convert_to_unicode(line[2])
# text_b = convert_to_unicode(line[3])
# label = convert_to_unicode(line[0]) if set_type != "test" else '0'
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
#
#
# class THUCNewsProcessor(DataProcessor):
# """Processor for the THUCNews data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "train.txt")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_txt(os.path.join(data_dir, "test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# labels = []
# for i in range(14):
# labels.append(str(i))
# return labels
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0 or len(line) < 3:
# continue
# guid = "%s-%s" % (set_type, i)
# text_a = convert_to_unicode(line[3])
# text_b = None
# label = convert_to_unicode(line[0])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
#
# class LCQMCProcessor(DataProcessor):
# """Processor for the internal data set. sentence pair classification"""
#
# def __init__(self):
# self.language = "zh"
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
# # dev_0827.tsv
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# return ["0", "1"]
# # return ["-1","0", "1"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# print("length of lines:", len(lines))
# for (i, line) in enumerate(lines):
# # print('#i:',i,line)
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# try:
# label = convert_to_unicode(line[2])
# text_a = convert_to_unicode(line[0])
# text_b = convert_to_unicode(line[1])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# except Exception:
# print('###error.i:', i, line)
# return examples
#
#
# class JDCOMMENTProcessor(DataProcessor):
# """Processor for the internal data set. sentence pair classification"""
#
# def __init__(self):
# self.language = "zh"
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "jd_train.csv"), ",", "\""), "train")
# # dev_0827.tsv
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "jd_dev.csv"), ",", "\""), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "jd_test.csv"), ",", "\""), "test")
#
# def get_labels(self):
# """See base class."""
# return ["1", "2", "3", "4", "5"]
# # return ["-1","0", "1"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# print("length of lines:", len(lines))
# for (i, line) in enumerate(lines):
# # print('#i:',i,line)
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# try:
# label = convert_to_unicode(line[0])
# text_a = convert_to_unicode(line[1])
# text_b = convert_to_unicode(line[2])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# except Exception:
# print('###error.i:', i, line)
# return examples
#
#
# class BQProcessor(DataProcessor):
# """Processor for the internal data set. sentence pair classification"""
#
# def __init__(self):
# self.language = "zh"
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
# # dev_0827.tsv
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "test.txt")), "test")
#
# def get_labels(self):
# """See base class."""
# return ["0", "1"]
# # return ["-1","0", "1"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# print("length of lines:", len(lines))
# for (i, line) in enumerate(lines):
# # print('#i:',i,line)
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# try:
# label = convert_to_unicode(line[2])
# text_a = convert_to_unicode(line[0])
# text_b = convert_to_unicode(line[1])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# except Exception:
# print('###error.i:', i, line)
# return examples
#
#
# class MnliProcessor(DataProcessor):
# """Processor for the MultiNLI data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
# "dev_matched")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
#
# def get_labels(self):
# """See base class."""
# return ["contradiction", "entailment", "neutral"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, convert_to_unicode(line[0]))
# text_a = convert_to_unicode(line[8])
# text_b = convert_to_unicode(line[9])
# if set_type == "test":
# label = "contradiction"
# else:
# label = convert_to_unicode(line[-1])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
#
#
# class MrpcProcessor(DataProcessor):
# """Processor for the MRPC data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
#
# def get_labels(self):
# """See base class."""
# return ["0", "1"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# if i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# text_a = convert_to_unicode(line[3])
# text_b = convert_to_unicode(line[4])
# if set_type == "test":
# label = "0"
# else:
# label = convert_to_unicode(line[0])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
# return examples
#
#
# class ColaProcessor(DataProcessor):
# """Processor for the CoLA data set (GLUE version)."""
#
# def get_train_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
#
# def get_dev_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
#
# def get_test_examples(self, data_dir):
# """See base class."""
# return self._create_examples(
# self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
#
# def get_labels(self):
# """See base class."""
# return ["0", "1"]
#
# def _create_examples(self, lines, set_type):
# """Creates examples for the training and dev sets."""
# examples = []
# for (i, line) in enumerate(lines):
# # Only the test set has a header
# if set_type == "test" and i == 0:
# continue
# guid = "%s-%s" % (set_type, i)
# if set_type == "test":
# text_a = convert_to_unicode(line[1])
# label = "0"
# else:
# text_a = convert_to_unicode(line[3])
# label = convert_to_unicode(line[1])
# examples.append(
# InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
# return examples
class WSCProcessor(DataProcessor):
"""Processor for the internal data set. sentence pair classification"""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
return ["true", "false"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = convert_to_unicode(line['text'])
text_a_list = list(text_a)
target = line['target']
query = target['span1_text']
query_idx = target['span1_index']
pronoun = target['span2_text']
pronoun_idx = target['span2_index']
assert text_a[pronoun_idx: (pronoun_idx + len(pronoun))
] == pronoun, "pronoun: {}".format(pronoun)
assert text_a[query_idx: (query_idx + len(query))] == query, "query: {}".format(query)
if pronoun_idx > query_idx:
text_a_list.insert(query_idx, "_")
text_a_list.insert(query_idx + len(query) + 1, "_")
text_a_list.insert(pronoun_idx + 2, "[")
text_a_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
else:
text_a_list.insert(pronoun_idx, "[")
text_a_list.insert(pronoun_idx + len(pronoun) + 1, "]")
text_a_list.insert(query_idx + 2, "_")
text_a_list.insert(query_idx + len(query) + 2 + 1, "_")
text_a = "".join(text_a_list)
if set_type == "test":
label = "true"
else:
label = line['label']
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class COPAProcessor(DataProcessor):
"""Processor for the internal data set. sentence pair classification"""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "train.json")), "train")
# dev_0827.tsv
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "dev.json")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_json(os.path.join(data_dir, "test.json")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
@classmethod
def _create_examples_one(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
guid1 = "%s-%s" % (set_type, i)
# try:
if line['question'] == 'cause':
text_a = convert_to_unicode(line['premise'] + '原因是什么呢?' + line['choice0'])
text_b = convert_to_unicode(line['premise'] + '原因是什么呢?' + line['choice1'])
else:
text_a = convert_to_unicode(line['premise'] + '造成了什么影响呢?' + line['choice0'])
text_b = convert_to_unicode(line['premise'] + '造成了什么影响呢?' + line['choice1'])
label = convert_to_unicode(str(1 if line['label'] == 0 else 0)) if set_type != 'test' else '0'
examples.append(
InputExample(guid=guid1, text_a=text_a, text_b=text_b, label=label))
# except Exception as e:
# print('###error.i:',e, i, line)
return examples
@classmethod
def _create_examples(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
i = 2 * i
guid1 = "%s-%s" % (set_type, i)
guid2 = "%s-%s" % (set_type, i + 1)
# try:
premise = convert_to_unicode(line['premise'])
choice0 = convert_to_unicode(line['choice0'])
label = convert_to_unicode(str(1 if line['label'] == 0 else 0)) if set_type != 'test' else '0'
#text_a2 = convert_to_unicode(line['premise'])
choice1 = convert_to_unicode(line['choice1'])
label2 = convert_to_unicode(
str(0 if line['label'] == 0 else 1)) if set_type != 'test' else '0'
if line['question'] == 'effect':
text_a = premise
text_b = choice0
text_a2 = premise
text_b2 = choice1
elif line['question'] == 'cause':
text_a = choice0
text_b = premise
text_a2 = choice1
text_b2 = premise
else:
print('wrong format!!')
return None
examples.append(
InputExample(guid=guid1, text_a=text_a, text_b=text_b, label=label))
examples.append(
InputExample(guid=guid2, text_a=text_a2, text_b=text_b2, label=label2))
# except Exception as e:
# print('###error.i:',e, i, line)
return examples
================================================
FILE: create_pretrain_data.sh
================================================
#!/usr/bin/env bash
BERT_BASE_DIR=./albert_config
python3 create_pretraining_data.py --do_whole_word_mask=True --input_file=data/news_zh_1.txt \
--output_file=data/tf_news_2016_zh_raw_news2016zh_1.tfrecord --vocab_file=$BERT_BASE_DIR/vocab.txt --do_lower_case=True \
--max_seq_length=512 --max_predictions_per_seq=51 --masked_lm_prob=0.10
================================================
FILE: create_pretraining_data.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Create masked LM/next sentence masked_lm TF examples for BERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import random
import tokenization
import tensorflow as tf
import jieba
import re
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string("input_file", None,
"Input raw text file (or comma-separated list of files).")
flags.DEFINE_string(
"output_file", None,
"Output TF example file (or comma-separated list of files).")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_whole_word_mask", False,
"Whether to use whole word masking rather than per-WordPiece masking.")
flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")
flags.DEFINE_integer("max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence.")
flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
flags.DEFINE_integer(
"dupe_factor", 10,
"Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")
flags.DEFINE_float(
"short_seq_prob", 0.1,
"Probability of creating sequences which are shorter than the "
"maximum length.")
flags.DEFINE_bool("non_chinese", False,"manually set this to True if you are not doing chinese pre-train task.")
class TrainingInstance(object):
"""A single training instance (sentence pair)."""
def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
is_random_next):
self.tokens = tokens
self.segment_ids = segment_ids
self.is_random_next = is_random_next
self.masked_lm_positions = masked_lm_positions
self.masked_lm_labels = masked_lm_labels
def __str__(self):
s = ""
s += "tokens: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.tokens]))
s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
s += "is_random_next: %s\n" % self.is_random_next
s += "masked_lm_positions: %s\n" % (" ".join(
[str(x) for x in self.masked_lm_positions]))
s += "masked_lm_labels: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.masked_lm_labels]))
s += "\n"
return s
def __repr__(self):
return self.__str__()
def write_instance_to_example_files(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_files):
"""Create TF example files from `TrainingInstance`s."""
writers = []
for output_file in output_files:
writers.append(tf.python_io.TFRecordWriter(output_file))
writer_index = 0
total_written = 0
for (inst_index, instance) in enumerate(instances):
input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
assert len(input_ids) <= max_seq_length
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
while len(masked_lm_positions) < max_predictions_per_seq:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
next_sentence_label = 1 if instance.is_random_next else 0
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(input_ids)
features["input_mask"] = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
features["next_sentence_labels"] = create_int_feature([next_sentence_label])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writers[writer_index].write(tf_example.SerializeToString())
writer_index = (writer_index + 1) % len(writers)
total_written += 1
if inst_index < 20:
tf.logging.info("*** Example ***")
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in instance.tokens]))
for feature_name in features.keys():
feature = features[feature_name]
values = []
if feature.int64_list.value:
values = feature.int64_list.value
elif feature.float_list.value:
values = feature.float_list.value
tf.logging.info(
"%s: %s" % (feature_name, " ".join([str(x) for x in values])))
for writer in writers:
writer.close()
tf.logging.info("Wrote %d total instances", total_written)
def create_int_feature(values):
feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return feature
def create_float_feature(values):
feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
return feature
def create_training_instances(input_files, tokenizer, max_seq_length,
dupe_factor, short_seq_prob, masked_lm_prob,
max_predictions_per_seq, rng):
"""Create `TrainingInstance`s from raw text."""
all_documents = [[]]
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
for input_file in input_files:
with tf.gfile.GFile(input_file, "r") as reader:
while True:
strings=reader.readline()
strings=strings.replace(" "," ").replace(" "," ") # 如果有两个或三个空格,替换为一个空格
line = tokenization.convert_to_unicode(strings)
if not line:
break
line = line.strip()
# Empty lines are used as document delimiters
if not line:
all_documents.append([])
tokens = tokenizer.tokenize(line)
if tokens:
all_documents[-1].append(tokens)
# Remove empty documents
all_documents = [x for x in all_documents if x]
rng.shuffle(all_documents)
vocab_words = list(tokenizer.vocab.keys())
instances = []
for _ in range(dupe_factor):
for document_index in range(len(all_documents)):
instances.extend(
create_instances_from_document_albert( # change to albert style for sentence order prediction(SOP), 2019-08-28, brightmart
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
rng.shuffle(instances)
return instances
def get_new_segment(segment): # 新增的方法 ####
"""
输入一句话,返回一句经过处理的话: 为了支持中文全称mask,将被分开的词,将上特殊标记("#"),使得后续处理模块,能够知道哪些字是属于同一个词的。
:param segment: 一句话. e.g. ['悬', '灸', '技', '术', '培', '训', '专', '家', '教', '你', '艾', '灸', '降', '血', '糖', ',', '为', '爸', '妈', '收', '好', '了', '!']
:return: 一句处理过的话 e.g. ['悬', '##灸', '技', '术', '培', '训', '专', '##家', '教', '你', '艾', '##灸', '降', '##血', '##糖', ',', '为', '爸', '##妈', '收', '##好', '了', '!']
"""
seq_cws = jieba.lcut("".join(segment)) # 分词
seq_cws_dict = {x: 1 for x in seq_cws} # 分词后的词加入到词典dict
new_segment = []
i = 0
while i < len(segment): # 从句子的第一个字开始处理,知道处理完整个句子
if len(re.findall('[\u4E00-\u9FA5]', segment[i])) == 0: # 如果找不到中文的,原文加进去即不用特殊处理。
new_segment.append(segment[i])
i += 1
continue
has_add = False
for length in range(3, 0, -1):
if i + length > len(segment):
continue
if ''.join(segment[i:i + length]) in seq_cws_dict:
new_segment.append(segment[i])
for l in range(1, length):
new_segment.append('##' + segment[i + l])
i += length
has_add = True
break
if not has_add:
new_segment.append(segment[i])
i += 1
# print("get_new_segment.wwm.get_new_segment:",new_segment)
return new_segment
def create_instances_from_document_albert(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document.
This method is changed to create sentence-order prediction (SOP) followed by idea from paper of ALBERT, 2019-08-28, brightmart
"""
document = all_documents[document_index] # 得到一个文档
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob: # 有一定的比例,如10%的概率,我们使用比较短的序列长度,以缓解预训练的长序列和调优阶段(可能的)短序列的不一致情况
target_seq_length = rng.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
# 设法使用实际的句子,而不是任意的截断句子,从而更好的构造句子连贯性预测的任务
instances = []
current_chunk = [] # 当前处理的文本段,包含多个句子
current_length = 0
i = 0
# print("###document:",document) # 一个document可以是一整篇文章、新闻、词条等. document:[['是', '爷', '们', ',', '就', '得', '给', '媳', '妇', '幸', '福'], ['关', '注', '【', '晨', '曦', '教', '育', '】', ',', '获', '取', '育', '儿', '的', '智', '慧', ',', '与', '孩', '子', '一', '同', '成', '长', '!'], ['方', '法', ':', '打', '开', '微', '信', '→', '添', '加', '朋', '友', '→', '搜', '号', '→', '##he', '##bc', '##x', '##jy', '##→', '关', '注', '!', '我', '是', '一', '个', '爷', '们', ',', '孝', '顺', '是', '做', '人', '的', '第', '一', '准', '则', '。'], ['甭', '管', '小', '时', '候', '怎', '么', '跟', '家', '长', '犯', '混', '蛋', ',', '长', '大', '了', ',', '就', '底', '报', '答', '父', '母', ',', '以', '后', '我', '媳', '妇', '也', '必', '须', '孝', '顺', '。'], ['我', '是', '一', '个', '爷', '们', ',', '可', '以', '花', '心', ',', '可', '以', '好', '玩', '。'], ['但', '我', '一', '定', '会', '找', '一', '个', '管', '的', '住', '我', '的', '女', '人', ',', '和', '我', '一', '起', '生', '活', '。'], ['28', '岁', '以', '前', '在', '怎', '么', '玩', '都', '行', ',', '但', '我', '最', '后', '一', '定', '会', '找', '一', '个', '勤', '俭', '持', '家', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '我', '不', '会', '让', '自', '己', '的', '女', '人', '受', '一', '点', '委', '屈', ',', '每', '次', '把', '她', '抱', '在', '怀', '里', ',', '看', '她', '洋', '溢', '着', '幸', '福', '的', '脸', ',', '我', '都', '会', '引', '以', '为', '傲', ',', '这', '特', '么', '就', '是', '我', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '干', '什', '么', '也', '不', '能', '忘', '了', '自', '己', '媳', '妇', ',', '就', '算', '和', '哥', '们', '一', '起', '喝', '酒', ',', '喝', '到', '很', '晚', ',', '也', '要', '提', '前', '打', '电', '话', '告', '诉', '她', ',', '让', '她', '早', '点', '休', '息', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '绝', '对', '不', '能', '抽', '烟', ',', '喝', '酒', '还', '勉', '强', '过', '得', '去', ',', '不', '过', '该', '喝', '的', '时', '候', '喝', ',', '不', '该', '喝', '的', '时', '候', ',', '少', '扯', '纳', '极', '薄', '蛋', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '必', '须', '听', '我', '话', ',', '在', '人', '前', '一', '定', '要', '给', '我', '面', '子', ',', '回', '家', '了', '咱', '什', '么', '都', '好', '说', '。'], ['我', '是', '一', '爷', '们', ',', '就', '算', '难', '的', '吃', '不', '上', '饭', '了', ',', '都', '不', '张', '口', '跟', '媳', '妇', '要', '一', '分', '钱', '。'], ['我', '是', '一', '爷', '们', ',', '不', '管', '上', '学', '还', '是', '上', '班', ',', '我', '都', '会', '送', '媳', '妇', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '交', '往', '不', '到', '1', '年', ',', '绝', '对', '不', '会', '和', '媳', '妇', '提', '过', '分', '的', '要', '求', ',', '我', '会', '尊', '重', '她', '。'], ['我', '是', '一', '爷', '们', ',', '游', '戏', '永', '远', '比', '不', '上', '我', '媳', '妇', '重', '要', ',', '只', '要', '媳', '妇', '发', '话', ',', '我', '绝', '对', '唯', '命', '是', '从', '。'], ['我', '是', '一', '爷', '们', ',', '上', 'q', '绝', '对', '是', '为', '了', '等', '媳', '妇', ',', '所', '有', '暧', '昧', '的', '心', '情', '只', '为', '她', '一', '个', '女', '人', '而', '写', ',', '我', '不', '一', '定', '会', '经', '常', '写', '日', '志', ',', '可', '是', '我', '会', '告', '诉', '全', '世', '界', ',', '我', '很', '爱', '她', '。'], ['我', '是', '一', '爷', '们', ',', '不', '一', '定', '要', '经', '常', '制', '造', '浪', '漫', '、', '偶', '尔', '过', '个', '节', '日', '也', '要', '送', '束', '玫', '瑰', '花', '给', '媳', '妇', '抱', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '手', '机', '会', '24', '小', '时', '为', '她', '开', '机', ',', '让', '她', '半', '夜', '痛', '经', '的', '时', '候', ',', '做', '恶', '梦', '的', '时', '候', ',', '随', '时', '可', '以', '联', '系', '到', '我', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '经', '常', '带', '媳', '妇', '出', '去', '玩', ',', '她', '不', '一', '定', '要', '和', '我', '所', '有', '的', '哥', '们', '都', '认', '识', ',', '但', '见', '面', '能', '说', '的', '上', '话', '就', '行', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '和', '媳', '妇', '的', '姐', '妹', '哥', '们', '搞', '好', '关', '系', ',', '让', '她', '们', '相', '信', '我', '一', '定', '可', '以', '给', '我', '媳', '妇', '幸', '福', '。'], ['我', '是', '一', '爷', '们', ',', '吵', '架', '后', '、', '也', '要', '主', '动', '打', '电', '话', '关', '心', '她', ',', '咱', '是', '一', '爷', '们', ',', '给', '媳', '妇', '服', '个', '软', ',', '道', '个', '歉', '怎', '么', '了', '?'], ['我', '是', '一', '爷', '们', ',', '绝', '对', '不', '会', '嫌', '弃', '自', '己', '媳', '妇', ',', '拿', '她', '和', '别', '人', '比', ',', '说', '她', '这', '不', '如', '人', '家', ',', '纳', '不', '如', '人', '家', '的', '。'], ['我', '是', '一', '爷', '们', ',', '陪', '媳', '妇', '逛', '街', '时', ',', '碰', '见', '熟', '人', ',', '无', '论', '我', '媳', '妇', '长', '的', '好', '看', '与', '否', ',', '我', '都', '会', '大', '方', '的', '介', '绍', '。'], ['谁', '让', '咱', '爷', '们', '就', '好', '这', '口', '呢', '。'], ['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。'], ['【', '我', '们', '重', '在', '分', '享', '。'], ['所', '有', '文', '字', '和', '美', '图', ',', '来', '自', '网', '络', ',', '晨', '欣', '教', '育', '整', '理', '。'], ['对', '原', '文', '作', '者', ',', '表', '示', '敬', '意', '。'], ['】', '关', '注', '晨', '曦', '教', '育', '[UNK]', '[UNK]', '晨', '曦', '教', '育', '(', '微', '信', '号', ':', 'he', '##bc', '##x', '##jy', ')', '。'], ['打', '开', '微', '信', ',', '扫', '描', '二', '维', '码', ',', '关', '注', '[UNK]', '晨', '曦', '教', '育', '[UNK]', ',', '获', '取', '更', '多', '育', '儿', '资', '源', '。'], ['点', '击', '下', '面', '订', '阅', '按', '钮', '订', '阅', ',', '会', '有', '更', '多', '惊', '喜', '哦', '!']]
while i < len(document): # 从文档的第一个位置开始,按个往下看
segment = document[i] # segment是列表,代表的是按字分开的一个完整句子,如 segment=['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。']
if FLAGS.non_chinese==False: # if non chinese is False, that means it is chinese, then do something to make chinese whole word mask works.
segment = get_new_segment(segment) # whole word mask for chinese: 结合分词的中文的whole mask设置即在需要的地方加上“##”
current_chunk.append(segment) # 将一个独立的句子加入到当前的文本块中
current_length += len(segment) # 累计到为止位置接触到句子的总长度
if i == len(document) - 1 or current_length >= target_seq_length:
# 如果累计的序列长度达到了目标的长度,或当前走到了文档结尾==>构造并添加到“A[SEP]B“中的A和B中;
if current_chunk: # 如果当前块不为空
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2: # 当前块,如果包含超过两个句子,取当前块的一部分作为“A[SEP]B“中的A部分
a_end = rng.randint(1, len(current_chunk) - 1)
# 将当前文本段中选取出来的前半部分,赋值给A即tokens_a
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
# 构造“A[SEP]B“中的B部分(有一部分是正常的当前文档中的后半部;在原BERT的实现中一部分是随机的从另一个文档中选取的,)
tokens_b = []
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
# 有百分之50%的概率交换一下tokens_a和tokens_b的位置
# print("tokens_a length1:",len(tokens_a))
# print("tokens_b length1:",len(tokens_b)) # len(tokens_b) = 0
if len(tokens_a) == 0 or len(tokens_b) == 0: i += 1; continue
if rng.random() < 0.5: # 交换一下tokens_a和tokens_b
is_random_next=True
temp=tokens_a
tokens_a=tokens_b
tokens_b=temp
else:
is_random_next=False
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
# 把tokens_a & tokens_b加入到按照bert的风格,即以[CLS]tokens_a[SEP]tokens_b[SEP]的形式,结合到一起,作为最终的tokens; 也带上segment_ids,前面部分segment_ids的值是0,后面部分的值是1.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
# 创建masked LM的任务的数据 Creates the predictions for the masked LM objective
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance( # 创建训练实例的对象
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = [] # 清空当前块
current_length = 0 # 重置当前文本块的长度
i += 1 # 接着文档中的内容往后看
return instances
def create_instances_from_document_original( # THIS IS ORIGINAL BERT STYLE FOR CREATE DATA OF MLM AND NEXT SENTENCE PREDICTION TASK
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index] # 得到一个文档
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob: # 有一定的比例,如10%的概率,我们使用比较短的序列长度,以缓解预训练的长序列和调优阶段(可能的)短序列的不一致情况
target_seq_length = rng.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
# 设法使用实际的句子,而不是任意的截断句子,从而更好的构造句子连贯性预测的任务
instances = []
current_chunk = [] # 当前处理的文本段,包含多个句子
current_length = 0
i = 0
# print("###document:",document) # 一个document可以是一整篇文章、新闻、一个词条等. document:[['是', '爷', '们', ',', '就', '得', '给', '媳', '妇', '幸', '福'], ['关', '注', '【', '晨', '曦', '教', '育', '】', ',', '获', '取', '育', '儿', '的', '智', '慧', ',', '与', '孩', '子', '一', '同', '成', '长', '!'], ['方', '法', ':', '打', '开', '微', '信', '→', '添', '加', '朋', '友', '→', '搜', '号', '→', '##he', '##bc', '##x', '##jy', '##→', '关', '注', '!', '我', '是', '一', '个', '爷', '们', ',', '孝', '顺', '是', '做', '人', '的', '第', '一', '准', '则', '。'], ['甭', '管', '小', '时', '候', '怎', '么', '跟', '家', '长', '犯', '混', '蛋', ',', '长', '大', '了', ',', '就', '底', '报', '答', '父', '母', ',', '以', '后', '我', '媳', '妇', '也', '必', '须', '孝', '顺', '。'], ['我', '是', '一', '个', '爷', '们', ',', '可', '以', '花', '心', ',', '可', '以', '好', '玩', '。'], ['但', '我', '一', '定', '会', '找', '一', '个', '管', '的', '住', '我', '的', '女', '人', ',', '和', '我', '一', '起', '生', '活', '。'], ['28', '岁', '以', '前', '在', '怎', '么', '玩', '都', '行', ',', '但', '我', '最', '后', '一', '定', '会', '找', '一', '个', '勤', '俭', '持', '家', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '我', '不', '会', '让', '自', '己', '的', '女', '人', '受', '一', '点', '委', '屈', ',', '每', '次', '把', '她', '抱', '在', '怀', '里', ',', '看', '她', '洋', '溢', '着', '幸', '福', '的', '脸', ',', '我', '都', '会', '引', '以', '为', '傲', ',', '这', '特', '么', '就', '是', '我', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '干', '什', '么', '也', '不', '能', '忘', '了', '自', '己', '媳', '妇', ',', '就', '算', '和', '哥', '们', '一', '起', '喝', '酒', ',', '喝', '到', '很', '晚', ',', '也', '要', '提', '前', '打', '电', '话', '告', '诉', '她', ',', '让', '她', '早', '点', '休', '息', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '绝', '对', '不', '能', '抽', '烟', ',', '喝', '酒', '还', '勉', '强', '过', '得', '去', ',', '不', '过', '该', '喝', '的', '时', '候', '喝', ',', '不', '该', '喝', '的', '时', '候', ',', '少', '扯', '纳', '极', '薄', '蛋', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '必', '须', '听', '我', '话', ',', '在', '人', '前', '一', '定', '要', '给', '我', '面', '子', ',', '回', '家', '了', '咱', '什', '么', '都', '好', '说', '。'], ['我', '是', '一', '爷', '们', ',', '就', '算', '难', '的', '吃', '不', '上', '饭', '了', ',', '都', '不', '张', '口', '跟', '媳', '妇', '要', '一', '分', '钱', '。'], ['我', '是', '一', '爷', '们', ',', '不', '管', '上', '学', '还', '是', '上', '班', ',', '我', '都', '会', '送', '媳', '妇', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '交', '往', '不', '到', '1', '年', ',', '绝', '对', '不', '会', '和', '媳', '妇', '提', '过', '分', '的', '要', '求', ',', '我', '会', '尊', '重', '她', '。'], ['我', '是', '一', '爷', '们', ',', '游', '戏', '永', '远', '比', '不', '上', '我', '媳', '妇', '重', '要', ',', '只', '要', '媳', '妇', '发', '话', ',', '我', '绝', '对', '唯', '命', '是', '从', '。'], ['我', '是', '一', '爷', '们', ',', '上', 'q', '绝', '对', '是', '为', '了', '等', '媳', '妇', ',', '所', '有', '暧', '昧', '的', '心', '情', '只', '为', '她', '一', '个', '女', '人', '而', '写', ',', '我', '不', '一', '定', '会', '经', '常', '写', '日', '志', ',', '可', '是', '我', '会', '告', '诉', '全', '世', '界', ',', '我', '很', '爱', '她', '。'], ['我', '是', '一', '爷', '们', ',', '不', '一', '定', '要', '经', '常', '制', '造', '浪', '漫', '、', '偶', '尔', '过', '个', '节', '日', '也', '要', '送', '束', '玫', '瑰', '花', '给', '媳', '妇', '抱', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '手', '机', '会', '24', '小', '时', '为', '她', '开', '机', ',', '让', '她', '半', '夜', '痛', '经', '的', '时', '候', ',', '做', '恶', '梦', '的', '时', '候', ',', '随', '时', '可', '以', '联', '系', '到', '我', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '经', '常', '带', '媳', '妇', '出', '去', '玩', ',', '她', '不', '一', '定', '要', '和', '我', '所', '有', '的', '哥', '们', '都', '认', '识', ',', '但', '见', '面', '能', '说', '的', '上', '话', '就', '行', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '和', '媳', '妇', '的', '姐', '妹', '哥', '们', '搞', '好', '关', '系', ',', '让', '她', '们', '相', '信', '我', '一', '定', '可', '以', '给', '我', '媳', '妇', '幸', '福', '。'], ['我', '是', '一', '爷', '们', ',', '吵', '架', '后', '、', '也', '要', '主', '动', '打', '电', '话', '关', '心', '她', ',', '咱', '是', '一', '爷', '们', ',', '给', '媳', '妇', '服', '个', '软', ',', '道', '个', '歉', '怎', '么', '了', '?'], ['我', '是', '一', '爷', '们', ',', '绝', '对', '不', '会', '嫌', '弃', '自', '己', '媳', '妇', ',', '拿', '她', '和', '别', '人', '比', ',', '说', '她', '这', '不', '如', '人', '家', ',', '纳', '不', '如', '人', '家', '的', '。'], ['我', '是', '一', '爷', '们', ',', '陪', '媳', '妇', '逛', '街', '时', ',', '碰', '见', '熟', '人', ',', '无', '论', '我', '媳', '妇', '长', '的', '好', '看', '与', '否', ',', '我', '都', '会', '大', '方', '的', '介', '绍', '。'], ['谁', '让', '咱', '爷', '们', '就', '好', '这', '口', '呢', '。'], ['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。'], ['【', '我', '们', '重', '在', '分', '享', '。'], ['所', '有', '文', '字', '和', '美', '图', ',', '来', '自', '网', '络', ',', '晨', '欣', '教', '育', '整', '理', '。'], ['对', '原', '文', '作', '者', ',', '表', '示', '敬', '意', '。'], ['】', '关', '注', '晨', '曦', '教', '育', '[UNK]', '[UNK]', '晨', '曦', '教', '育', '(', '微', '信', '号', ':', 'he', '##bc', '##x', '##jy', ')', '。'], ['打', '开', '微', '信', ',', '扫', '描', '二', '维', '码', ',', '关', '注', '[UNK]', '晨', '曦', '教', '育', '[UNK]', ',', '获', '取', '更', '多', '育', '儿', '资', '源', '。'], ['点', '击', '下', '面', '订', '阅', '按', '钮', '订', '阅', ',', '会', '有', '更', '多', '惊', '喜', '哦', '!']]
while i < len(document): # 从文档的第一个位置开始,按个往下看
segment = document[i] # segment是列表,代表的是按字分开的一个完整句子,如 segment=['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。']
# print("###i:",i,";segment:",segment)
current_chunk.append(segment) # 将一个独立的句子加入到当前的文本块中
current_length += len(segment) # 累计到为止位置接触到句子的总长度
if i == len(document) - 1 or current_length >= target_seq_length: # 如果累计的序列长度达到了目标的长度==>构造并添加到“A[SEP]B“中的A和B中。
if current_chunk: # 如果当前块不为空
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2: # 当前块,如果包含超过两个句子,怎取当前块的一部分作为“A[SEP]B“中的A部分
a_end = rng.randint(1, len(current_chunk) - 1)
# 将当前文本段中选取出来的前半部分,赋值给A即tokens_a
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
# 构造“A[SEP]B“中的B部分(原本的B有一部分是随机的从另一个文档中选取的,有一部分是正常的当前文档中的后半部)
tokens_b = []
# Random next
is_random_next = False
if len(current_chunk) == 1 or rng.random() < 0.5: # 有50%的概率,是从其他文档中随机的选取一个文档,并得到这个文档的后半版本作为B即tokens_b
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
random_document_index=0
for _ in range(10): # 随机的选出一个与当前的文档不一样的文档的索引
random_document_index = rng.randint(0, len(all_documents) - 1)
if random_document_index != document_index:
break
random_document = all_documents[random_document_index] # 选出这个文档
random_start = rng.randint(0, len(random_document) - 1) # 从这个文档选出一个段落的开始位置
for j in range(random_start, len(random_document)): # 从这个文档的开始位置到结束,作为我们的“A[SEP]B“中的B即tokens_b
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste. 这里是为了防止文本的浪费的一个小技巧
num_unused_segments = len(current_chunk) - a_end # e.g. 550-200=350
i -= num_unused_segments # i=i-num_unused_segments, e.g. i=400, num_unused_segments=350, 那么 i=i-num_unused_segments=400-350=50
# Actual next
else: # 有另外50%的几乎,从当前文本块(长度为max_sequence_length)中的后段中填充到tokens_b即“A[SEP]B“中的B。
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
# 把tokens_a & tokens_b加入到按照bert的风格,即以[CLS]tokens_a[SEP]tokens_b[SEP]的形式,结合到一起,作为最终的tokens; 也带上segment_ids,前面部分segment_ids的值是0,后面部分的值是1.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
# 创建masked LM的任务的数据 Creates the predictions for the masked LM objective
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance( # 创建训练实例的对象
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = [] # 清空当前块
current_length = 0 # 重置当前文本块的长度
i += 1 # 接着文档中的内容往后看
return instances
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
["index", "label"])
def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
# Whole Word Masking means that if we mask all of the wordpieces
# corresponding to an original word. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
rng.shuffle(cand_indexes)
if FLAGS.non_chinese==False: # if non chinese is False, that means it is chinese, then try to remove "##" which is added previously
output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t)) > 0 else t for t in tokens] # 去掉"##"
else: # english and other language, which is not chinese
output_tokens = list(tokens)
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
if FLAGS.non_chinese == False: # if non chinese is False, that means it is chinese, then try to remove "##" which is added previously
masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index] # 去掉"##"
else:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)
masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
# tf.logging.info('%s' % (tokens))
# tf.logging.info('%s' % (output_tokens))
return (output_tokens, masked_lm_positions, masked_lm_labels)
def create_masked_lm_predictions_original(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
# Whole Word Masking means that if we mask all of the wordpieces
# corresponding to an original word. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
rng.shuffle(cand_indexes)
output_tokens = list(tokens)
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)
masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
return (output_tokens, masked_lm_positions, masked_lm_labels)
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
"""Truncates a pair of sequences to a maximum sequence length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
assert len(trunc_tokens) >= 1
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if rng.random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
FLAGS.max_predictions_per_seq, output_files)
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("output_file")
flags.mark_flag_as_required("vocab_file")
tf.app.run()
================================================
FILE: create_pretraining_data_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
# coding=utf-8
"""Create masked LM/next sentence masked_lm TF examples for ALBERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import random
import numpy as np
import six
from six.moves import range
from six.moves import zip
import tensorflow as tf
from albert import tokenization
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string("input_file", None,
"Input raw text file (or comma-separated list of files).")
flags.DEFINE_string(
"output_file", None,
"Output TF example file (or comma-separated list of files).")
flags.DEFINE_string(
"vocab_file", None,
"The vocabulary file that the ALBERT model was trained on.")
flags.DEFINE_string("spm_model_file", None,
"The model file for sentence piece tokenization.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_whole_word_mask", True,
"Whether to use whole word masking rather than per-xWordPiece masking.")
flags.DEFINE_bool(
"do_permutation", False,
"Whether to do the permutation training.")
flags.DEFINE_bool(
"favor_shorter_ngram", False,
"Whether to set higher probabilities for sampling shorter ngrams.")
flags.DEFINE_bool(
"random_next_sentence", False,
"Whether to use the sentence that's right before the current sentence "
"as the negative sample for next sentence prection, rather than using "
"sentences from other random documents.")
flags.DEFINE_integer("max_seq_length", 512, "Maximum sequence length.")
flags.DEFINE_integer("ngram", 3, "Maximum number of ngrams to mask.")
flags.DEFINE_integer("max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence.")
flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
flags.DEFINE_integer(
"dupe_factor", 10,
"Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")
flags.DEFINE_float(
"short_seq_prob", 0.1,
"Probability of creating sequences which are shorter than the "
"maximum length.")
class TrainingInstance(object):
"""A single training instance (sentence pair)."""
def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
is_random_next, token_boundary):
self.tokens = tokens
self.segment_ids = segment_ids
self.is_random_next = is_random_next
self.token_boundary = token_boundary
self.masked_lm_positions = masked_lm_positions
self.masked_lm_labels = masked_lm_labels
def __str__(self):
s = ""
s += "tokens: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.tokens]))
s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
s += "token_boundary: %s\n" % (" ".join(
[str(x) for x in self.token_boundary]))
s += "is_random_next: %s\n" % self.is_random_next
s += "masked_lm_positions: %s\n" % (" ".join(
[str(x) for x in self.masked_lm_positions]))
s += "masked_lm_labels: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.masked_lm_labels]))
s += "\n"
return s
def __repr__(self):
return self.__str__()
def write_instance_to_example_files(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_files):
"""Create TF example files from `TrainingInstance`s."""
writers = []
for output_file in output_files:
writers.append(tf.python_io.TFRecordWriter(output_file))
writer_index = 0
total_written = 0
for (inst_index, instance) in enumerate(instances):
input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
token_boundary = list(instance.token_boundary)
assert len(input_ids) <= max_seq_length
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
token_boundary.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
multiplier = 1 + int(FLAGS.do_permutation)
while len(masked_lm_positions) < max_predictions_per_seq * multiplier:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
sentence_order_label = 1 if instance.is_random_next else 0
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(input_ids)
features["input_mask"] = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["token_boundary"] = create_int_feature(token_boundary)
features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
# Note: We keep this feature name `next_sentence_labels` to be compatible
# with the original data created by lanzhzh@. However, in the ALBERT case
# it does contain sentence_order_label.
features["next_sentence_labels"] = create_int_feature(
[sentence_order_label])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writers[writer_index].write(tf_example.SerializeToString())
writer_index = (writer_index + 1) % len(writers)
total_written += 1
if inst_index < 6:
tf.logging.info("*** Example ***")
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in instance.tokens]))
for feature_name in features.keys():
feature = features[feature_name]
values = []
if feature.int64_list.value:
values = feature.int64_list.value
elif feature.float_list.value:
values = feature.float_list.value
tf.logging.info(
"%s: %s" % (feature_name, " ".join([str(x) for x in values])))
for writer in writers:
writer.close()
tf.logging.info("Wrote %d total instances", total_written)
def create_int_feature(values):
feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return feature
def create_float_feature(values):
feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
return feature
def create_training_instances(input_files, tokenizer, max_seq_length,
dupe_factor, short_seq_prob, masked_lm_prob,
max_predictions_per_seq, rng):
"""Create `TrainingInstance`s from raw text."""
all_documents = [[]]
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
for input_file in input_files:
with tf.gfile.GFile(input_file, "r") as reader:
while True:
line = reader.readline()
if not FLAGS.spm_model_file:
line = tokenization.convert_to_unicode(line)
if not line:
break
if FLAGS.spm_model_file:
line = tokenization.preprocess_text(line, lower=FLAGS.do_lower_case)
else:
line = line.strip()
# Empty lines are used as document delimiters
if not line:
all_documents.append([])
tokens = tokenizer.tokenize(line)
if tokens:
all_documents[-1].append(tokens)
# Remove empty documents
all_documents = [x for x in all_documents if x]
rng.shuffle(all_documents)
vocab_words = list(tokenizer.vocab.keys())
instances = []
for _ in range(dupe_factor):
for document_index in range(len(all_documents)):
instances.extend(
create_instances_from_document(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
rng.shuffle(instances)
return instances
def create_instances_from_document(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob:
target_seq_length = rng.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
instances = []
current_chunk = []
current_length = 0
i = 0
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = rng.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
# Random next
is_random_next = False
if len(current_chunk) == 1 or \
(FLAGS.random_next_sentence and rng.random() < 0.5):
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = rng.randint(0, len(all_documents) - 1)
if random_document_index != document_index:
break
random_document = all_documents[random_document_index]
random_start = rng.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
elif not FLAGS.random_next_sentence and rng.random() < 0.5:
is_random_next = True
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
# Note(mingdachen): in this case, we just swap tokens_a and tokens_b
tokens_a, tokens_b = tokens_b, tokens_a
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
(tokens, masked_lm_positions,
masked_lm_labels, token_boundary) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance(
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
token_boundary=token_boundary,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = []
current_length = 0
i += 1
return instances
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
["index", "label"])
def _is_start_piece_sp(piece):
"""Check if the current word piece is the starting piece (sentence piece)."""
special_pieces = set(list('!"#$%&\"()*+,-./:;?@[\\]^_`{|}~'))
special_pieces.add(u"€".encode("utf-8"))
special_pieces.add(u"£".encode("utf-8"))
# Note(mingdachen):
# For foreign characters, we always treat them as a whole piece.
english_chars = set(list("abcdefghijklmnopqrstuvwhyz"))
if (six.ensure_str(piece).startswith("▁") or
six.ensure_str(piece).startswith("<") or piece in special_pieces or
not all([i.lower() in english_chars.union(special_pieces)
for i in piece])):
return True
else:
return False
def _is_start_piece_bert(piece):
"""Check if the current word piece is the starting piece (BERT)."""
# When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
return not six.ensure_str(piece).startswith("##")
def is_start_piece(piece):
if FLAGS.spm_model_file:
return _is_start_piece_sp(piece)
else:
return _is_start_piece_bert(piece)
def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
# Note(mingdachen): We create a list for recording if the piece is
# the starting piece of current token, where 1 means true, so that
# on-the-fly whole word masking is possible.
token_boundary = [0] * len(tokens)
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
token_boundary[i] = 1
continue
# Whole Word Masking means that if we mask all of the wordpieces
# corresponding to an original word.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
not is_start_piece(token)):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
if is_start_piece(token):
token_boundary[i] = 1
output_tokens = list(tokens)
masked_lm_positions = []
masked_lm_labels = []
if masked_lm_prob == 0:
return (output_tokens, masked_lm_positions,
masked_lm_labels, token_boundary)
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
# Note(mingdachen):
# By default, we set the probilities to favor longer ngram sequences.
ngrams = np.arange(1, FLAGS.ngram + 1, dtype=np.int64)
pvals = 1. / np.arange(1, FLAGS.ngram + 1)
pvals /= pvals.sum(keepdims=True)
if FLAGS.favor_shorter_ngram:
pvals = pvals[::-1]
ngram_indexes = []
for idx in range(len(cand_indexes)):
ngram_index = []
for n in ngrams:
ngram_index.append(cand_indexes[idx:idx+n])
ngram_indexes.append(ngram_index)
rng.shuffle(ngram_indexes)
masked_lms = []
covered_indexes = set()
for cand_index_set in ngram_indexes:
if len(masked_lms) >= num_to_predict:
break
if not cand_index_set:
continue
# Note(mingdachen):
# Skip current piece if they are covered in lm masking or previous ngrams.
for index_set in cand_index_set[0]:
for index in index_set:
if index in covered_indexes:
continue
n = np.random.choice(ngrams[:len(cand_index_set)],
p=pvals[:len(cand_index_set)] /
pvals[:len(cand_index_set)].sum(keepdims=True))
index_set = sum(cand_index_set[n - 1], [])
n -= 1
# Note(mingdachen):
# Repeatedly looking for a candidate that does not exceed the
# maximum number of predictions by trying shorter ngrams.
while len(masked_lms) + len(index_set) > num_to_predict:
if n == 0:
break
index_set = sum(cand_index_set[n - 1], [])
n -= 1
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
rng.shuffle(ngram_indexes)
select_indexes = set()
if FLAGS.do_permutation:
for cand_index_set in ngram_indexes:
if len(select_indexes) >= num_to_predict:
break
if not cand_index_set:
continue
# Note(mingdachen):
# Skip current piece if they are covered in lm masking or previous ngrams.
for index_set in cand_index_set[0]:
for index in index_set:
if index in covered_indexes or index in select_indexes:
continue
n = np.random.choice(ngrams[:len(cand_index_set)],
p=pvals[:len(cand_index_set)] /
pvals[:len(cand_index_set)].sum(keepdims=True))
index_set = sum(cand_index_set[n - 1], [])
n -= 1
while len(select_indexes) + len(index_set) > num_to_predict:
if n == 0:
break
index_set = sum(cand_index_set[n - 1], [])
n -= 1
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(select_indexes) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes or index in select_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
select_indexes.add(index)
assert len(select_indexes) <= num_to_predict
select_indexes = sorted(select_indexes)
permute_indexes = list(select_indexes)
rng.shuffle(permute_indexes)
orig_token = list(output_tokens)
for src_i, tgt_i in zip(select_indexes, permute_indexes):
output_tokens[src_i] = orig_token[tgt_i]
masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i]))
masked_lms = sorted(masked_lms, key=lambda x: x.index)
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary)
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
"""Truncates a pair of sequences to a maximum sequence length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
assert len(trunc_tokens) >= 1
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if rng.random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case,
spm_model_file=FLAGS.spm_model_file)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
tf.logging.info("number of instances: %i", len(instances))
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
FLAGS.max_predictions_per_seq, output_files)
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("output_file")
flags.mark_flag_as_required("vocab_file")
tf.app.run()
================================================
FILE: data/news_zh_1.txt
================================================
最后的南京老城该往何处去 城市化时代呼唤文化自觉
【概要】80后学者姚远出版《城市的自觉》一书 姚远出版《城市的自觉》 作者简介姚远,政治学博士,1981年出生于南京,1999年从金陵中学毕业后考入北京大学国际关系学院,负笈燕园十二载,获政治学博士学位。
现任教于南京大学政府管理学院。
在关系古都北京、南京等历史文化名城存废的历史关头,他锲而不舍地为抢救中华文明奔走呐喊。
2010年,他被中国文物保护基金会评为“中国文化遗产保护年度十大杰出人物”,当时的获奖评语是:一支?土耳其诗人纳齐姆·希克梅特曾深情地说:“人的一生有两样东西不会忘记,那就是母亲的面孔和城市的面貌。
”然而,前不久南京再次发生颜料坊地块市级文保单位两进建筑被毁的事件。
故宫博物院院长、原国家文物局局长单霁翔近日在宁直言,南京城南再遭损毁令他心痛。
南京老城“路在何方”?
2010年被中国文物保护基金会评为“中国文化遗产保护年度十大杰出人物”的80后学者、南京大学姚远老师所著的《城市的自觉》近日正式出版。
书中探索古城保护与复兴的建设性路径,值得南京的决策者们在颜料坊事件后再次深思。
江南时报记者黄勇疑问:城市化,是否迷失了文化自觉“目睹一座座古建筑的消失,行走在古城的废墟,想到梁思成说过的‘拆掉北京的一座城楼,就像割掉我的一块肉;扒掉北京的一段城墙,就像扒掉我的一层皮’,真是感同身受,我流泪了。
”这是姚远最让记者为之动容的一句话,也是《城市的自觉》一书中的“魂”。
包括南京在内,中国大多数城市正处于大拆除的时代,成片的历史街区在“旧城改造”的大旗下被不断夷为平地。
有专家称,这场“休克疗法式”的“改造”,对中华文脉的影响之深、之巨、之不可逆,堪称中国城市史上“三千年未有之大变局”。
《城市的自觉》正是在这种背景下,由北京大学出版社于近日出版的。
书中,姚远以情理交融的文字,辅之以背景、南京古城珍贵的最后影像,如实记录了在北京梁思成故居和宣南、东四八条、钟鼓楼等历史街区,南京颜料坊、南捕厅、门东、门西等历史街区的最后时刻,为阻挡推土机而屡败屡战的历程。
同时,又理性剖析了与存续城市记忆密切相关的文化自觉、物权保护、民生改善、公众参与等议题,探索古城保护与复兴的建设性路径。
为何要保老城?
很多人认为陈旧的老街区、老房子应该为摩天大楼让位,造高速路、摩天楼是现代化,“保护老古董”是抱残守缺,姚远却不是这种看法:“一些决策者并不知城市遗产保护恰恰是‘后工业’、‘后现代’的思想,比前者的理念差不多领先了一个世纪。
” 在他眼里,南京这座千年古城曾是“活”着的,老城里有最纯正的方言、最鲜活的民俗、最地道的小吃,简直是一座巨大的民俗博物馆。
“你可以在同老者的交谈中,听到一个个家族或老宅的兴衰故事。
这里的城与人,就是一本厚重的大书,它们用最生动的语言向你讲述不一样的‘城南旧事’。
”面对许多古城不断遭到大拆大建、拆真建假、拆旧建新的厄运,姚远痛心地说,“我们的城市化,是否迷失了自我认同,是否失去了文化自觉的能力?
在城市化的文化自觉重建之前,我们还将继续付出多少代价?
”现状:老城南仅剩不到1平方公里南京城曾有十九个别称,如秦淮、白下、建邺、江宁等,建城史更是长达两千五百年。
但如今,除去明城墙以及一些重点文物以及七零八落的民国建筑之外,这个城市跟中国其他的城市看上去并无太多区别,鳞次栉比的高楼大厦,车水马龙的宽阔街道,川流不息的红男绿女……持续多年的旧城改造,已经让南京老城日益失去古朴的历史风貌。
秦淮河畔的老城南,是南京文化的发源地,是南京的根。
在2006年前,尽管南京诸多的“殿、庙、塔、桥”已在兵火和变乱中消失,但秦淮河畔的老城南依然保存了文物丰富、风貌完整的历史街区。
然而,2006年,南京风云突起,突击对颜料坊、安品街等历史街区实施“危旧房改造”,拆毁大量文物建筑。
2009年又是一轮“危改”,大大的“拆”字,再次涂上了门东、门西、南捕厅等多片老街区。
2010年至今,南京先后出台了《南京市历史文化名城保护条例》《南京历史文化名城保护规划》《南京老城南历史城区保护规划与城市设计》,以法规的高度,回应了社会各界的诉求,明确要求对老城的整体保护。
姚远和其他学者联名提出的建议,有40处被采纳进了最后的《条例》中。
姚远告诉江南时报记者,南京的传统旧城区——老城南仅剩不到1平方公里,尚不及50平方公里老城总面积的2%,整体保护势在必行。
但他并不认为整体保护意味着“冻结不动”,而是强调古民居、古街巷和宏伟的古建筑一样重要,它们是古都特有的城市肌理,低矮的民居衬托高大的城阙,形成轮廓丰富的城市格局。
如果消灭了它们,名胜古迹就变成无法交融联络的“孤岛”,古都的整体风貌则无从谈起。
“对于金陵古城濒危的最后这点种子,实行‘整体保护’已经没有任何讨价还价的余地。
”《城市的自觉》一书中,姚远的声音振聋发聩。
方案:探索保护与整治的最大合力可惜的是,在专家学者与推土机的拉锯战中,前者基本还是处于下风的,即便是中央领导的几次批示,旧城改造的推土机依然我行我素,将一面面古墙碾在轮下。
颜料坊、牛市、门东等被“肢解”的老城南片区,如今多已竖起或正在建设房地产开发、商业项目。
2002年8月,姚远在南京颜料坊开始了古城保护的第一次拍摄。
如今牛市64号-颜料坊49号这座百年清代建筑却再遭破坏。
单霁翔近日在南大演讲中也表示,颜料坊再遭损毁令人心痛。
“我不认同南京老城南成片拆除,搬迁当地住户的改造方式。
简单地认为它的居住形式落后了,这种态度是消极的,没有给予作为代表地域特色的传统建筑的居住形式有尊严的呵护。
”《城市的自觉》一书中也多次提及南京老城不能“只见物,不见人”。
姚远强调,南京历史文化名城的保护,离不开对传统社区的活态保护。
老城南有丰富的民俗和古老的街区,是唇齿相依的一个整体。
拆去了老宅,迁走了居民,文化自然就成了无源之水、无本之木。
“国际上的成功经验表明,保护从来不是发展、民生、现代化的反义词。
”姚远建议,老城区的整治,可以在政府的指导和协助下,以居民为主体,通过社区互助的“自我修缮”的方式来实施,将“旧城区改建”从拆迁模式下的行政关系转变为修缮模式下的民事关系,最大限度地调动各方面的积极性,形成保护与整治的最大合力。
措施:用行动让法律“站起来”经历了两次保卫战,姚远对于文物保护方面的法律条文早已如数家珍。
在他看来,“法治”和“参与”这两个关键词尤为重要。
姚远认为,政府的很多失误是因为政策制定的封闭性,推土机开到门口时才告知公众。
公民参与,就要求行政更加透明、公开。
“几次保护后制定的政策或者法律法规,也很重要。
因为未来只要有人参与去触动,政策或者法律法规就能‘站起来’,变成一套强有力的程序,约束政府行为。
”“这些年古城保护的每一点进步,都离不开广泛的公众参与,都凝结着社会各界共同的努力。
”姚远认为,在北京、南京等许多古城,一批志愿者、社会人士和民间团体,在古城命运的危急关头,已经显示出日益崛起的公众参与的巨大力量。
“关键要有人能够站出来。
第一个人站出来,就会有第二个人跟上,专家和媒体也会介入,事情就能在公开博弈中得到较为合理的解决。
我国目前民间的文保力量正在逐渐成长,公民参与将成为构建良性社会机制的重要力量。
”姚远强调。
单霁翔对文化遗产保护中的公众参与也做出了高度评价。
他在《城市的自觉》的序中写道:“保护文化遗产绝不仅仅是各级政府和文物工作者的专利,只有广大民众真心地、持久地参与文化遗产保护,文化遗产才能得到最可靠的保障。
以姚远博士为代表的一批志愿者和社会人士,在我国文化遗产保护事业中已经显示出不可低估、无可替代的力量。
不是每一块石头,都能叫珠宝
对于很多人来说,矿石是长成这样的石头: 上图:铁矿石 上图:石 上图:煤矿石 上图:锡矿石如你所想象的那样,很多矿石都是又黑又丑,即使在野外遇到,也不会多看一眼的那种石头。
当然,也不是所有矿石都这么丑。
我们再看看下面这些矿石: 上图:赤铜 上图:钼铅矿 上图:方硼石 上图:自然硫 上图:云母这些矿石,能否让你感慨大自然的造化神奇?小伙伴们可能会想,这些漂亮的矿石,打磨以后就是漂亮的宝石啊,为什么我们不把他们加工成珠宝呢?这个是个好问题。
人类自古以来就没有停止过对美好事物的追求,凡漂亮的东西都可能被人们看上,成为制作饰品原料。
珠宝就是大自然赐予的美好的东西中的一种。
珠宝如果不美就不能成为珠宝,这种美或表现为绚丽的颜色,或表现为透明而洁净。
物以稀为贵,鸽血红级别的红宝石、矢车菊蓝级别的蓝宝石,每克拉价值上万美元,而某些颇美丽又可耐久的宝石(如白水晶),由于产量较多,开采较容易,其价格一直较低。
so,大家能明白了吧,不是每一块石头都能成为珠宝。
如果拥有珠宝,请务必珍惜。
目前1000+人已关注加入我们您看此文用· 秒,转发只需1秒呦~
北京市黄埔同学会接待“踏寻中山足迹学习之旅”台湾参访团
光明网讯(通讯员苏民军记者任生心)日前,由台湾中国统一联盟桃竹分会成员组成的“踏寻中山足迹学习之旅”参访团一行21人来到北京参观访问。
在北京市黄埔同学会的精心安排下,在京期间,参访团拜谒了中山先生衣冠冢,参观了卢沟桥、抗战纪念馆、抗战名将纪念馆和宋庆龄故居等;“踏寻中山足迹学习之旅”参访团还将赴南京中山堂等地参访。
在抗战纪念馆,参访团成员们认真聆听讲解员的介绍,仔细观看每张图片资料,回顾国共两党团结抗战的往事,缅怀那些为民族独立而壮烈牺牲的英雄。
而后,参访团一行来到位于京西香山深处的孙中山先生衣冠冢拜谒,参访团团长李尚贤(台湾中国统一联盟总会第一副主席兼秘书长)发表了简短的感言后,全体成员在孙中山雕像前三鞠躬,向孙中山先生致敬,缅怀孙中山先生以“三民主义”为宗旨的革命的一生。
随后,参访团一行又来到2009年建成的北京香麓园抗战名将纪念馆,瞻仰了佟麟阁将军墓,他们还参观了宋庆龄故居。
鼎丰(08056.HK)向客户借出5000万人币 月息1.75厘 为期一年
鼎丰集团控股(08056.HK)+0.030(+1.345%)公布,同意将一笔5000万元人民币的款项委托予贷款银行,以供转借予客户,贷款期为十二个月,月息1.75厘。
(报价延迟最少十五分钟。
在青岛不买房,居然能拥有这么多东西!
这段时间青岛房价扶摇直上闹得人心惶惶这不,青岛房市,又在国庆节火了一把 国庆5天内16城启动楼市限购一时之间楼市风云大转纵观9月份青岛一手房均价怎么也有一万三四了看完十三哥默默地回去工作了 按照一套房子100平米计算购买一套房子大概需要130万在青岛,买一套房子怎么也得需要130万如果这些钱不买房能在全世界各地买什么呢?
今天,小编就带大家(bai)感(ri)受(meng)一下在西班牙能买3.4个村庄 一位英国人,名叫尼尔·克里斯蒂,在西班牙农村西北部一个田园地区买下了一处村庄(阿鲁纳达),只花费了4.5万欧元(约合35.6万人民币)。
简直便宜到吐血,这点钱要是在青岛的豪宅区,恐怕厕所都买不了。
如果选的地方靠近旅游景区,稍微装修一下,变成一个度假村……妥妥的壕啊,画面太美,不敢想象……在爱尔兰差不多能买个小岛 Inishdooney岛,位于北爱尔兰西北部,售价14万英镑(约合139万人民币)。
约38万平方米的无人居住地有淡水池塘、天然溶洞和鹅卵石海滩,美翻了有木有!
一个小岛的钱,和青岛一个水泥格子的价格差不多。
不要拦着最懂妹,我要去爱尔兰做岛主!
在巴厘岛能买2座别墅 巴厘岛,蓝天、碧水、白云,美的像梦一样,而你知道吗,这座世界著名旅游岛一个小镇的别墅只要10.7万美元,也就是不到70万人民币,青岛买房那点钱都够买两栋别墅了。
在巴厘岛拥有两座别墅是什么概念?
发完文章小编就去买机票!
在美国能买1驾小飞机 美国塞斯纳C172R型,最大航程可达1270公里,飞机上具备GPS导航定位系统、自动驾驶、盲降设备等,价格大概在17万美元左右,也就是104万人民币。
在青岛买房的钱妥妥的够买一架飞机了。
直接移民去西班牙 一个以阳光和沙滩吸引着无数游客的国家,有着激情的足球和斗牛文化、独特的海鲜美食、发达的时装行业、热情火辣的西班牙女郎...... 直接去西班牙?
你以为我在搞笑?
西班牙有个买房移民的政策,在西班牙的指定区域购买当地售价在170万人民币以上的房产就可以办理多次往返签证了,然后你待够10年,就可以入西班牙国籍了。
买一大堆LV手袋 十三哥相信很多女孩应该都很喜欢LV手袋。
这款极具魅力的CHAIN LOUISE手袋价格为2.04万人民币。
随随便便买一堆!
带着爱人环游世界 微博上那对香港80后小夫妻历时308天花费16万人民币走遍了37国,你们还记得吗?
按照他们的行程,你几乎就能去环游世界了。
什么也不用想,痛痛快快环游地球一圈!
在澳大利亚当农场主 五卧室、三浴室的大房子,还有德尼利昆镇附近一块27英亩的农场。
只需要美元价格14.4万美元(≈96万人民币),是不是惊呆了!
哦,对了,澳大利亚还提供住房贷款业务哟!
十三哥要挣钱去澳大利亚买牧场!
在莫斯科买下1座别墅 莫斯科市中心双卧室、双浴室的豪华大别墅,你觉得多少钱?
千万别吃惊,美元价格在15.2万美元左右(≈100.1万人民币)。
虽然在这个城市生活总会有各种各样的压力我们必须十分努力才能看起来毫不费力但是我们永远保持一颗向上的心不气馁,好好加油!
[海尔地产世纪公馆]新都心2期升级新品9月底推出 海尔地产世纪公馆二期规划8栋高层住宅,预计9月底推出,认筹中,交2.5万享99折优惠,预计均价17000-18000元/平。
户型面积区间89-162平,主力120-140平品质改善产品。
125-126平为套三,142-162平为套四。
海尔地产世纪公馆一户一价,以上价格仅供参考,所有在售户型价格以售楼处公布为准。
咨询电话:400-099-0099 转 27724[金隅和府]3大商圈环绕地铁房18000元 金隅和府一户一价,以下价格仅供参考,所有在售户型价格以售楼处公布为准。
金隅和府预计9月20日加推6#楼(24F)楼王,3个单元,1梯2户,户型面积为90平套二,122平、138平套三,团购交1万团购金、10万认筹金可以享受97折优惠,预计均价18000-26000元/平。
金隅和府位于镇江路12号,近邻山东路、延吉路、东西快速路等三横三纵交通网、未来享地铁M5之便利;CBD商圈、香港路商圈、台东商圈3大商圈环绕,居住生活便利。
直播拐点来临:未来直播APP开发还有哪些趋势?
趋势一:巨头收割直播价值,依赖巨头扶持的直播平台存活几率更高尽管一线垂直领域已经被巨头的直播平台占领,但创业者依然还有机会。
未来在泛娱乐社交、游戏、美妆电商等核心领域必然会有几家直播平台具有突出优势,而这些具备突出优势的直播平台很可能会被BAT入股收购或者收编,因此如果能够获得巨头的资本输血与流量扶持,往往存活的几率会更大。
趋势二:直播平台从争抢网红到争抢明星资源明星+粉丝经济+直播平台,很可能会衍生出新型的整合营销方式。
即怎样通过可购买价值的内容设定,运营好与粉丝之间的感情沟通,让粉丝群体进行持续性参与并进行情感消费投入,直播平台与明星组合叠加的人气效应与非理性消费的频次也非常契合品牌商的需求。
因此,直播的未来趋势将从争抢网红资源到争抢明星资源。
这是直播平台孕育粉丝经济进而带来新型的情感消费与商业模式的要走的一条必要的路径。
而未来可能会有越来越多的品牌商更愿意尝试这种直播互动带来的品牌曝光机会与商业变现模式。
趋势三:从泛娱乐明星网红直播转入到二级垂直细分市场的专业直播泛娱乐直播内容属性上由于其单一、无聊的直播内容无法构成平台的核心竞争力,直播平台未来大趋势是从泛娱乐直播转入到内涵直播。
目前部分视频直播平台已针对财经、育儿、时尚、体育、美食等垂直领域的自频道开放直播权限,内容的差异化与垂直化可以为直播平台带来新的商业模式,平台也可以通过优质的直播内容,产生付费、会员、打赏以及直播购物等盈利模式。
因为目前缺乏真正有价值的直播,多数直播平台在内容供给侧是存在问题的,网红要提升自身与粉丝之间的黏性,显然需要差异化的内容,而从目前的欧美网红与直播内容的发展规律来看,更健康、更有价值与内涵的直播内容成为未来的发展趋势之一。
趋势四:网红孵化器批量生产网红 将走向专业化由于在网红包装、传播、变现等方面具备专业的运营能力,网红孵化器未来须具备 “经纪人+代运营+供应链+网红星探”等多重角色,向专业网红群聚捆绑者向提供专业化的服务与垂直领域专家型、特长型、个性型网红培养者与发现者这一定位转型。
借助在用户洞察、网红运营、电商管理方面的精良团队,需要打通粉丝营销和电商运营,并将网红、粉丝,平台、内容,品牌、供应链,进行有效链接及整合。
趋势五:C端直播洗牌 B端企业直播崛起带动专业的商务直播需求目前,各种企业的商务发布会、沙龙、座谈、讲座、渠道大会、教育培训等方面直播需求强烈,在企业进行移动视频直播的需求推动下,它们开始寻求低成本、快速的搭建属于自己的高清视频直播平台的模式,而企业搭建视频直播平台需要专业的技术能力的服务商来应对这种需求。
用户可以通过微信直接观看企业直播参与互动,让直播突破空间场地的限制,某种程度也代表直播产业链的一个接入的发展方向。
趋势六:解决直播用户体验与新媒体营销,移动直播服务商将迎来新的机会直播行业进入了各行各业均可参与,并将直播作为企业服务工具的直播+时代,而玩转直播+,从技术、营销、服务、内容,进而可以衍生出更多的直播服务盈利。
而对于解决直播体验背后的移动直播服务商,也将迎来新的机会。
趋势七:直播或成为企业的标配,可能为企业带来更多转化率当直播火爆起来的时候,人们要关注的不仅仅是行业能火爆多久,它的商业模式是否成熟,在洗牌节点来临与巨头羽翼覆盖下,自身还有没有机会,创业者与企业都应该从中寻找自己的机会与跨界领域的嫁接。
它不仅仅是内容和流量的变现工具,更应该是一种营销与商业理念的转变。
不久前,马化腾向青年创业者建议,要关注两个产业跨界的部分,因为将新技术用在两个产业跨界部分往往最有可能诞生创新的机会。
而企业营销如果能从垂直细分领域的切入并借助直播技术与趋势为已所用,往往也能获得新的机会,尽管任何基于行业趋势的预测都意味着不确定性,但抓住不确定性的机会,才能最终在新一轮风口下,把握企业转型与商业、营销模式创新的机会,迎来属于自己的时代。
欢迎互联网创业者加入杭州互联网创业QQ群:157936473直接加QQ或pc上点击加群项目开发咨询:0571-28030088
邓伟根北美硅谷行“捎回”一个MBA授课点
南都讯记者郭伟豪通讯员伍新宇6月7日至16日,佛山市委常委、南海区委书记、佛山高新区党工委书记兼管委会主任邓伟根率领由南海区和佛山高新区相关人员组成的经贸洽谈和友好交流代表团,对新加坡、美国和加拿大进行友好访问。
由于新加坡裕廊、美国硅谷与有“加拿大高科技之都”美誉的万锦市均以发达的高科技产业著名,皆是所在国的硅谷,邓伟根更称此行为“三谷”之行。
在新加坡,邓伟根一行与新加坡淡马锡控股公司相关负责人就双方进一步深化合作进行了深入的探讨。
交流中,新加坡国立大学(N U S)商学院杨贤院长表示有意在南海设立N U S的海外M B A授课点,双方拟于6月下旬就有关意向在南海签订合作协议。
6月9日,邓伟根一行前往硅谷拜会了硅谷美华科技商会(S V C A C A )和华美半导体协会(C A SPA )。
SV C A C A和CA SPA将通过其广泛的会员和在硅谷等地的影响力,为佛高区、南高区在硅谷进行宣传推介,并积极把有意拓展中国市场的高科技项目推荐到南高区。
代表团一行还到访了南海区政府与万锦市政府联合举办了“南海区与万锦市经贸交流会”。
2012年12月,万锦市市长薛家平先生率团访问南海后,万锦市议会正式通过了为当地一道路命名“南海街”的议案,并于2013年9月举行道路命名仪式。
在本次交流中,邓伟根提议未来也在南海选址命名一条“万锦路”,此举也立即得到薛家平市长的认同。
对于“三谷”之行,邓伟根表示,南海将利用现有的南海乡亲和关系密切的协会等有利资源,计划在“三谷”建立南海和佛高区的海外联络处,学习和吸收海外高科技之都的先进经验,努力将已定位为“中国制造金谷”的佛高区南海核心园打造成为下一个“硅谷”,并争取早日实现佛高区挺进全国国家高新区20强的目标。
内地高中生将通篇学习《道德经》
摘要国内第一套自主研发的高中传统文化通识教材预计将于今年9月出版,四册分别为《论语》《孟子》《大学·中庸》和《道德经》。
2016年高考改革方案中,全国25个省高考要统一命题,并且增加分数后的语文考试,正在研究增加“中华优秀传统文化”之相关内容。
《道德经》成为高中传统文化教材。
法制晚报讯(记者 李文姬 )今天上午,记者从“十二五”教育部规划课题《传统文化与中小学生人格培养研究》总课题组了解到,国内第一套自主研发的高中传统文化通识教材预计将于今年9月出版,四册分别为《论语》《孟子》《大学·中庸》和《道德经》。
至此,课题组已完成了幼儿园、小学、初中、高中各阶段标准化传统文化教材的研发工作,高中国学教材将在各地开展成规模的教材试用工作。
中国国学文化艺术中心秘书长张健表示,目前各地高考改革的几个信号均指向国学,但考什么、怎么考又是一个难题。
专家建议,不应以文言文字词解释等传统形式考查,应关注考生如何消化吸收传统文化中的哲学素养和思想韬略。
教材各年级国学内容全覆盖据 “十二五”教育部规划课题《传统文化与中小学生人格培养研究》总课题组介绍,高中传统文化通识系列教材作为“十一五”、“十二五”两个阶段十年课题研究的重要成果之一,由中国国学文化艺术中心承担资源整合和编著。
去年,教育部印发了《完善中华优秀传统文化教育指导纲要》,要求在课程建设和课程标准修订中强化中华优秀传统文化内容。
在中小学德育、语文、历史等课程标准修订中,增加中华优秀传统文化的比重。
课题组秘书长张健表示,幼儿园、小学、初中、高中各阶段标准化传统文化教材的均已研发完成,明确提出以“青少年完美人格”为传统文化教育目标,教材知识相互关联,自成体系,并通过高中教材实现最终教学评价。
这是“十一五”“十二五”两个阶段十年课题研究的重要成果之一。
今年5月份之前,《高等教育传统文化教材》(12册)《全国行政领导干部国学教材》(10册)两套教材也将研发完毕。
内容高中教材含《论语》《道德经》此次即将出版的高中阶段传统文化通识教材共有4册,供高中一、二年级使用。
高一学习《论语》《孟子》,高二学习《大学·中庸》和《道德经》。
其中《道德经》为原文全本讲解,另外三册则是按主题归类讲解。
如《大学·中庸》一册,分为“慎独”“齐家”“格物致知”“中和”“为政”等章节。
据课题组专家介绍,这4册书并非孤立的高中教材,而是《中华优秀传统文化教育全国中小学实验教材》的高中部分。
全套教材包含小学、初中和高中三个阶段,经专家组反复研讨、论证,制定了“儒学养正、兵学相佑、道法自然、文化浸润”的课程结构,各阶段教学内容和深度循序渐进、系统科学。
事实上,小学高年级段已开始涉及《论语》《孟子》等儒学典籍,但仅以诵读和简单理解为主,到高中阶段,学生可在已有基础上更为深刻地领悟儒道经典的思想内涵,以达到融会贯通的程度。
此外,每一章节在讲解儒道核心精神的同时,还为学生提供了大量中西文化比较等拓展阅读素材。
针对公众关注的一个话题,即传统文化有望成为高考的新考点,课题组表示目前在研发高中传统文化教材的同时,就已开展了另一个重点子课题研究,即传统文化教学评价与考试模式研究。
张健强调高考改革的几个信号均指向国学,例如北京、上海等地公布的高考改革方案中,英语降分后其所降分数分给了语文,而且还更进一步明确指出了就是将分数转移给所增加的“传统文化考试内容”部分。
又如今年清华北大自主招生均招收国学特长生。
此外,近期公布的2016年高考改革方案中,全国25个省高考要统一命题,并且增加分数后的语文考试,正在研究增加“中华优秀传统文化”之相关内容。
张健表示,传统文化成为高考的又一创新考点指日可待,但考什么、怎么考又是一个重大难题。
由于相关子课题研究还没有结束,课题组非行政机构只承担建议义务。
张健坦言,能否在高考语文中出现一个新的形式——政论或申论形式的传统文化论述题,这一方向应该是研究和创新的改革方向之一。
若2016年传统文化进入高考,最大的问题是很多高中生没有接触过传统文化课程,不具备相关知识储备和素养,国学文化是通过长期熏陶和涵养才能显现的,不是靠一朝一夕突击补课就能拥有的。
悬灸技术培训专家教你艾灸降血糖,为爸妈收好了!
近年来随着我国经济条件的改善和人们生活水平的提高,我国糖尿病的患病率也在逐年上升。
悬灸技术培训的创始人艾灸专家刘全军先生对糖尿病深有研究,接下来,学一学他是怎么用艾灸降血压的吧!
中医认为,糖尿病是气血、阴阳失调等多种原因引起的一种慢性疾病。
虽然分为上消、中消、下消,但是无论何种糖尿病 ,治疗的原则都是荣养阴液,清热润燥。
艾灸对控制血糖效果不错。
艾灸功效:调升元阳降血糖艾灸可以修复受损胰岛细胞,激活再生,逐步实现胰岛素的自给自足。
服药一天比一天少,身体一天比一天好,彻底摆脱终生服药!
还可以双向调节血糖,使血糖老老实实地锁定在正常的恒定值范围。
也可以改善组织供氧,对微血管病变导致的视物不清、眼底出血等视网膜病变及早期肾病病变及早期肾病病变有明显治疗与改善作用,改善病人消瘦无力、免疫力低下、低蛋白质血证及伤口不愈等现象。
艾灸取穴糖尿病艾灸过的穴位有,承浆中脘足三里关元曲骨三阴交、期门太冲下脘天枢气海膈俞膻中、胃俞,这么多穴位可根据患者当时的症状进行选取。
选取后艾灸,每10天为一个疗程,疗程间休息3-5天后继续第二轮的治疗,三个疗程基本可见到理想疗效。
这几个穴位都是具有补充人体元阳功能的大穴和调节脏腑功能的腧穴,从根上调节人体的元阳使阴阳达到新的平衡,五脏六腑尤其是肺、脾肾的功能恢复正常,糖尿病自然也就不药而愈了。
艾灸可以有效控制糖尿病 ,这在很多资料都有报导。
艾灸使病人的营养能得到有效的吸收和利用,从而提高人体的自身免疫功能和抗病防病能力,防止了系列并发症的发生,真正做到综合治疗,标本兼治。
艾灸对于常见病是具有广泛的适应性的。
希望大家把艾灸推广出去,让艾灸这个疗法能够更完善,造福更多的人。
熟食放在垃圾旁无照窝点被取缔
本报讯(记者李涛)又黑又脏的墙面、随意堆放的加工原料、处处弥漫的刺鼻味道。
昨天上午,东小口镇政府与城管、食药、公安等部门开展联合执法行动时,依法取缔了一个位于昌平区东小口镇半截塔村的非法熟食加工窝点。
昨天上午,执法人员对东小口镇半截塔村进行环境整治时,一家挂着“久久鸭”招牌的小店的店主显得有点紧张,还“顺手”把通向后院的门关上了。
执法人员觉得有些蹊跷,便要求到后院进行检查。
一进院子,执法人员就发现大量的熟食加工原料被随意摆放在地上,旁边就堆放着垃圾。
院内煤炉上的一口锅内正煮着的食物,发出刺鼻的味道。
执法队员介绍,在炉子一旁的笸箩里盛着制作好的熟食制品,但却没有任何遮盖,一阵风起,煤灰混着尘土就落在上面。
执法队员说:“走进院旁的小屋内,地上和墙上满是油污,脏乎乎的冰柜上堆放着一袋一袋的半成品,一个个用来盛放熟食制品的笸箩摞在生锈的铁架子上。
”随后,执法人员仔细查找,没有发现任何消毒设施,调查得知从事加工的人员也没有取得加工熟食应需的健康证。
执法人员随后对店主进行询问,当执法人员要求出示营业执照及卫生许可证时,店主嘟囔了半天才坦白自己不具备任何手续。
执法人员当即对该非法生产窝点进行了取缔,对现场工作人员进行了宣传与教育,并依法没收了加工工具及食品。
================================================
FILE: lamb_optimizer_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""Functions and classes related to optimization (weight updates)."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import six
import tensorflow as tf
# pylint: disable=g-direct-tensorflow-import
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import linalg_ops
from tensorflow.python.ops import math_ops
# pylint: enable=g-direct-tensorflow-import
class LAMBOptimizer(tf.train.Optimizer):
"""LAMB (Layer-wise Adaptive Moments optimizer for Batch training)."""
# A new optimizer that includes correct L2 weight decay, adaptive
# element-wise updating, and layer-wise justification. The LAMB optimizer
# was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,
# James Demmel, and Cho-Jui Hsieh in a paper titled as Reducing BERT
# Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962)
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
exclude_from_layer_adaptation=None,
name="LAMBOptimizer"):
"""Constructs a LAMBOptimizer."""
super(LAMBOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
# exclude_from_layer_adaptation is set to exclude_from_weight_decay if the
# arg is None.
# TODO(jingli): validate if exclude_from_layer_adaptation is necessary.
if exclude_from_layer_adaptation:
self.exclude_from_layer_adaptation = exclude_from_layer_adaptation
else:
self.exclude_from_layer_adaptation = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=six.ensure_str(param_name) + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=six.ensure_str(param_name) + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
ratio = 1.0
if self._do_layer_adaptation(param_name):
w_norm = linalg_ops.norm(param, ord=2)
g_norm = linalg_ops.norm(update, ord=2)
ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)
update_with_lr = ratio * self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _do_layer_adaptation(self, param_name):
"""Whether to do layer-wise learning rate adaptation for `param_name`."""
if self.exclude_from_layer_adaptation:
for r in self.exclude_from_layer_adaptation:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
if m is not None:
param_name = m.group(1)
return param_name
================================================
FILE: modeling.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The main BERT model and related functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import copy
import json
import math
import re
import numpy as np
import six
import tensorflow as tf
import bert_utils
class BertConfig(object):
"""Configuration for `BertModel`."""
def __init__(self,
vocab_size,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
initializer_range=0.02):
"""Constructs BertConfig.
Args:
vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`BertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
@classmethod
def from_dict(cls, json_object):
"""Constructs a `BertConfig` from a Python dictionary of parameters."""
config = BertConfig(vocab_size=None)
for (key, value) in six.iteritems(json_object):
config.__dict__[key] = value
return config
@classmethod
def from_json_file(cls, json_file):
"""Constructs a `BertConfig` from a json file of parameters."""
with tf.gfile.GFile(json_file, "r") as reader:
text = reader.read()
return cls.from_dict(json.loads(text))
def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output
def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
class BertModel(object):
"""BERT model ("Bidirectional Encoder Representations from Transformers").
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.BertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
...
```
"""
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for BertModel.
Args:
config: `BertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".
Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
if not is_training:
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids, but use stype of factorized embedding parameterization from albert. add by brightmart, 2019-09-28
(self.embedding_output, self.embedding_table,self.embedding_table_2) = embedding_lookup_factorized(
input_ids=input_ids,
vocab_size=config.vocab_size,
hidden_size=config.hidden_size,
embedding_size=config.embedding_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
ln_type=config.ln_type
print("ln_type:",ln_type)
if ln_type=='postln' or ln_type is None: # currently, base or large of albert used post-LN structure
print("old structure of transformer.use: transformer_model,which use post-LN")
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
else: # xlarge or xxlarge of albert, used pre-LN structure
print("new structure of transformer.use: prelln_transformer_model,which use pre-LN")
self.all_encoder_layers = prelln_transformer_model( # change by brightmart, 4th, oct, 2019. pre-Layer Normalization can converge fast and better. check paper: ON LAYER NORMALIZATION IN THE TRANSFORMER ARCHITECTURE
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True,
shared_type='all') # do_return_all_layers=True
self.sequence_output = self.all_encoder_layers[-1] # [batch_size, seq_length, hidden_size]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
def get_pooled_output(self):
return self.pooled_output
def get_sequence_output(self):
"""Gets final hidden layer of encoder.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output
def get_all_encoder_layers(self):
return self.all_encoder_layers
def get_embedding_output(self):
"""Gets output of the embedding lookup (i.e., input to the transformer).
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the embedding layer, after summing the word
embeddings with the positional embeddings and the token type embeddings,
then performing layer normalization. This is the input to the transformer.
"""
return self.embedding_output
def get_embedding_table(self):
return self.embedding_table
def get_embedding_table_2(self):
return self.embedding_table_2
def gelu(x):
"""Gaussian Error Linear Unit.
This is a smoother version of the RELU.
Original paper: https://arxiv.org/abs/1606.08415
Args:
x: float Tensor to perform activation.
Returns:
`x` with the GELU activation applied.
"""
cdf = 0.5 * (1.0 + tf.tanh(
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
return x * cdf
def get_activation(activation_string):
"""Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
Args:
activation_string: String name of the activation function.
Returns:
A Python function corresponding to the activation function. If
`activation_string` is None, empty, or "linear", this will return None.
If `activation_string` is not a string, it will return `activation_string`.
Raises:
ValueError: The `activation_string` does not correspond to a known
activation.
"""
# We assume that anything that"s not a string is already an activation
# function, so we just return it.
if not isinstance(activation_string, six.string_types):
return activation_string
if not activation_string:
return None
act = activation_string.lower()
if act == "linear":
return None
elif act == "relu":
return tf.nn.relu
elif act == "gelu":
return gelu
elif act == "tanh":
return tf.tanh
else:
raise ValueError("Unsupported activation: %s" % act)
def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
"""Compute the union of the current variables and checkpoint variables."""
assignment_map = {}
initialized_variable_names = {}
name_to_variable = collections.OrderedDict()
for var in tvars:
name = var.name
m = re.match("^(.*):\\d+$", name)
if m is not None:
name = m.group(1)
name_to_variable[name] = var
init_vars = tf.train.list_variables(init_checkpoint)
assignment_map = collections.OrderedDict()
for x in init_vars:
(name, var) = (x[0], x[1])
if name not in name_to_variable:
continue
assignment_map[name] = name
initialized_variable_names[name] = 1
initialized_variable_names[name + ":0"] = 1
return (assignment_map, initialized_variable_names)
def dropout(input_tensor, dropout_prob):
"""Perform dropout.
Args:
input_tensor: float Tensor.
dropout_prob: Python float. The probability of dropping out a value (NOT of
*keeping* a dimension as in `tf.nn.dropout`).
Returns:
A version of `input_tensor` with dropout applied.
"""
if dropout_prob is None or dropout_prob == 0.0:
return input_tensor
output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
return output
def layer_norm(input_tensor, name=None):
"""Run layer normalization on the last dimension of the tensor."""
return tf.contrib.layers.layer_norm(
inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
"""Runs layer normalization followed by dropout."""
output_tensor = layer_norm(input_tensor, name)
output_tensor = dropout(output_tensor, dropout_prob)
return output_tensor
def create_initializer(initializer_range=0.02):
"""Creates a `truncated_normal_initializer` with the given range."""
return tf.truncated_normal_initializer(stddev=initializer_range)
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1]) # shape of input_ids is:[ batch_size, seq_length, 1]
embedding_table = tf.get_variable( # [vocab_size, embedding_size]
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
flat_input_ids = tf.reshape(input_ids, [-1]) # one rank. shape as (batch_size * sequence_length,)
if use_one_hot_embeddings:
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) # one_hot_input_ids=[batch_size * sequence_length,vocab_size]
output = tf.matmul(one_hot_input_ids, embedding_table) # output=[batch_size * sequence_length,embedding_size]
else:
output = tf.gather(embedding_table, flat_input_ids) # [vocab_size, embedding_size]*[batch_size * sequence_length,]--->[batch_size * sequence_length,embedding_size]
input_shape = get_shape_list(input_ids) # input_shape=[ batch_size, seq_length, 1]
output = tf.reshape(output,input_shape[0:-1] + [input_shape[-1] * embedding_size]) # output=[batch_size,sequence_length,embedding_size]
return (output, embedding_table)
def embedding_lookup_factorized(input_ids, # Factorized embedding parameterization provide by albert
vocab_size,
hidden_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor, but in a factorized style followed by albert. it is used to reduce much percentage of parameters previous exists.
Check "Factorized embedding parameterization" session in the paper.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
# 1.first project one-hot vectors into a lower dimensional embedding space of size E
print("embedding_lookup_factorized. factorized embedding parameterization is used.")
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1]) # shape of input_ids is:[ batch_size, seq_length, 1]
embedding_table = tf.get_variable( # [vocab_size, embedding_size]
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
flat_input_ids = tf.reshape(input_ids, [-1]) # one rank. shape as (batch_size * sequence_length,)
if use_one_hot_embeddings:
one_hot_input_ids = tf.one_hot(flat_input_ids,depth=vocab_size) # one_hot_input_ids=[batch_size * sequence_length,vocab_size]
output_middle = tf.matmul(one_hot_input_ids, embedding_table) # output=[batch_size * sequence_length,embedding_size]
else:
output_middle = tf.gather(embedding_table,flat_input_ids) # [vocab_size, embedding_size]*[batch_size * sequence_length,]--->[batch_size * sequence_length,embedding_size]
# 2. project vector(output_middle) to the hidden space
project_variable = tf.get_variable( # [embedding_size, hidden_size]
name=word_embedding_name+"_2",
shape=[embedding_size, hidden_size],
initializer=create_initializer(initializer_range))
output = tf.matmul(output_middle, project_variable) # ([batch_size * sequence_length, embedding_size] * [embedding_size, hidden_size])--->[batch_size * sequence_length, hidden_size]
# reshape back to 3 rank
input_shape = get_shape_list(input_ids) # input_shape=[ batch_size, seq_length, 1]
batch_size, sequene_length, _=input_shape
output = tf.reshape(output, (batch_size,sequene_length,hidden_size)) # output=[batch_size, sequence_length, hidden_size]
return (output, embedding_table, project_variable)
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
output += position_embeddings
output = layer_norm_and_dropout(output, dropout_prob)
return output
def create_attention_mask_from_input_mask(from_tensor, to_mask):
"""Create 3D attention mask from a 2D tensor mask.
Args:
from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
to_mask: int32 Tensor of shape [batch_size, to_seq_length].
Returns:
float Tensor of shape [batch_size, from_seq_length, to_seq_length].
"""
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_shape = get_shape_list(to_mask, expected_rank=2)
to_seq_length = to_shape[1]
to_mask = tf.cast(
tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
# We don't assume that `from_tensor` is a mask (although it could be). We
# don't actually care if we attend *from* padding tokens (only *to* padding)
# tokens so we create a tensor of all ones.
#
# `broadcast_ones` = [batch_size, from_seq_length, 1]
broadcast_ones = tf.ones(
shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
# Here we broadcast along two dimensions to create the mask.
mask = broadcast_ones * to_mask
return mask
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention based on "Attention
is all you Need". If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-with vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
In practice, the multi-headed attention are done with transposes and
reshapes rather than actual separate tensors.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
size_per_head: int. Size of each attention head.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
true, this will be of shape [batch_size * from_seq_length,
num_attention_heads * size_per_head]).
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
seq_length, width):
output_tensor = tf.reshape(
input_tensor, [batch_size, seq_length, num_attention_heads, width])
output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
return output_tensor
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
from_tensor_2d = reshape_to_matrix(from_tensor)
to_tensor_2d = reshape_to_matrix(to_tensor)
# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))
# `key_layer` = [B*T, N*H]
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act,
name="key",
kernel_initializer=create_initializer(initializer_range))
# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act,
name="value",
kernel_initializer=create_initializer(initializer_range))
# `query_layer` = [B, N, F, H]
query_layer = transpose_for_scores(query_layer, batch_size,
num_attention_heads, from_seq_length,
size_per_head)
# `key_layer` = [B, N, T, H]
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
to_seq_length, size_per_head)
# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))
if attention_mask is not None:
# `attention_mask` = [B, 1, F, T]
attention_mask = tf.expand_dims(attention_mask, axis=[1])
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
attention_scores += adder
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
value_layer,
[batch_size, to_seq_length, num_attention_heads, size_per_head])
# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
# `context_layer` = [B, N, F, H]
context_layer = tf.matmul(attention_probs, value_layer)
# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
if do_return_2d_tensor:
# `context_layer` = [B*F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
# `context_layer` = [B, F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size, from_seq_length, num_attention_heads * size_per_head])
return context_layer
def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
intermediate_act_fn=gelu,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False,
share_parameter_across_layers=True):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = int(hidden_size / num_attention_heads)
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
prev_output = reshape_to_matrix(input_tensor)
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
if share_parameter_across_layers:
name_variable_scope="layer_shared"
else:
name_variable_scope="layer_%d" % layer_idx
# share all parameters across layers. add by brightmart, 2019-09-28. previous it is like this: "layer_%d" % layer_idx
with tf.variable_scope(name_variable_scope, reuse=True if (share_parameter_across_layers and layer_idx>0) else False):
layer_input = prev_output
with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
attention_head = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
intermediate_output = tf.layers.dense(
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
layer_output = dropout(layer_output, hidden_dropout_prob)
layer_output = layer_norm(layer_output + attention_output)
prev_output = layer_output
all_layer_outputs.append(layer_output)
if do_return_all_layers:
final_outputs = []
for layer_output in all_layer_outputs:
final_output = reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
final_output = reshape_from_matrix(prev_output, input_shape)
return final_output
def get_shape_list(tensor, expected_rank=None, name=None):
"""Returns a list of the shape of tensor, preferring static dimensions.
Args:
tensor: A tf.Tensor object to find the shape of.
expected_rank: (optional) int. The expected rank of `tensor`. If this is
specified and the `tensor` has a different rank, and exception will be
thrown.
name: Optional name of the tensor for the error message.
Returns:
A list of dimensions of the shape of tensor. All static dimensions will
be returned as python integers, and dynamic dimensions will be returned
as tf.Tensor scalars.
"""
if name is None:
name = tensor.name
if expected_rank is not None:
assert_rank(tensor, expected_rank, name)
shape = tensor.shape.as_list()
non_static_indexes = []
for (index, dim) in enumerate(shape):
if dim is None:
non_static_indexes.append(index)
if not non_static_indexes:
return shape
dyn_shape = tf.shape(tensor)
for index in non_static_indexes:
shape[index] = dyn_shape[index]
return shape
def reshape_to_matrix(input_tensor):
"""Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
ndims = input_tensor.shape.ndims
if ndims < 2:
raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
(input_tensor.shape))
if ndims == 2:
return input_tensor
width = input_tensor.shape[-1]
output_tensor = tf.reshape(input_tensor, [-1, width])
return output_tensor
def reshape_from_matrix(output_tensor, orig_shape_list):
"""Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
if len(orig_shape_list) == 2:
return output_tensor
output_shape = get_shape_list(output_tensor)
orig_dims = orig_shape_list[0:-1]
width = output_shape[-1]
return tf.reshape(output_tensor, orig_dims + [width])
def assert_rank(tensor, expected_rank, name=None):
"""Raises an exception if the tensor rank is not of the expected rank.
Args:
tensor: A tf.Tensor to check the rank of.
expected_rank: Python integer or list of integers, expected rank.
name: Optional name of the tensor for the error message.
Raises:
ValueError: If the expected shape doesn't match the actual shape.
"""
if name is None:
name = tensor.name
expected_rank_dict = {}
if isinstance(expected_rank, six.integer_types):
expected_rank_dict[expected_rank] = True
else:
for x in expected_rank:
expected_rank_dict[x] = True
actual_rank = tensor.shape.ndims
if actual_rank not in expected_rank_dict:
scope_name = tf.get_variable_scope().name
raise ValueError(
"For the tensor `%s` in scope `%s`, the actual rank "
"`%d` (shape = %s) is not equal to the expected rank `%s`" %
(name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
def prelln_transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
intermediate_act_fn=gelu,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False,
shared_type='all', # None,
adapter_fn=None):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = int(hidden_size / num_attention_heads)
input_shape = bert_utils.get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
prev_output = bert_utils.reshape_to_matrix(input_tensor)
all_layer_outputs = []
def layer_scope(idx, shared_type):
if shared_type == 'all':
tmp = {
"layer":"layer_shared",
'attention':'attention',
'intermediate':'intermediate',
'output':'output'
}
elif shared_type == 'attention':
tmp = {
"layer":"layer_shared",
'attention':'attention',
'intermediate':'intermediate_{}'.format(idx),
'output':'output_{}'.format(idx)
}
elif shared_type == 'ffn':
tmp = {
"layer":"layer_shared",
'attention':'attention_{}'.format(idx),
'intermediate':'intermediate',
'output':'output'
}
else:
tmp = {
"layer":"layer_{}".format(idx),
'attention':'attention',
'intermediate':'intermediate',
'output':'output'
}
return tmp
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
idx_scope = layer_scope(layer_idx, shared_type)
with tf.variable_scope(idx_scope['layer'], reuse=tf.AUTO_REUSE):
layer_input = prev_output
with tf.variable_scope(idx_scope['attention'], reuse=tf.AUTO_REUSE):
attention_heads = []
with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
layer_input_pre = layer_norm(layer_input)
with tf.variable_scope("self"):
attention_head = attention_layer(
from_tensor=layer_input_pre,
to_tensor=layer_input_pre,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
# attention_output = layer_norm(attention_output + layer_input)
attention_output = attention_output + layer_input
with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
attention_output_pre = layer_norm(attention_output)
# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope(idx_scope['intermediate'], reuse=tf.AUTO_REUSE):
intermediate_output = tf.layers.dense(
attention_output_pre,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
layer_output = dropout(layer_output, hidden_dropout_prob)
# layer_output = layer_norm(layer_output + attention_output)
layer_output = layer_output + attention_output
prev_output = layer_output
all_layer_outputs.append(layer_output)
if do_return_all_layers:
final_outputs = []
for layer_output in all_layer_outputs:
final_output = bert_utils.reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
final_output = bert_utils.reshape_from_matrix(prev_output, input_shape)
return final_output
================================================
FILE: modeling_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""The main ALBERT model and related functions.
For a description of the algorithm, see https://arxiv.org/abs/1909.11942.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import copy
import json
import math
import re
import numpy as np
import six
from six.moves import range
import tensorflow as tf
class AlbertConfig(object):
"""Configuration for `AlbertModel`.
The default settings match the configuration of model `albert_xxlarge`.
"""
def __init__(self,
vocab_size,
embedding_size=128,
hidden_size=4096,
num_hidden_layers=12,
num_hidden_groups=1,
num_attention_heads=64,
intermediate_size=16384,
inner_group_num=1,
down_scale_factor=1,
hidden_act="gelu",
hidden_dropout_prob=0,
attention_probs_dropout_prob=0,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02):
"""Constructs AlbertConfig.
Args:
vocab_size: Vocabulary size of `inputs_ids` in `AlbertModel`.
embedding_size: size of voc embeddings.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_hidden_groups: Number of group for the hidden layers, parameters in
the same group are shared.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
inner_group_num: int, number of inner repetition of attention and ffn.
down_scale_factor: float, the scale to apply
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`AlbertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_hidden_groups = num_hidden_groups
self.num_attention_heads = num_attention_heads
self.inner_group_num = inner_group_num
self.down_scale_factor = down_scale_factor
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
@classmethod
def from_dict(cls, json_object):
"""Constructs a `AlbertConfig` from a Python dictionary of parameters."""
config = AlbertConfig(vocab_size=None)
for (key, value) in six.iteritems(json_object):
config.__dict__[key] = value
return config
@classmethod
def from_json_file(cls, json_file):
"""Constructs a `AlbertConfig` from a json file of parameters."""
with tf.gfile.GFile(json_file, "r") as reader:
text = reader.read()
return cls.from_dict(json.loads(text))
def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output
def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
class AlbertModel(object):
"""BERT model ("Bidirectional Encoder Representations from Transformers").
Example usage:
```python
# Already been converted from strings into ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.AlbertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.AlbertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
...
```
"""
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for AlbertModel.
Args:
config: `AlbertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".
Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
if not is_training:
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
(self.word_embedding_output,
self.output_embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.embedding_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.word_embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=input_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_hidden_groups=config.num_hidden_groups,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
inner_group_num=config.inner_group_num,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
def get_pooled_output(self):
return self.pooled_output
def get_sequence_output(self):
"""Gets final hidden layer of encoder.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output
def get_all_encoder_layers(self):
return self.all_encoder_layers
def get_word_embedding_output(self):
"""Get output of the word(piece) embedding lookup.
This is BEFORE positional embeddings and token type embeddings have been
added.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the word(piece) embedding layer.
"""
return self.word_embedding_output
def get_embedding_output(self):
"""Gets output of the embedding lookup (i.e., input to the transformer).
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the embedding layer, after summing the word
embeddings with the positional embeddings and the token type embeddings,
then performing layer normalization. This is the input to the transformer.
"""
return self.embedding_output
def get_embedding_table(self):
return self.output_embedding_table
def gelu(x):
"""Gaussian Error Linear Unit.
This is a smoother version of the RELU.
Original paper: https://arxiv.org/abs/1606.08415
Args:
x: float Tensor to perform activation.
Returns:
`x` with the GELU activation applied.
"""
cdf = 0.5 * (1.0 + tf.tanh(
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
return x * cdf
def get_activation(activation_string):
"""Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
Args:
activation_string: String name of the activation function.
Returns:
A Python function corresponding to the activation function. If
`activation_string` is None, empty, or "linear", this will return None.
If `activation_string` is not a string, it will return `activation_string`.
Raises:
ValueError: The `activation_string` does not correspond to a known
activation.
"""
# We assume that anything that"s not a string is already an activation
# function, so we just return it.
if not isinstance(activation_string, six.string_types):
return activation_string
if not activation_string:
return None
act = activation_string.lower()
if act == "linear":
return None
elif act == "relu":
return tf.nn.relu
elif act == "gelu":
return gelu
elif act == "tanh":
return tf.tanh
else:
raise ValueError("Unsupported activation: %s" % act)
def get_assignment_map_from_checkpoint(tvars, init_checkpoint, num_of_group=0):
"""Compute the union of the current variables and checkpoint variables."""
assignment_map = {}
initialized_variable_names = {}
name_to_variable = collections.OrderedDict()
for var in tvars:
name = var.name
m = re.match("^(.*):\\d+$", name)
if m is not None:
name = m.group(1)
name_to_variable[name] = var
init_vars = tf.train.list_variables(init_checkpoint)
init_vars_name = [name for (name, _) in init_vars]
if num_of_group > 0:
assignment_map = []
for gid in range(num_of_group):
assignment_map.append(collections.OrderedDict())
else:
assignment_map = collections.OrderedDict()
for name in name_to_variable:
if name in init_vars_name:
tvar_name = name
elif (re.sub(r"/group_\d+/", "/group_0/",
six.ensure_str(name)) in init_vars_name and
num_of_group > 1):
tvar_name = re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name))
elif (re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name))
in init_vars_name and num_of_group > 1):
tvar_name = re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name))
elif (re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name))
in init_vars_name and num_of_group > 1):
tvar_name = re.sub(r"/attention_\d+/", "/attention_1/",
six.ensure_str(name))
else:
tf.logging.info("name %s does not get matched", name)
continue
tf.logging.info("name %s match to %s", name, tvar_name)
if num_of_group > 0:
group_matched = False
for gid in range(1, num_of_group):
if (("/group_" + str(gid) + "/" in name) or
("/ffn_" + str(gid) + "/" in name) or
("/attention_" + str(gid) + "/" in name)):
group_matched = True
tf.logging.info("%s belongs to %dth", name, gid)
assignment_map[gid][tvar_name] = name
if not group_matched:
assignment_map[0][tvar_name] = name
else:
assignment_map[tvar_name] = name
initialized_variable_names[name] = 1
initialized_variable_names[six.ensure_str(name) + ":0"] = 1
return (assignment_map, initialized_variable_names)
def dropout(input_tensor, dropout_prob):
"""Perform dropout.
Args:
input_tensor: float Tensor.
dropout_prob: Python float. The probability of dropping out a value (NOT of
*keeping* a dimension as in `tf.nn.dropout`).
Returns:
A version of `input_tensor` with dropout applied.
"""
if dropout_prob is None or dropout_prob == 0.0:
return input_tensor
output = tf.nn.dropout(input_tensor, rate=dropout_prob)
return output
def layer_norm(input_tensor, name=None):
"""Run layer normalization on the last dimension of the tensor."""
return tf.contrib.layers.layer_norm(
inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
"""Runs layer normalization followed by dropout."""
output_tensor = layer_norm(input_tensor, name)
output_tensor = dropout(output_tensor, dropout_prob)
return output_tensor
def create_initializer(initializer_range=0.02):
"""Creates a `truncated_normal_initializer` with the given range."""
return tf.truncated_normal_initializer(stddev=initializer_range)
def get_timing_signal_1d_given_position(channels,
position,
min_timescale=1.0,
max_timescale=1.0e4):
"""Get sinusoids of diff frequencies, with timing position given.
Adapted from add_timing_signal_1d_given_position in
//third_party/py/tensor2tensor/layers/common_attention.py
Args:
channels: scalar, size of timing embeddings to create. The number of
different timescales is equal to channels / 2.
position: a Tensor with shape [batch, seq_len]
min_timescale: a float
max_timescale: a float
Returns:
a Tensor of timing signals [batch, seq_len, channels]
"""
num_timescales = channels // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
scaled_time = (
tf.expand_dims(tf.to_float(position), 2) * tf.expand_dims(
tf.expand_dims(inv_timescales, 0), 0))
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=2)
signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(channels, 2)]])
return signal
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.nn.embedding_lookup()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
if use_one_hot_embeddings:
flat_input_ids = tf.reshape(input_ids, [-1])
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
output = tf.nn.embedding_lookup(embedding_table, input_ids)
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
output += position_embeddings
output = layer_norm_and_dropout(output, dropout_prob)
return output
def dense_layer_3d(input_tensor,
num_attention_heads,
head_size,
initializer,
activation,
name=None):
"""A dense layer with 3D kernel.
Args:
input_tensor: float Tensor of shape [batch, seq_length, hidden_size].
num_attention_heads: Number of attention heads.
head_size: The size per attention head.
initializer: Kernel initializer.
activation: Actication function.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
input_shape = get_shape_list(input_tensor)
hidden_size = input_shape[2]
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size, num_attention_heads * head_size],
initializer=initializer)
w = tf.reshape(w, [hidden_size, num_attention_heads, head_size])
b = tf.get_variable(
name="bias",
shape=[num_attention_heads * head_size],
initializer=tf.zeros_initializer)
b = tf.reshape(b, [num_attention_heads, head_size])
ret = tf.einsum("BFH,HND->BFND", input_tensor, w)
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dense_layer_3d_proj(input_tensor,
hidden_size,
head_size,
initializer,
activation,
name=None):
"""A dense layer with 3D kernel for projection.
Args:
input_tensor: float Tensor of shape [batch,from_seq_length,
num_attention_heads, size_per_head].
hidden_size: The size of hidden layer.
num_attention_heads: The size of output dimension.
head_size: The size of head.
initializer: Kernel initializer.
activation: Actication function.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
input_shape = get_shape_list(input_tensor)
num_attention_heads= input_shape[2]
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[num_attention_heads * head_size, hidden_size],
initializer=initializer)
w = tf.reshape(w, [num_attention_heads, head_size, hidden_size])
b = tf.get_variable(
name="bias", shape=[hidden_size], initializer=tf.zeros_initializer)
ret = tf.einsum("BFND,NDH->BFH", input_tensor, w)
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dense_layer_2d(input_tensor,
output_size,
initializer,
activation,
num_attention_heads=1,
name=None):
"""A dense layer with 2D kernel.
Args:
input_tensor: Float tensor with rank 3.
output_size: The size of output dimension.
initializer: Kernel initializer.
activation: Activation function.
num_attention_heads: number of attention head in attention layer.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
del num_attention_heads # unused
input_shape = get_shape_list(input_tensor)
hidden_size = input_shape[2]
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size, output_size],
initializer=initializer)
b = tf.get_variable(
name="bias", shape=[output_size], initializer=tf.zeros_initializer)
ret = tf.einsum("BFH,HO->BFO", input_tensor, w)
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dot_product_attention(q, k, v, bias, dropout_rate=0.0):
"""Dot-product attention.
Args:
q: Tensor with shape [..., length_q, depth_k].
k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must
match with q.
v: Tensor with shape [..., length_kv, depth_v] Leading dimensions must
match with q.
bias: bias Tensor (see attention_bias())
dropout_rate: a float.
Returns:
Tensor with shape [..., length_q, depth_v].
"""
logits = tf.matmul(q, k, transpose_b=True) # [..., length_q, length_kv]
logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1])))
if bias is not None:
# `attention_mask` = [B, T]
from_shape = get_shape_list(q)
if len(from_shape) == 4:
broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32)
elif len(from_shape) == 5:
# from_shape = [B, N, Block_num, block_size, depth]#
broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], from_shape[3],
1], tf.float32)
bias = tf.matmul(broadcast_ones,
tf.cast(bias, tf.float32), transpose_b=True)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - bias) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
logits += adder
else:
adder = 0.0
attention_probs = tf.nn.softmax(logits, name="attention_probs")
attention_probs = dropout(attention_probs, dropout_rate)
return tf.matmul(attention_probs, v)
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length, num_attention_heads,
size_per_head].
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
size_per_head = int(from_shape[2]/num_attention_heads)
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
# `query_layer` = [B, F, N, H]
q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), query_act, "query")
# `key_layer` = [B, T, N, H]
k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), key_act, "key")
# `value_layer` = [B, T, N, H]
v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), value_act, "value")
q = tf.transpose(q, [0, 2, 1, 3])
k = tf.transpose(k, [0, 2, 1, 3])
v = tf.transpose(v, [0, 2, 1, 3])
if attention_mask is not None:
attention_mask = tf.reshape(
attention_mask, [batch_size, 1, to_seq_length, 1])
# 'new_embeddings = [B, N, F, H]'
new_embeddings = dot_product_attention(q, k, v, attention_mask,
attention_probs_dropout_prob)
return tf.transpose(new_embeddings, [0, 2, 1, 3])
def attention_ffn_block(layer_input,
hidden_size=768,
attention_mask=None,
num_attention_heads=1,
attention_head_size=64,
attention_probs_dropout_prob=0.0,
intermediate_size=3072,
intermediate_act_fn=None,
initializer_range=0.02,
hidden_dropout_prob=0.0):
"""A network with attention-ffn as sub-block.
Args:
layer_input: float Tensor of shape [batch_size, from_seq_length,
from_width].
hidden_size: (optional) int, size of hidden layer.
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
attention_head_size: int. Size of attention head.
attention_probs_dropout_prob: float. dropout probability for attention_layer
intermediate_size: int. Size of intermediate hidden layer.
intermediate_act_fn: (optional) Activation function for the intermediate
layer.
initializer_range: float. Range of the weight initializer.
hidden_dropout_prob: (optional) float. Dropout probability of the hidden
layer.
Returns:
layer output
"""
with tf.variable_scope("attention_1"):
with tf.variable_scope("self"):
attention_output = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
attention_output = dense_layer_3d_proj(
attention_output,
hidden_size,
attention_head_size,
create_initializer(initializer_range),
None,
name="dense")
attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
with tf.variable_scope("ffn_1"):
with tf.variable_scope("intermediate"):
intermediate_output = dense_layer_2d(
attention_output,
intermediate_size,
create_initializer(initializer_range),
intermediate_act_fn,
num_attention_heads=num_attention_heads,
name="dense")
with tf.variable_scope("output"):
ffn_output = dense_layer_2d(
intermediate_output,
hidden_size,
create_initializer(initializer_range),
None,
num_attention_heads=num_attention_heads,
name="dense")
ffn_output = dropout(ffn_output, hidden_dropout_prob)
ffn_output = layer_norm(ffn_output + attention_output)
return ffn_output
def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_hidden_groups=12,
num_attention_heads=12,
intermediate_size=3072,
inner_group_num=1,
intermediate_act_fn="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_hidden_groups: int. Number of group for the hidden layers, parameters
in the same group are shared.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
inner_group_num: int, number of inner repetition of attention and ffn.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = hidden_size // num_attention_heads
input_shape = get_shape_list(input_tensor, expected_rank=3)
input_width = input_shape[2]
all_layer_outputs = []
if input_width != hidden_size:
prev_output = dense_layer_2d(
input_tensor, hidden_size, create_initializer(initializer_range),
None, name="embedding_hidden_mapping_in")
else:
prev_output = input_tensor
with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE):
for layer_idx in range(num_hidden_layers):
group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups)
with tf.variable_scope("group_%d" % group_idx):
with tf.name_scope("layer_%d" % layer_idx):
layer_output = prev_output
for inner_group_idx in range(inner_group_num):
with tf.variable_scope("inner_group_%d" % inner_group_idx):
layer_output = attention_ffn_block(
layer_output, hidden_size, attention_mask,
num_attention_heads, attention_head_size,
attention_probs_dropout_prob, intermediate_size,
intermediate_act_fn, initializer_range, hidden_dropout_prob)
prev_output = layer_output
all_layer_outputs.append(layer_output)
if do_return_all_layers:
return all_layer_outputs
else:
return all_layer_outputs[-1]
def get_shape_list(tensor, expected_rank=None, name=None):
"""Returns a list of the shape of tensor, preferring static dimensions.
Args:
tensor: A tf.Tensor object to find the shape of.
expected_rank: (optional) int. The expected rank of `tensor`. If this is
specified and the `tensor` has a different rank, and exception will be
thrown.
name: Optional name of the tensor for the error message.
Returns:
A list of dimensions of the shape of tensor. All static dimensions will
be returned as python integers, and dynamic dimensions will be returned
as tf.Tensor scalars.
"""
if name is None:
name = tensor.name
if expected_rank is not None:
assert_rank(tensor, expected_rank, name)
shape = tensor.shape.as_list()
non_static_indexes = []
for (index, dim) in enumerate(shape):
if dim is None:
non_static_indexes.append(index)
if not non_static_indexes:
return shape
dyn_shape = tf.shape(tensor)
for index in non_static_indexes:
shape[index] = dyn_shape[index]
return shape
def reshape_to_matrix(input_tensor):
"""Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
ndims = input_tensor.shape.ndims
if ndims < 2:
raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
(input_tensor.shape))
if ndims == 2:
return input_tensor
width = input_tensor.shape[-1]
output_tensor = tf.reshape(input_tensor, [-1, width])
return output_tensor
def reshape_from_matrix(output_tensor, orig_shape_list):
"""Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
if len(orig_shape_list) == 2:
return output_tensor
output_shape = get_shape_list(output_tensor)
orig_dims = orig_shape_list[0:-1]
width = output_shape[-1]
return tf.reshape(output_tensor, orig_dims + [width])
def assert_rank(tensor, expected_rank, name=None):
"""Raises an exception if the tensor rank is not of the expected rank.
Args:
tensor: A tf.Tensor to check the rank of.
expected_rank: Python integer or list of integers, expected rank.
name: Optional name of the tensor for the error message.
Raises:
ValueError: If the expected shape doesn't match the actual shape.
"""
if name is None:
name = tensor.name
expected_rank_dict = {}
if isinstance(expected_rank, six.integer_types):
expected_rank_dict[expected_rank] = True
else:
for x in expected_rank:
expected_rank_dict[x] = True
actual_rank = tensor.shape.ndims
if actual_rank not in expected_rank_dict:
scope_name = tf.get_variable_scope().name
raise ValueError(
"For the tensor `%s` in scope `%s`, the actual rank "
"`%d` (shape = %s) is not equal to the expected rank `%s`" %
(name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
================================================
FILE: modeling_google_fast.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""The main ALBERT model and related functions.
For a description of the algorithm, see https://arxiv.org/abs/1909.11942.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import copy
import json
import math
import re
import numpy as np
import six
from six.moves import range
import tensorflow as tf
class AlbertConfig(object):
"""Configuration for `AlbertModel`.
The default settings match the configuration of model `albert_xxlarge`.
"""
def __init__(self,
vocab_size,
embedding_size=128,
hidden_size=4096,
num_hidden_layers=12,
num_hidden_groups=1,
num_attention_heads=64,
intermediate_size=16384,
inner_group_num=1,
down_scale_factor=1,
hidden_act="gelu",
hidden_dropout_prob=0,
attention_probs_dropout_prob=0,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02):
"""Constructs AlbertConfig.
Args:
vocab_size: Vocabulary size of `inputs_ids` in `AlbertModel`.
embedding_size: size of voc embeddings.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_hidden_groups: Number of group for the hidden layers, parameters in
the same group are shared.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
inner_group_num: int, number of inner repetition of attention and ffn.
down_scale_factor: float, the scale to apply
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`AlbertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_hidden_groups = num_hidden_groups
self.num_attention_heads = num_attention_heads
self.inner_group_num = inner_group_num
self.down_scale_factor = down_scale_factor
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
@classmethod
def from_dict(cls, json_object):
"""Constructs a `AlbertConfig` from a Python dictionary of parameters."""
config = AlbertConfig(vocab_size=None)
for (key, value) in six.iteritems(json_object):
config.__dict__[key] = value
return config
@classmethod
def from_json_file(cls, json_file):
"""Constructs a `AlbertConfig` from a json file of parameters."""
with tf.gfile.GFile(json_file, "r") as reader:
text = reader.read()
return cls.from_dict(json.loads(text))
def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output
def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
class AlbertModel(object):
"""BERT model ("Bidirectional Encoder Representations from Transformers").
Example usage:
```python
# Already been converted from strings into ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.AlbertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.AlbertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
...
```
"""
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for AlbertModel.
Args:
config: `AlbertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".
Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
if not is_training:
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
(self.word_embedding_output,
self.output_embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.embedding_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.word_embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=input_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_hidden_groups=config.num_hidden_groups,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
inner_group_num=config.inner_group_num,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
def get_pooled_output(self):
return self.pooled_output
def get_sequence_output(self):
"""Gets final hidden layer of encoder.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output
def get_all_encoder_layers(self):
return self.all_encoder_layers
def get_word_embedding_output(self):
"""Get output of the word(piece) embedding lookup.
This is BEFORE positional embeddings and token type embeddings have been
added.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the word(piece) embedding layer.
"""
return self.word_embedding_output
def get_embedding_output(self):
"""Gets output of the embedding lookup (i.e., input to the transformer).
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the embedding layer, after summing the word
embeddings with the positional embeddings and the token type embeddings,
then performing layer normalization. This is the input to the transformer.
"""
return self.embedding_output
def get_embedding_table(self):
return self.output_embedding_table
def gelu(x):
"""Gaussian Error Linear Unit.
This is a smoother version of the RELU.
Original paper: https://arxiv.org/abs/1606.08415
Args:
x: float Tensor to perform activation.
Returns:
`x` with the GELU activation applied.
"""
cdf = 0.5 * (1.0 + tf.tanh(
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
return x * cdf
def get_activation(activation_string):
"""Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
Args:
activation_string: String name of the activation function.
Returns:
A Python function corresponding to the activation function. If
`activation_string` is None, empty, or "linear", this will return None.
If `activation_string` is not a string, it will return `activation_string`.
Raises:
ValueError: The `activation_string` does not correspond to a known
activation.
"""
# We assume that anything that"s not a string is already an activation
# function, so we just return it.
if not isinstance(activation_string, six.string_types):
return activation_string
if not activation_string:
return None
act = activation_string.lower()
if act == "linear":
return None
elif act == "relu":
return tf.nn.relu
elif act == "gelu":
return gelu
elif act == "tanh":
return tf.tanh
elif act == "swish":
return lambda x: x * tf.sigmoid(x)
else:
raise ValueError("Unsupported activation: %s" % act)
def get_assignment_map_from_checkpoint(tvars, init_checkpoint, num_of_group=0):
"""Compute the union of the current variables and checkpoint variables."""
assignment_map = {}
initialized_variable_names = {}
name_to_variable = collections.OrderedDict()
for var in tvars:
name = var.name
m = re.match("^(.*):\\d+$", name)
if m is not None:
name = m.group(1)
name_to_variable[name] = var
init_vars = tf.train.list_variables(init_checkpoint)
init_vars_name = [name for (name, _) in init_vars]
if num_of_group > 0:
assignment_map = []
for gid in range(num_of_group):
assignment_map.append(collections.OrderedDict())
else:
assignment_map = collections.OrderedDict()
for name in name_to_variable:
if name in init_vars_name:
tvar_name = name
elif (re.sub(r"/group_\d+/", "/group_0/",
six.ensure_str(name)) in init_vars_name and
num_of_group > 1):
tvar_name = re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name))
elif (re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name))
in init_vars_name and num_of_group > 1):
tvar_name = re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name))
elif (re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name))
in init_vars_name and num_of_group > 1):
tvar_name = re.sub(r"/attention_\d+/", "/attention_1/",
six.ensure_str(name))
else:
tf.logging.info("name %s does not get matched", name)
continue
tf.logging.info("name %s match to %s", name, tvar_name)
if num_of_group > 0:
group_matched = False
for gid in range(1, num_of_group):
if (("/group_" + str(gid) + "/" in name) or
("/ffn_" + str(gid) + "/" in name) or
("/attention_" + str(gid) + "/" in name)):
group_matched = True
tf.logging.info("%s belongs to %dth", name, gid)
assignment_map[gid][tvar_name] = name
if not group_matched:
assignment_map[0][tvar_name] = name
else:
assignment_map[tvar_name] = name
initialized_variable_names[name] = 1
initialized_variable_names[six.ensure_str(name) + ":0"] = 1
return (assignment_map, initialized_variable_names)
def dropout(input_tensor, dropout_prob):
"""Perform dropout.
Args:
input_tensor: float Tensor.
dropout_prob: Python float. The probability of dropping out a value (NOT of
*keeping* a dimension as in `tf.nn.dropout`).
Returns:
A version of `input_tensor` with dropout applied.
"""
if dropout_prob is None or dropout_prob == 0.0:
return input_tensor
output = tf.nn.dropout(input_tensor, rate=dropout_prob)
return output
def layer_norm(input_tensor, name=None):
"""Run layer normalization on the last dimension of the tensor."""
return tf.contrib.layers.layer_norm(
inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
"""Runs layer normalization followed by dropout."""
output_tensor = layer_norm(input_tensor, name)
output_tensor = dropout(output_tensor, dropout_prob)
return output_tensor
def create_initializer(initializer_range=0.02):
"""Creates a `truncated_normal_initializer` with the given range."""
return tf.truncated_normal_initializer(stddev=initializer_range)
def get_timing_signal_1d_given_position(channels,
position,
min_timescale=1.0,
max_timescale=1.0e4):
"""Get sinusoids of diff frequencies, with timing position given.
Adapted from add_timing_signal_1d_given_position in
//third_party/py/tensor2tensor/layers/common_attention.py
Args:
channels: scalar, size of timing embeddings to create. The number of
different timescales is equal to channels / 2.
position: a Tensor with shape [batch, seq_len]
min_timescale: a float
max_timescale: a float
Returns:
a Tensor of timing signals [batch, seq_len, channels]
"""
num_timescales = channels // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
scaled_time = (
tf.expand_dims(tf.to_float(position), 2) * tf.expand_dims(
tf.expand_dims(inv_timescales, 0), 0))
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=2)
signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(channels, 2)]])
return signal
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.nn.embedding_lookup()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
if use_one_hot_embeddings:
flat_input_ids = tf.reshape(input_ids, [-1])
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
output = tf.nn.embedding_lookup(embedding_table, input_ids)
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
output += position_embeddings
output = layer_norm_and_dropout(output, dropout_prob)
return output
def dense_layer_3d(input_tensor,
num_attention_heads,
head_size,
initializer,
activation,
name=None):
"""A dense layer with 3D kernel.
Args:
input_tensor: float Tensor of shape [batch, seq_length, hidden_size].
num_attention_heads: Number of attention heads.
head_size: The size per attention head.
initializer: Kernel initializer.
activation: Actication function.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
input_shape = get_shape_list(input_tensor)
hidden_size = input_shape[2]
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size, num_attention_heads * head_size],
initializer=initializer)
w = tf.reshape(w, [hidden_size, num_attention_heads, head_size])
b = tf.get_variable(
name="bias",
shape=[num_attention_heads * head_size],
initializer=tf.zeros_initializer)
b = tf.reshape(b, [num_attention_heads, head_size])
ret = tf.einsum("BFH,HND->BFND", input_tensor, w)
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dense_layer_3d_proj(input_tensor,
hidden_size,
head_size,
initializer,
activation,
name=None):
"""A dense layer with 3D kernel for projection.
Args:
input_tensor: float Tensor of shape [batch,from_seq_length,
num_attention_heads, size_per_head].
hidden_size: The size of hidden layer.
num_attention_heads: The size of output dimension.
head_size: The size of head.
initializer: Kernel initializer.
activation: Actication function.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
input_shape = get_shape_list(input_tensor)
num_attention_heads= input_shape[2]
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[num_attention_heads * head_size, hidden_size],
initializer=initializer)
w = tf.reshape(w, [num_attention_heads, head_size, hidden_size])
b = tf.get_variable(
name="bias", shape=[hidden_size], initializer=tf.zeros_initializer)
ret = tf.einsum("BFND,NDH->BFH", input_tensor, w)
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dense_layer_2d(input_tensor,
output_size,
initializer,
activation,
num_attention_heads=1,
name=None,
num_groups=1):
"""A dense layer with 2D kernel.
Args:
input_tensor: Float tensor with rank 3.
output_size: The size of output dimension.
initializer: Kernel initializer.
activation: Activation function.
num_groups: number of groups in dense layer
num_attention_heads: number of attention head in attention layer.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
del num_attention_heads # unused
input_shape = get_shape_list(input_tensor)
hidden_size = input_shape[2]
if num_groups == 1:
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size, output_size],
initializer=initializer)
b = tf.get_variable(
name="bias", shape=[output_size], initializer=tf.zeros_initializer)
ret = tf.einsum("BFH,HO->BFO", input_tensor, w)
ret += b
else:
assert hidden_size % num_groups == 0
assert output_size % num_groups == 0
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size//num_groups, output_size//num_groups, num_groups],
initializer=initializer)
b = tf.get_variable(
name="bias", shape=[output_size], initializer=tf.zeros_initializer)
input_tensor = tf.reshape(input_tensor, input_shape[:2] + [hidden_size//num_groups, num_groups])
ret = tf.einsum("BFHG,HOG->BFGO", input_tensor, w)
ret = tf.reshape(ret, input_shape[:2] + [output_size])
ret += b
if activation is not None:
return activation(ret)
else:
return ret
def dense_layer_2d_old(input_tensor,
output_size,
initializer,
activation,
num_attention_heads=1,
name=None,
num_groups=1):
"""A dense layer with 2D kernel. 添加分组全连接的方式
Args:
input_tensor: Float tensor with rank 3. [ batch_size,sequence_length, hidden_size]
output_size: The size of output dimension.
initializer: Kernel initializer.
activation: Activation function.
num_groups: number of groups in dense layer
num_attention_heads: number of attention head in attention layer.
name: The name scope of this layer.
Returns:
float logits Tensor.
"""
del num_attention_heads # unused
input_shape = get_shape_list(input_tensor)
# print("#dense_layer_2d.1.input_shape of input_tensor:",input_shape) # e.g. [2, 512, 768] = [ batch_size,sequence_length, hidden_size]
hidden_size = input_shape[2]
if num_groups == 1:
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[hidden_size, output_size],
initializer=initializer)
b = tf.get_variable(
name="bias", shape=[output_size], initializer=tf.zeros_initializer)
ret = tf.einsum("BFH,HO->BFO", input_tensor, w)
ret += b
else: # e.g. input_shape = [2, 512, 768] = [ batch_size,sequence_length, hidden_size]
assert hidden_size % num_groups == 0
assert output_size % num_groups == 0
# print("#dense_layer_2d.output_size:",output_size,";hidden_size:",hidden_size) # output_size = 3072; hidden_size = 768
with tf.variable_scope(name):
w = tf.get_variable(
name="kernel",
shape=[num_groups, hidden_size//num_groups, output_size//num_groups],
initializer=initializer)
# print("#dense_layer_2d.2'w:",w.shape) # (16, 48, 192)
b = tf.get_variable(
name="bias", shape=[num_groups, output_size//num_groups], initializer=tf.zeros_initializer)
# input_tensor = [ batch_size,sequence_length, hidden_size].
# input_shape[:2] + [hidden_size//num_groups, num_groups] = [batch_size, sequence_length, hidden_size/num_groups, num_groups]
input_tensor = tf.reshape(input_tensor, input_shape[:2] + [hidden_size//num_groups, num_groups])
# print("#dense_layer_2d.2.input_shape of input_tensor:", input_tensor.shape)
input_tensor = tf.transpose(input_tensor, [3, 0, 1, 2]) # [num_groups, batch_size, sequence_length, hidden_size/num_groups]
# print("#dense_layer_2d.3.input_shape of input_tensor:", input_tensor.shape) # input_tensor=(16, 2, 512, 192)
# input_tensor=[num_groups, batch_size, sequence_length, hidden_size/num_groups], w=[num_groups, hidden_size/num_groups, output_size/num_groups]
ret = tf.einsum("GBFH,GHO->GBFO", input_tensor, w)
# print("#dense_layer_2d.4. shape of ret:", ret.shape) # (16, 2, 512, 48) = [num_groups, batch_size, sequence_length ,output_size]
b = tf.expand_dims(b, 1)
b = tf.expand_dims(b, 1)
# print("#dense_layer_2d.4.2.b:",b.shape) # (16, 1, 1, 48)
ret += b
ret = tf.transpose(ret, [1, 2, 0, 3]) # (2, 512, 16, 48)
# print("#dense_layer_2d.5. shape of ret:", ret.shape)
ret = tf.reshape(ret, input_shape[:2] + [output_size]) # [2, 512, 768]
if activation is not None:
return activation(ret)
else:
return ret
def dot_product_attention(q, k, v, bias, dropout_rate=0.0):
"""Dot-product attention.
Args:
q: Tensor with shape [..., length_q, depth_k].
k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must
match with q.
v: Tensor with shape [..., length_kv, depth_v] Leading dimensions must
match with q.
bias: bias Tensor (see attention_bias())
dropout_rate: a float.
Returns:
Tensor with shape [..., length_q, depth_v].
"""
logits = tf.matmul(q, k, transpose_b=True) # [..., length_q, length_kv]
logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1])))
if bias is not None:
# `attention_mask` = [B, T]
from_shape = get_shape_list(q)
if len(from_shape) == 4:
broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32)
elif len(from_shape) == 5:
# from_shape = [B, N, Block_num, block_size, depth]#
broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], from_shape[3],
1], tf.float32)
bias = tf.matmul(broadcast_ones,
tf.cast(bias, tf.float32), transpose_b=True)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - bias) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
logits += adder
else:
adder = 0.0
attention_probs = tf.nn.softmax(logits, name="attention_probs")
attention_probs = dropout(attention_probs, dropout_rate)
return tf.matmul(attention_probs, v)
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length, num_attention_heads,
size_per_head].
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
size_per_head = int(from_shape[2]/num_attention_heads)
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
# `query_layer` = [B, F, N, H]
q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), query_act, "query")
# `key_layer` = [B, T, N, H]
k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), key_act, "key")
# `value_layer` = [B, T, N, H]
v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,
create_initializer(initializer_range), value_act, "value")
q = tf.transpose(q, [0, 2, 1, 3])
k = tf.transpose(k, [0, 2, 1, 3])
v = tf.transpose(v, [0, 2, 1, 3])
if attention_mask is not None:
attention_mask = tf.reshape(
attention_mask, [batch_size, 1, to_seq_length, 1])
# 'new_embeddings = [B, N, F, H]'
new_embeddings = dot_product_attention(q, k, v, attention_mask,
attention_probs_dropout_prob)
return tf.transpose(new_embeddings, [0, 2, 1, 3])
def attention_ffn_block(layer_input,
hidden_size=768,
attention_mask=None,
num_attention_heads=1,
attention_head_size=64,
attention_probs_dropout_prob=0.0,
intermediate_size=3072,
intermediate_act_fn=None,
initializer_range=0.02,
hidden_dropout_prob=0.0):
"""A network with attention-ffn as sub-block.
Args:
layer_input: float Tensor of shape [batch_size, from_seq_length,
from_width].
hidden_size: (optional) int, size of hidden layer.
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
attention_head_size: int. Size of attention head.
attention_probs_dropout_prob: float. dropout probability for attention_layer
intermediate_size: int. Size of intermediate hidden layer.
intermediate_act_fn: (optional) Activation function for the intermediate
layer.
initializer_range: float. Range of the weight initializer.
hidden_dropout_prob: (optional) float. Dropout probability of the hidden
layer.
Returns:
layer output
"""
with tf.variable_scope("attention_1"):
with tf.variable_scope("self"):
attention_output = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
attention_output = dense_layer_3d_proj(
attention_output,
hidden_size,
attention_head_size,
create_initializer(initializer_range),
None,
name="dense")
attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
with tf.variable_scope("ffn_1"):
with tf.variable_scope("intermediate"):
intermediate_output = dense_layer_2d(
attention_output,
intermediate_size,
create_initializer(initializer_range),
intermediate_act_fn,
num_attention_heads=num_attention_heads,
name="dense",
num_groups=16)
with tf.variable_scope("output"):
ffn_output = dense_layer_2d(
intermediate_output,
hidden_size,
create_initializer(initializer_range),
None,
num_attention_heads=num_attention_heads,
name="dense",
num_groups=16)
ffn_output = dropout(ffn_output, hidden_dropout_prob)
ffn_output = layer_norm(ffn_output + attention_output)
return ffn_output
def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_hidden_groups=12,
num_attention_heads=12,
intermediate_size=3072,
inner_group_num=1,
intermediate_act_fn="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_hidden_groups: int. Number of group for the hidden layers, parameters
in the same group are shared.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
inner_group_num: int, number of inner repetition of attention and ffn.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = hidden_size // num_attention_heads
input_shape = get_shape_list(input_tensor, expected_rank=3)
input_width = input_shape[2]
all_layer_outputs = []
if input_width != hidden_size:
prev_output = dense_layer_2d(
input_tensor, hidden_size, create_initializer(initializer_range),
None, name="embedding_hidden_mapping_in")
else:
prev_output = input_tensor
with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE):
for layer_idx in range(num_hidden_layers):
group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups)
with tf.variable_scope("group_%d" % group_idx):
with tf.name_scope("layer_%d" % layer_idx):
layer_output = prev_output
for inner_group_idx in range(inner_group_num):
with tf.variable_scope("inner_group_%d" % inner_group_idx):
layer_output = attention_ffn_block(
layer_output, hidden_size, attention_mask,
num_attention_heads, attention_head_size,
attention_probs_dropout_prob, intermediate_size,
intermediate_act_fn, initializer_range, hidden_dropout_prob)
prev_output = layer_output
all_layer_outputs.append(layer_output)
if do_return_all_layers:
return all_layer_outputs
else:
return all_layer_outputs[-1]
def get_shape_list(tensor, expected_rank=None, name=None):
"""Returns a list of the shape of tensor, preferring static dimensions.
Args:
tensor: A tf.Tensor object to find the shape of.
expected_rank: (optional) int. The expected rank of `tensor`. If this is
specified and the `tensor` has a different rank, and exception will be
thrown.
name: Optional name of the tensor for the error message.
Returns:
A list of dimensions of the shape of tensor. All static dimensions will
be returned as python integers, and dynamic dimensions will be returned
as tf.Tensor scalars.
"""
if name is None:
name = tensor.name
if expected_rank is not None:
assert_rank(tensor, expected_rank, name)
shape = tensor.shape.as_list()
non_static_indexes = []
for (index, dim) in enumerate(shape):
if dim is None:
non_static_indexes.append(index)
if not non_static_indexes:
return shape
dyn_shape = tf.shape(tensor)
for index in non_static_indexes:
shape[index] = dyn_shape[index]
return shape
def reshape_to_matrix(input_tensor):
"""Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
ndims = input_tensor.shape.ndims
if ndims < 2:
raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
(input_tensor.shape))
if ndims == 2:
return input_tensor
width = input_tensor.shape[-1]
output_tensor = tf.reshape(input_tensor, [-1, width])
return output_tensor
def reshape_from_matrix(output_tensor, orig_shape_list):
"""Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
if len(orig_shape_list) == 2:
return output_tensor
output_shape = get_shape_list(output_tensor)
orig_dims = orig_shape_list[0:-1]
width = output_shape[-1]
return tf.reshape(output_tensor, orig_dims + [width])
def assert_rank(tensor, expected_rank, name=None):
"""Raises an exception if the tensor rank is not of the expected rank.
Args:
tensor: A tf.Tensor to check the rank of.
expected_rank: Python integer or list of integers, expected rank.
name: Optional name of the tensor for the error message.
Raises:
ValueError: If the expected shape doesn't match the actual shape.
"""
if name is None:
name = tensor.name
expected_rank_dict = {}
if isinstance(expected_rank, six.integer_types):
expected_rank_dict[expected_rank] = True
else:
for x in expected_rank:
expected_rank_dict[x] = True
actual_rank = tensor.shape.ndims
if actual_rank not in expected_rank_dict:
scope_name = tf.get_variable_scope().name
raise ValueError(
"For the tensor `%s` in scope `%s`, the actual rank "
"`%d` (shape = %s) is not equal to the expected rank `%s`" %
(name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
================================================
FILE: optimization.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Functions and classes related to optimization (weight updates)."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import tensorflow as tf
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
"""Creates an optimizer training op."""
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
# Implements linear decay of the learning rate.
learning_rate = tf.train.polynomial_decay(
learning_rate,
global_step,
num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
# Implements linear warmup. I.e., if global_step < num_warmup_steps, the
# learning rate will be `global_step/num_warmup_steps * init_lr`.
if num_warmup_steps:
global_steps_int = tf.cast(global_step, tf.int32)
warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
global_steps_float = tf.cast(global_steps_int, tf.float32)
warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
warmup_percent_done = global_steps_float / warmup_steps_float
warmup_learning_rate = init_lr * warmup_percent_done
is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
learning_rate = (
(1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
# It is recommended that you use this optimizer for fine tuning, since this
# is how the model was trained (note that the Adam m/v variables are NOT
# loaded from init_checkpoint.)
optimizer = LAMBOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
if use_tpu:
optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
# This is how the model was pre-trained.
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(
zip(grads, tvars), global_step=global_step)
# Normally the global step update is done inside of `apply_gradients`.
# However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
# a different optimizer, you should probably take this line out.
new_global_step = global_step + 1
train_op = tf.group(train_op, [global_step.assign(new_global_step)])
return train_op
class AdamWeightDecayOptimizer(tf.train.Optimizer):
"""A basic Adam optimizer that includes "correct" L2 weight decay."""
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="AdamWeightDecayOptimizer"):
"""Constructs a AdamWeightDecayOptimizer."""
super(AdamWeightDecayOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=param_name + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=param_name + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
update_with_lr = self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", param_name)
if m is not None:
param_name = m.group(1)
return param_name
#
class LAMBOptimizer(tf.train.Optimizer):
"""
LAMBOptimizer optimizer.
https://github.com/ymcui/LAMB_Optimizer_TF
# IMPORTANT NOTE
- This is NOT an official implementation.
- LAMB optimizer is changed from arXiv v1 ~ v3.
- We implement v3 version (which is the latest version on June, 2019.).
- Our implementation is based on `AdamWeightDecayOptimizer` in BERT (provided by Google).
# References
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. https://arxiv.org/abs/1904.00962v3
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
# Parameters
- There is nothing special, just the same as `AdamWeightDecayOptimizer`.
"""
def __init__(self,
learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="LAMBOptimizer"):
"""Constructs a LAMBOptimizer."""
super(LAMBOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=param_name + "/lamb_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=param_name + "/lamb_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
############## BELOW ARE THE SPECIFIC PARTS FOR LAMB ##############
# Note: Here are two choices for scaling function \phi(z)
# minmax: \phi(z) = min(max(z, \gamma_l), \gamma_u)
# identity: \phi(z) = z
# The authors does not mention what is \gamma_l and \gamma_u
# UPDATE: after asking authors, they provide me the code below.
# ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
# math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)
r1 = tf.sqrt(tf.reduce_sum(tf.square(param)))
r2 = tf.sqrt(tf.reduce_sum(tf.square(update)))
r = tf.where(tf.greater(r1, 0.0),
tf.where(tf.greater(r2, 0.0),
r1 / r2,
1.0),
1.0)
eta = self.learning_rate * r
update_with_lr = eta * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", param_name)
if m is not None:
param_name = m.group(1)
return param_name
================================================
FILE: optimization_finetuning.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Functions and classes related to optimization (weight updates)."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import tensorflow as tf
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
"""Creates an optimizer training op."""
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
# Implements linear decay of the learning rate.
learning_rate = tf.train.polynomial_decay(
learning_rate,
global_step,
num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
# Implements linear warmup. I.e., if global_step < num_warmup_steps, the
# learning rate will be `global_step/num_warmup_steps * init_lr`.
if num_warmup_steps:
global_steps_int = tf.cast(global_step, tf.int32)
warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
global_steps_float = tf.cast(global_steps_int, tf.float32)
warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
warmup_percent_done = global_steps_float / warmup_steps_float
warmup_learning_rate = init_lr * warmup_percent_done
is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
learning_rate = (
(1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
# It is recommended that you use this optimizer for fine tuning, since this
# is how the model was trained (note that the Adam m/v variables are NOT
# loaded from init_checkpoint.)
optimizer = AdamWeightDecayOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999, # 0.98 ONLY USED FOR PRETRAIN. MUST CHANGE AT FINE-TUNING 0.999,
epsilon=1e-6,
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
if use_tpu:
optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
# This is how the model was pre-trained.
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(
zip(grads, tvars), global_step=global_step)
# Normally the global step update is done inside of `apply_gradients`.
# However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
# a different optimizer, you should probably take this line out.
new_global_step = global_step + 1
train_op = tf.group(train_op, [global_step.assign(new_global_step)])
return train_op
class AdamWeightDecayOptimizer(tf.train.Optimizer):
"""A basic Adam optimizer that includes "correct" L2 weight decay."""
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="AdamWeightDecayOptimizer"):
"""Constructs a AdamWeightDecayOptimizer."""
super(AdamWeightDecayOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=param_name + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=param_name + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
update_with_lr = self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", param_name)
if m is not None:
param_name = m.group(1)
return param_name
================================================
FILE: optimization_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""Functions and classes related to optimization (weight updates)."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import six
from six.moves import zip
import tensorflow as tf
import lamb_optimizer_google as lamb_optimizer
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu,
optimizer="adamw", poly_power=1.0, start_warmup_step=0):
"""Creates an optimizer training op."""
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
# Implements linear decay of the learning rate.
learning_rate = tf.train.polynomial_decay(
learning_rate,
global_step,
num_train_steps,
end_learning_rate=0.0,
power=poly_power,
cycle=False)
# Implements linear warmup. I.e., if global_step - start_warmup_step <
# num_warmup_steps, the learning rate will be
# `(global_step - start_warmup_step)/num_warmup_steps * init_lr`.
if num_warmup_steps:
tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step)
+ ", for " + str(num_warmup_steps) + " steps ++++++")
global_steps_int = tf.cast(global_step, tf.int32)
start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32)
global_steps_int = global_steps_int - start_warm_int
warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
global_steps_float = tf.cast(global_steps_int, tf.float32)
warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
warmup_percent_done = global_steps_float / warmup_steps_float
warmup_learning_rate = init_lr * warmup_percent_done
is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
learning_rate = (
(1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
# It is OK that you use this optimizer for finetuning, since this
# is how the model was trained (note that the Adam m/v variables are NOT
# loaded from init_checkpoint.)
# It is OK to use AdamW in the finetuning even the model is trained by LAMB.
# As report in the Bert pulic github, the learning rate for SQuAD 1.1 finetune
# is 3e-5, 4e-5 or 5e-5. For LAMB, the users can use 3e-4, 4e-4,or 5e-4 for a
# batch size of 64 in the finetune.
if optimizer == "adamw":
tf.logging.info("using adamw")
optimizer = AdamWeightDecayOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
elif optimizer == "lamb":
tf.logging.info("using lamb")
optimizer = lamb_optimizer.LAMBOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
else:
raise ValueError("Not supported optimizer: ", optimizer)
if use_tpu:
optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
# This is how the model was pre-trained.
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(
list(zip(grads, tvars)), global_step=global_step)
# Normally the global step update is done inside of `apply_gradients`.
# However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` do this.
# But if you use a different optimizer, you should probably take this line
# out.
new_global_step = global_step + 1
train_op = tf.group(train_op, [global_step.assign(new_global_step)])
return train_op
class AdamWeightDecayOptimizer(tf.train.Optimizer):
"""A basic Adam optimizer that includes "correct" L2 weight decay."""
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="AdamWeightDecayOptimizer"):
"""Constructs a AdamWeightDecayOptimizer."""
super(AdamWeightDecayOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=six.ensure_str(param_name) + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=six.ensure_str(param_name) + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
update_with_lr = self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
if m is not None:
param_name = m.group(1)
return param_name
================================================
FILE: resources/create_pretraining_data_roberta.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Create masked LM/next sentence masked_lm TF examples for BERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import random
import re
import tokenization
import tensorflow as tf
import jieba
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string("input_file", None,
"Input raw text file (or comma-separated list of files).")
flags.DEFINE_string(
"output_file", None,
"Output TF example file (or comma-separated list of files).")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_whole_word_mask", False,
"Whether to use whole word masking rather than per-WordPiece masking.")
flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")
flags.DEFINE_integer("max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence.")
flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
flags.DEFINE_integer(
"dupe_factor", 10,
"Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")
flags.DEFINE_float(
"short_seq_prob", 0.1,
"Probability of creating sequences which are shorter than the "
"maximum length.")
class TrainingInstance(object):
"""A single training instance (sentence pair)."""
def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
is_random_next):
self.tokens = tokens
self.segment_ids = segment_ids
self.is_random_next = is_random_next
self.masked_lm_positions = masked_lm_positions
self.masked_lm_labels = masked_lm_labels
def __str__(self):
s = ""
s += "tokens: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.tokens]))
s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
s += "is_random_next: %s\n" % self.is_random_next
s += "masked_lm_positions: %s\n" % (" ".join(
[str(x) for x in self.masked_lm_positions]))
s += "masked_lm_labels: %s\n" % (" ".join(
[tokenization.printable_text(x) for x in self.masked_lm_labels]))
s += "\n"
return s
def __repr__(self):
return self.__str__()
def write_instance_to_example_files(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_files):
"""Create TF example files from `TrainingInstance`s."""
writers = []
for output_file in output_files:
writers.append(tf.python_io.TFRecordWriter(output_file))
writer_index = 0
total_written = 0
for (inst_index, instance) in enumerate(instances):
input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
assert len(input_ids) <= max_seq_length
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
# print("length of segment_ids:",len(segment_ids),"max_seq_length:", max_seq_length)
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
while len(masked_lm_positions) < max_predictions_per_seq:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
next_sentence_label = 1 if instance.is_random_next else 0
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(input_ids)
features["input_mask"] = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
features["next_sentence_labels"] = create_int_feature([next_sentence_label])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writers[writer_index].write(tf_example.SerializeToString())
writer_index = (writer_index + 1) % len(writers)
total_written += 1
if inst_index < 20:
tf.logging.info("*** Example ***")
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in instance.tokens]))
for feature_name in features.keys():
feature = features[feature_name]
values = []
if feature.int64_list.value:
values = feature.int64_list.value
elif feature.float_list.value:
values = feature.float_list.value
tf.logging.info(
"%s: %s" % (feature_name, " ".join([str(x) for x in values])))
for writer in writers:
writer.close()
tf.logging.info("Wrote %d total instances", total_written)
def create_int_feature(values):
feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return feature
def create_float_feature(values):
feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
return feature
def create_training_instances(input_files, tokenizer, max_seq_length,
dupe_factor, short_seq_prob, masked_lm_prob,
max_predictions_per_seq, rng):
"""Create `TrainingInstance`s from raw text."""
all_documents = [[]]
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
print("create_training_instances.started...")
for input_file in input_files:
with tf.gfile.GFile(input_file, "r") as reader:
while True:
line = tokenization.convert_to_unicode(reader.readline().replace("",""))# .replace("”","")) # 将、”替换掉。
if not line:
break
line = line.strip()
# Empty lines are used as document delimiters
if not line:
all_documents.append([])
tokens = tokenizer.tokenize(line)
if tokens:
all_documents[-1].append(tokens)
# Remove empty documents
all_documents = [x for x in all_documents if x]
rng.shuffle(all_documents)
vocab_words = list(tokenizer.vocab.keys())
instances = []
for _ in range(dupe_factor):
for document_index in range(len(all_documents)):
instances.extend(
create_instances_from_document(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
rng.shuffle(instances)
print("create_training_instances.ended...")
return instances
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
def get_new_segment(segment): # 新增的方法 ####
"""
输入一句话,返回一句经过处理的话: 为了支持中文全称mask,将被分开的词,将上特殊标记("#"),使得后续处理模块,能够知道哪些字是属于同一个词的。
:param segment: 一句话
:return: 一句处理过的话
"""
seq_cws = jieba.lcut("".join(segment))
seq_cws_dict = {x: 1 for x in seq_cws}
new_segment = []
i = 0
while i < len(segment):
if len(re.findall('[\u4E00-\u9FA5]', segment[i]))==0: # 不是中文的,原文加进去。
new_segment.append(segment[i])
i += 1
continue
has_add = False
for length in range(3,0,-1):
if i+length>len(segment):
continue
if ''.join(segment[i:i+length]) in seq_cws_dict:
new_segment.append(segment[i])
for l in range(1, length):
new_segment.append('##' + segment[i+l])
i += length
has_add = True
break
if not has_add:
new_segment.append(segment[i])
i += 1
return new_segment
def get_raw_instance(document,max_sequence_length): # 新增的方法 TODO need check again to ensure full use of data
"""
获取初步的训练实例,将整段按照max_sequence_length切分成多个部分,并以多个处理好的实例的形式返回。
:param document: 一整段
:param max_sequence_length:
:return: a list. each element is a sequence of text
"""
max_sequence_length_allowed=max_sequence_length-2
document = [seq for seq in document if len(seq)max_sequence_length_allowed/2: # /2
result_list.append(curr_seq)
# # 计算总共可以得到多少份
# num_instance=int(len(big_list)/max_sequence_length_allowed)+1
# print("num_instance:",num_instance)
# # 切分成多份,添加到列表中
# result_list=[]
# for j in range(num_instance):
# index=j*max_sequence_length_allowed
# end_index=index+max_sequence_length_allowed if j!=num_instance-1 else -1
# result_list.append(big_list[index:end_index])
return result_list
def create_instances_from_document( # 新增的方法
# 目标按照RoBERTa的思路,使用DOC-SENTENCES,并会去掉NSP任务: 从一个文档中连续的获得文本,直到达到最大长度。如果是从下一个文档中获得,那么加上一个分隔符
# document即一整段话,包含多个句子。每个句子叫做segment.
# 给定一个document即一整段话,生成一些instance.
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
#target_seq_length = max_num_tokens
#if rng.random() < short_seq_prob:
# target_seq_length = rng.randint(2, max_num_tokens)
instances = []
raw_text_list_list=get_raw_instance(document, max_seq_length) # document即一整段话,包含多个句子。每个句子叫做segment.
for j, raw_text_list in enumerate(raw_text_list_list):
####################################################################################################################
raw_text_list = get_new_segment(raw_text_list) # 结合分词的中文的whole mask设置即在需要的地方加上“##”
# 1、设置token, segment_ids
is_random_next=True # this will not be used, so it's value doesn't matter
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in raw_text_list:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
################################################################################################################
# 2、调用原有的方法
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance(
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
return instances
def create_instances_from_document_original(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob:
target_seq_length = rng.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
instances = []
current_chunk = []
current_length = 0
i = 0
print("document_index:",document_index,"document:",type(document)," ;document:",document) # document即一整段话,包含多个句子。每个句子叫做segment.
while i < len(document):
segment = document[i] # 取到一个部分(可能是一段话)
print("i:",i," ;segment:",segment)
####################################################################################################################
segment = get_new_segment(segment) # 结合分词的中文的whole mask设置即在需要的地方加上“##”
###################################################################################################################
current_chunk.append(segment)
current_length += len(segment)
print("#####condition:",i == len(document) - 1 or current_length >= target_seq_length)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = rng.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
# Random next
is_random_next = False
if len(current_chunk) == 1 or rng.random() < 0.5:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = rng.randint(0, len(all_documents) - 1)
if random_document_index != document_index:
break
random_document = all_documents[random_document_index]
random_start = rng.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance(
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = []
current_length = 0
i += 1
return instances
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
["index", "label"])
def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
# Whole Word Masking means that if we mask all of the wordpieces
# corresponding to an original word. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
rng.shuffle(cand_indexes)
output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t))>0 else t for t in tokens] # 去掉"##"
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index]))>0 else tokens[index] # 去掉"##"
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)
masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
# tf.logging.info('%s' % (tokens))
# tf.logging.info('%s' % (output_tokens))
return (output_tokens, masked_lm_positions, masked_lm_labels)
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
"""Truncates a pair of sequences to a maximum sequence length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
assert len(trunc_tokens) >= 1
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if rng.random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
FLAGS.max_predictions_per_seq, output_files)
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("output_file")
flags.mark_flag_as_required("vocab_file")
tf.app.run()
================================================
FILE: resources/shell_scripts/create_pretrain_data_batch_webtext.sh
================================================
#!/usr/bin/env bash
echo $1,$2
BERT_BASE_DIR=./bert_config
for((i=$1;i<=$2;i++));
do
python3 create_pretraining_data.py --do_whole_word_mask=True --input_file=gs://raw_text/web_text_zh_raw/web_text_zh_$i.txt \
--output_file=gs://albert_zh/tf_records/tf_web_text_zh_$i.tfrecord --vocab_file=$BERT_BASE_DIR/vocab.txt --do_lower_case=True \
--max_seq_length=512 --max_predictions_per_seq=76 --masked_lm_prob=0.15
done
================================================
FILE: run_classifier.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import csv
import os
import modeling
import optimization_finetuning as optimization
import tokenization
import tensorflow as tf
# from loss import bi_tempered_logistic_loss
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"data_dir", None,
"The input data dir. Should contain the .tsv files (or other data files) "
"for the task.")
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("task_name", None, "The name of the task to train.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
battches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
def file_based_input_fn_builder(input_file, seq_length, is_training,
drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
"is_real_example": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
ln_type = bert_config.ln_type
if ln_type == 'preln': # add by brightmart, 10-06. if it is preln, we need to an additonal layer: layer normalization as suggested in paper "ON LAYER NORMALIZATION IN THE TRANSFORMER ARCHITECTURE"
print("ln_type is preln. add LN layer.")
output_layer=layer_norm(output_layer)
else:
print("ln_type is postln or other,do nothing.")
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) # todo 08-29 try temp-loss
###############bi_tempered_logistic_loss############################################################################
# print("##cross entropy loss is used...."); tf.logging.info("##cross entropy loss is used....")
# t1=0.9 #t1=0.90
# t2=1.05 #t2=1.05
# per_example_loss=bi_tempered_logistic_loss(log_probs,one_hot_labels,t1,t2,label_smoothing=0.1,num_iters=5) # TODO label_smoothing=0.0
#tf.logging.info("per_example_loss:"+str(per_example_loss.shape))
##############bi_tempered_logistic_loss#############################################################################
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, logits, probabilities)
def layer_norm(input_tensor, name=None):
"""Run layer normalization on the last dimension of the tensor."""
return tf.contrib.layers.layer_norm(
inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:
def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits, is_real_example):
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}
eval_metrics = (metric_fn,
[per_example_loss, label_ids, logits, is_real_example])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
predictions={"probabilities": probabilities},
scaffold_fn=scaffold_fn)
return output_spec
return model_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
num_examples = len(features)
# This is for demo purposes and does NOT scale to large data sets. We do
# not use Dataset.from_generator() because that uses tf.py_func which is
# not TPU compatible. The right way to load data is with TFRecordReader.
d = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
return d
return input_fn
class LCQMCPairClassificationProcessor(DataProcessor): # TODO NEED CHANGE2
"""Processor for the internal data set. sentence pair classification"""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
# dev_0827.tsv
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.txt")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
#return ["-1","0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
print("length of lines:",len(lines))
for (i, line) in enumerate(lines):
#print('#i:',i,line)
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
try:
label = tokenization.convert_to_unicode(line[2])
text_a = tokenization.convert_to_unicode(line[0])
text_b = tokenization.convert_to_unicode(line[1])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
except Exception:
print('###error.i:', i, line)
return examples
class SentencePairClassificationProcessor(DataProcessor):
"""Processor for the internal data set. sentence pair classification"""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train_0827.tsv")), "train")
# dev_0827.tsv
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_0827.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test_0827.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
#return ["-1","0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
print("length of lines:",len(lines))
for (i, line) in enumerate(lines):
#print('#i:',i,line)
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
try:
label = tokenization.convert_to_unicode(line[0])
text_a = tokenization.convert_to_unicode(line[1])
text_b = tokenization.convert_to_unicode(line[2])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
except Exception:
print('###error.i:', i, line)
return examples
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
processors = {
"sentence_pair": SentencePairClassificationProcessor,
"lcqmc_pair":LCQMCPairClassificationProcessor,
"lcqmc": LCQMCPairClassificationProcessor
}
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError(
"At least one of `do_train`, `do_eval` or `do_predict' must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
tf.gfile.MakeDirs(FLAGS.output_dir)
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
# Cloud TPU: Invalid TPU configuration, ensure ClusterResolver is passed to tpu.
print("###tpu_cluster_resolver:",tpu_cluster_resolver)
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
train_examples = None
num_train_steps = None
num_warmup_steps = None
if FLAGS.do_train:
train_examples =processor.get_train_examples(FLAGS.data_dir) # TODO
print("###length of total train_examples:",len(train_examples))
num_train_steps = int(len(train_examples)/ FLAGS.train_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size,
predict_batch_size=FLAGS.predict_batch_size)
if FLAGS.do_train:
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
train_file_exists=os.path.exists(train_file)
print("###train_file_exists:", train_file_exists," ;train_file:",train_file)
if not train_file_exists: # if tf_record file not exist, convert from raw text file. # TODO
file_based_convert_examples_to_features(train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=train_file,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
if FLAGS.do_eval:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
num_actual_eval_examples = len(eval_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on. These do NOT count towards the metric (all tf.metrics
# support a per-instance weight, and these get a weight of 0.0).
while len(eval_examples) % FLAGS.eval_batch_size != 0:
eval_examples.append(PaddingInputExample())
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(eval_examples), num_actual_eval_examples,
len(eval_examples) - num_actual_eval_examples)
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
# This tells the estimator to run through the entire set.
eval_steps = None
# However, if running eval on the TPU, you will need to specify the
# number of steps.
if FLAGS.use_tpu:
assert len(eval_examples) % FLAGS.eval_batch_size == 0
eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
eval_drop_remainder = True if FLAGS.use_tpu else False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
#######################################################################################################################
# evaluate all checkpoints; you can use the checkpoint with the best dev accuarcy
steps_and_files = []
filenames = tf.gfile.ListDirectory(FLAGS.output_dir)
for filename in filenames:
if filename.endswith(".index"):
ckpt_name = filename[:-6]
cur_filename = os.path.join(FLAGS.output_dir, ckpt_name)
global_step = int(cur_filename.split("-")[-1])
tf.logging.info("Add {} to eval list.".format(cur_filename))
steps_and_files.append([global_step, cur_filename])
steps_and_files = sorted(steps_and_files, key=lambda x: x[0])
output_eval_file = os.path.join(FLAGS.data_dir, "eval_results_albert_zh.txt")
print("output_eval_file:",output_eval_file)
tf.logging.info("output_eval_file:"+output_eval_file)
with tf.gfile.GFile(output_eval_file, "w") as writer:
for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]):
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps, checkpoint_path=filename)
tf.logging.info("***** Eval results %s *****" % (filename))
writer.write("***** Eval results %s *****\n" % (filename))
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
#######################################################################################################################
#result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
#
#output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
#with tf.gfile.GFile(output_eval_file, "w") as writer:
# tf.logging.info("***** Eval results *****")
# for key in sorted(result.keys()):
# tf.logging.info(" %s = %s", key, str(result[key]))
# writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
num_actual_predict_examples = len(predict_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on.
while len(predict_examples) % FLAGS.predict_batch_size != 0:
predict_examples.append(PaddingInputExample())
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
file_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(predict_examples), num_actual_predict_examples,
len(predict_examples) - num_actual_predict_examples)
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = True if FLAGS.use_tpu else False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
result = estimator.predict(input_fn=predict_input_fn)
output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
with tf.gfile.GFile(output_predict_file, "w") as writer:
num_written_lines = 0
tf.logging.info("***** Predict results *****")
for (i, prediction) in enumerate(result):
probabilities = prediction["probabilities"]
if i >= num_actual_predict_examples:
break
output_line = "\t".join(
str(class_probability)
for class_probability in probabilities) + "\n"
writer.write(output_line)
num_written_lines += 1
assert num_written_lines == num_actual_predict_examples
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: run_classifier_clue.py
================================================
# -*- coding: utf-8 -*-
# @Author: bo.shi
# @Date: 2019-11-04 09:56:36
# @Last Modified by: bo.shi
# @Last Modified time: 2019-12-04 14:29:04
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import modeling
import optimization_finetuning as optimization
import tokenization
import tensorflow as tf
# from loss import bi_tempered_logistic_loss
import sys
sys.path.append('..')
from classifier_utils import *
flags = tf.flags
FLAGS = flags.FLAGS
# Required parameters
flags.DEFINE_string(
"data_dir", None,
"The input data dir. Should contain the .tsv files (or other data files) "
"for the task.")
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("task_name", None, "The name of the task to train.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
# Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
def convert_single_example_for_inews(ex_index, tokens_a, tokens_b, label_map, max_seq_length,
tokenizer, example):
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def convert_example_list_for_inews(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return [InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)]
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
must_len = len(tokens_a) + 3
extra_len = max_seq_length - must_len
feature_list = []
if example.text_b and extra_len > 0:
extra_num = int((len(tokens_b) - 1) / extra_len) + 1
for num in range(extra_num):
max_len = min((num + 1) * extra_len, len(tokens_b))
tokens_b_sub = tokens_b[num * extra_len: max_len]
feature = convert_single_example_for_inews(
ex_index, tokens_a, tokens_b_sub, label_map, max_seq_length, tokenizer, example)
feature_list.append(feature)
else:
feature = convert_single_example_for_inews(
ex_index, tokens_a, tokens_b, label_map, max_seq_length, tokenizer, example)
feature_list.append(feature)
return feature_list
def file_based_convert_examples_to_features_for_inews(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
num_example = 0
for (ex_index, example) in enumerate(examples):
if ex_index % 1000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature_list = convert_example_list_for_inews(ex_index, example, label_list,
max_seq_length, tokenizer)
num_example += len(feature_list)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
for feature in feature_list:
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
tf.logging.info("feature num: %s", num_example)
writer.close()
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
def file_based_input_fn_builder(input_file, seq_length, is_training,
drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
"is_real_example": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
ln_type = bert_config.ln_type
if ln_type == 'preln': # add by brightmart, 10-06. if it is preln, we need to an additonal layer: layer normalization as suggested in paper "ON LAYER NORMALIZATION IN THE TRANSFORMER ARCHITECTURE"
print("ln_type is preln. add LN layer.")
output_layer = layer_norm(output_layer)
else:
print("ln_type is postln or other,do nothing.")
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs,
axis=-1) # todo 08-29 try temp-loss
###############bi_tempered_logistic_loss############################################################################
# print("##cross entropy loss is used...."); tf.logging.info("##cross entropy loss is used....")
# t1=0.9 #t1=0.90
# t2=1.05 #t2=1.05
# per_example_loss=bi_tempered_logistic_loss(log_probs,one_hot_labels,t1,t2,label_smoothing=0.1,num_iters=5) # TODO label_smoothing=0.0
# tf.logging.info("per_example_loss:"+str(per_example_loss.shape))
##############bi_tempered_logistic_loss#############################################################################
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, logits, probabilities)
def layer_norm(input_tensor, name=None):
"""Run layer normalization on the last dimension of the tensor."""
return tf.contrib.layers.layer_norm(
inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:
def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits, is_real_example):
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}
eval_metrics = (metric_fn,
[per_example_loss, label_ids, logits, is_real_example])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
predictions={"probabilities": probabilities},
scaffold_fn=scaffold_fn)
return output_spec
return model_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
num_examples = len(features)
# This is for demo purposes and does NOT scale to large data sets. We do
# not use Dataset.from_generator() because that uses tf.py_func which is
# not TPU compatible. The right way to load data is with TFRecordReader.
d = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
return d
return input_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
processors = {
"xnli": XnliProcessor,
"tnews": TnewsProcessor,
"afqmc": AFQMCProcessor,
"iflytek": iFLYTEKDataProcessor,
"copa": COPAProcessor,
"cmnli": CMNLIProcessor,
"wsc": WSCProcessor,
"csl": CslProcessor,
"copa": COPAProcessor,
}
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError(
"At least one of `do_train`, `do_eval` or `do_predict' must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
tf.gfile.MakeDirs(FLAGS.output_dir)
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
# Cloud TPU: Invalid TPU configuration, ensure ClusterResolver is passed to tpu.
print("###tpu_cluster_resolver:", tpu_cluster_resolver)
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
train_examples = None
num_train_steps = None
num_warmup_steps = None
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir) # TODO
print("###length of total train_examples:", len(train_examples))
num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size,
predict_batch_size=FLAGS.predict_batch_size)
if FLAGS.do_train:
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
train_file_exists = os.path.exists(train_file)
print("###train_file_exists:", train_file_exists, " ;train_file:", train_file)
if not train_file_exists: # if tf_record file not exist, convert from raw text file. # TODO
if task_name == "inews":
file_based_convert_examples_to_features_for_inews(
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
else:
file_based_convert_examples_to_features(
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=train_file,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
if FLAGS.do_eval:
# dev dataset
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
num_actual_eval_examples = len(eval_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on. These do NOT count towards the metric (all tf.metrics
# support a per-instance weight, and these get a weight of 0.0).
while len(eval_examples) % FLAGS.eval_batch_size != 0:
eval_examples.append(PaddingInputExample())
eval_file = os.path.join(FLAGS.output_dir, "dev.tf_record")
if task_name == "inews":
file_based_convert_examples_to_features_for_inews(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
else:
file_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(eval_examples), num_actual_eval_examples,
len(eval_examples) - num_actual_eval_examples)
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
# This tells the estimator to run through the entire set.
eval_steps = None
# However, if running eval on the TPU, you will need to specify the
# number of steps.
if FLAGS.use_tpu:
assert len(eval_examples) % FLAGS.eval_batch_size == 0
eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
eval_drop_remainder = True if FLAGS.use_tpu else False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
#######################################################################################################################
# evaluate all checkpoints; you can use the checkpoint with the best dev accuarcy
steps_and_files = []
filenames = tf.gfile.ListDirectory(FLAGS.output_dir)
for filename in filenames:
if filename.endswith(".index"):
ckpt_name = filename[:-6]
cur_filename = os.path.join(FLAGS.output_dir, ckpt_name)
global_step = int(cur_filename.split("-")[-1])
tf.logging.info("Add {} to eval list.".format(cur_filename))
steps_and_files.append([global_step, cur_filename])
steps_and_files = sorted(steps_and_files, key=lambda x: x[0])
output_eval_file = os.path.join(FLAGS.data_dir, "dev_results_albert_zh.txt")
print("output_eval_file:", output_eval_file)
tf.logging.info("output_eval_file:" + output_eval_file)
with tf.gfile.GFile(output_eval_file, "w") as writer:
for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]):
result = estimator.evaluate(input_fn=eval_input_fn,
steps=eval_steps, checkpoint_path=filename)
tf.logging.info("***** Eval results %s *****" % (filename))
writer.write("***** Eval results %s *****\n" % (filename))
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
#######################################################################################################################
# result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
#
# output_eval_file = os.path.join(FLAGS.output_dir, "dev_results_albert_zh.txt")
# with tf.gfile.GFile(output_eval_file, "w") as writer:
# tf.logging.info("***** Eval results *****")
# for key in sorted(result.keys()):
# tf.logging.info(" %s = %s", key, str(result[key]))
# writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
num_actual_predict_examples = len(predict_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on.
while len(predict_examples) % FLAGS.predict_batch_size != 0:
predict_examples.append(PaddingInputExample())
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
if task_name == "inews":
file_based_convert_examples_to_features_for_inews(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
else:
file_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(predict_examples), num_actual_predict_examples,
len(predict_examples) - num_actual_predict_examples)
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = True if FLAGS.use_tpu else False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
result = estimator.predict(input_fn=predict_input_fn)
index2label_map = {}
for (i, label) in enumerate(label_list):
index2label_map[i] = label
output_predict_file_label_name = task_name + "_predict.json"
output_predict_file_label = os.path.join(FLAGS.output_dir, output_predict_file_label_name)
output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
with tf.gfile.GFile(output_predict_file_label, "w") as writer_label:
with tf.gfile.GFile(output_predict_file, "w") as writer:
num_written_lines = 0
tf.logging.info("***** Predict results *****")
for (i, prediction) in enumerate(result):
probabilities = prediction["probabilities"]
label_index = probabilities.argmax(0)
if i >= num_actual_predict_examples:
break
output_line = "\t".join(
str(class_probability)
for class_probability in probabilities) + "\n"
test_label_dict = {}
test_label_dict["id"] = i
test_label_dict["label"] = str(index2label_map[label_index])
if task_name == "tnews":
test_label_dict["label_desc"] = ""
writer.write(output_line)
json.dump(test_label_dict, writer_label)
writer_label.write("\n")
num_written_lines += 1
assert num_written_lines == num_actual_predict_examples
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: run_classifier_clue.sh
================================================
# @Author: bo.shi
# @Date: 2020-03-15 16:11:00
# @Last Modified by: bo.shi
# @Last Modified time: 2020-04-02 17:54:05
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES="0"
CURRENT_DIR=$(cd -P -- "$(dirname -- "$0")" && pwd -P)
CLUE_DATA_DIR=$CURRENT_DIR/CLUEdataset
ALBERT_TINY_DIR=$CURRENT_DIR/albert_tiny
download_data(){
TASK_NAME=$1
if [ ! -d $CLUE_DATA_DIR ]; then
mkdir -p $CLUE_DATA_DIR
echo "makedir $CLUE_DATA_DIR"
fi
cd $CLUE_DATA_DIR
if [ ! -d ${TASK_NAME} ]; then
mkdir $TASK_NAME
echo "make dataset dir $CLUE_DATA_DIR/$TASK_NAME"
fi
cd $TASK_NAME
if [ ! -f "train.json" ] || [ ! -f "dev.json" ] || [ ! -f "test.json" ]; then
rm *
wget https://storage.googleapis.com/cluebenchmark/tasks/${TASK_NAME}_public.zip
unzip ${TASK_NAME}_public.zip
rm ${TASK_NAME}_public.zip
else
echo "data exists"
fi
echo "Finish download dataset."
}
download_model(){
if [ ! -d $ALBERT_TINY_DIR ]; then
mkdir -p $ALBERT_TINY_DIR
echo "makedir $ALBERT_TINY_DIR"
fi
cd $ALBERT_TINY_DIR
if [ ! -f "albert_config_tiny.json" ] || [ ! -f "vocab.txt" ] || [ ! -f "checkpoint" ] || [ ! -f "albert_model.ckpt.index" ] || [ ! -f "albert_model.ckpt.meta" ] || [ ! -f "albert_model.ckpt.data-00000-of-00001" ]; then
rm *
wget -c https://storage.googleapis.com/albert_zh/albert_tiny_489k.zip
unzip albert_tiny_489k.zip
rm albert_tiny_489k.zip
else
echo "model exists"
fi
echo "Finish download model."
}
run_task() {
TASK_NAME=$1
download_data $TASK_NAME
download_model $MODEL_NAME
DATA_DIR=$CLUE_DATA_DIR/${TASK_NAME}
PREV_TRAINED_MODEL_DIR=$ALBERT_TINY_DIR
MAX_SEQ_LENGTH=$2
TRAIN_BATCH_SIZE=$3
LEARNING_RATE=$4
NUM_TRAIN_EPOCHS=$5
SAVE_CHECKPOINTS_STEPS=$6
OUTPUT_DIR=$CURRENT_DIR/${TASK_NAME}_output/
COMMON_ARGS="
--task_name=$TASK_NAME \
--data_dir=$DATA_DIR \
--vocab_file=$PREV_TRAINED_MODEL_DIR/vocab.txt \
--bert_config_file=$PREV_TRAINED_MODEL_DIR/albert_config_tiny.json \
--init_checkpoint=$PREV_TRAINED_MODEL_DIR/albert_model.ckpt \
--max_seq_length=$MAX_SEQ_LENGTH \
--train_batch_size=$TRAIN_BATCH_SIZE \
--learning_rate=$LEARNING_RATE \
--num_train_epochs=$NUM_TRAIN_EPOCHS \
--save_checkpoints_steps=$SAVE_CHECKPOINTS_STEPS \
--output_dir=$OUTPUT_DIR \
--keep_checkpoint_max=0 \
"
cd $CURRENT_DIR
echo "Start running..."
python run_classifier_clue.py \
$COMMON_ARGS \
--do_train=true \
--do_eval=false \
--do_predict=false
echo "Start predict..."
python run_classifier_clue.py \
$COMMON_ARGS \
--do_train=false \
--do_eval=true \
--do_predict=true
}
##command##task_name##model_name##max_seq_length##train_batch_size##learning_rate##num_train_epochs##save_checkpoints_steps##tpu_ip
run_task afqmc 128 16 2e-5 3 300
run_task cmnli 128 64 3e-5 2 300
run_task csl 128 16 1e-5 5 100
run_task iflytek 128 32 2e-5 3 300
run_task tnews 128 16 2e-5 3 300
run_task wsc 128 16 1e-5 10 10
================================================
FILE: run_classifier_lcqmc.sh
================================================
#!/usr/bin/env bash
# @Author: bo.shi, https://github.com/chineseGLUE/chineseGLUE
# @Date: 2019-11-04 09:56:36
# @Last Modified by: bright
# @Last Modified time: 2019-11-10 09:00:00
TASK_NAME="lcqmc"
MODEL_NAME="albert_tiny_zh"
CURRENT_DIR=$(cd -P -- "$(dirname -- "$0")" && pwd -P)
export CUDA_VISIBLE_DEVICES="0"
export ALBERT_CONFIG_DIR=$CURRENT_DIR/albert_config
export ALBERT_PRETRAINED_MODELS_DIR=$CURRENT_DIR/prev_trained_model
export ALBERT_TINY_DIR=$ALBERT_PRETRAINED_MODELS_DIR/$MODEL_NAME
#mkdir chineseGLUEdatasets
export GLUE_DATA_DIR=$CURRENT_DIR/chineseGLUEdatasets
# download and unzip dataset
if [ ! -d $GLUE_DATA_DIR ]; then
mkdir -p $GLUE_DATA_DIR
echo "makedir $GLUE_DATA_DIR"
fi
cd $GLUE_DATA_DIR
if [ ! -d $TASK_NAME ]; then
mkdir $TASK_NAME
echo "makedir $GLUE_DATA_DIR/$TASK_NAME"
fi
cd $TASK_NAME
echo "Please try again if the data is not downloaded successfully."
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/train.txt
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/dev.txt
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/test.txt
echo "Finish download dataset."
# download model
if [ ! -d $ALBERT_TINY_DIR ]; then
mkdir -p $ALBERT_TINY_DIR
echo "makedir $ALBERT_TINY_DIR"
fi
cd $ALBERT_TINY_DIR
if [ ! -f "albert_config_tiny.json" ] || [ ! -f "vocab.txt" ] || [ ! -f "checkpoint" ] || [ ! -f "albert_model.ckpt.index" ] || [ ! -f "albert_model.ckpt.meta" ] || [ ! -f "albert_model.ckpt.data-00000-of-00001" ]; then
rm *
wget https://storage.googleapis.com/albert_zh/albert_tiny_489k.zip
unzip albert_tiny_489k.zip
rm albert_tiny_489k.zip
else
echo "model exists"
fi
echo "Finish download model."
# run task
cd $CURRENT_DIR
echo "Start running..."
python run_classifier.py \
--task_name=$TASK_NAME \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DATA_DIR/$TASK_NAME \
--vocab_file=$ALBERT_CONFIG_DIR/vocab.txt \
--bert_config_file=$ALBERT_CONFIG_DIR/albert_config_tiny.json \
--init_checkpoint=$ALBERT_TINY_DIR/albert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=64 \
--learning_rate=1e-4 \
--num_train_epochs=5.0 \
--output_dir=$CURRENT_DIR/${TASK_NAME}_output/
================================================
FILE: run_classifier_sp_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""BERT finetuning runner with sentence piece tokenization."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import csv
import os
import six
from six.moves import zip
import tensorflow as tf
import modeling_google as modeling
import optimization_google as optimization
import tokenization_google as tokenization
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"data_dir", None,
"The input data dir. Should contain the .tsv files (or other data files) "
"for the task.")
flags.DEFINE_string(
"albert_config_file", None,
"The config json file corresponding to the pre-trained ALBERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("task_name", None, "The name of the task to train.")
flags.DEFINE_string(
"vocab_file", None,
"The vocabulary file that the ALBERT model was trained on.")
flags.DEFINE_string("spm_model_file", None,
"The model file for sentence piece tokenization.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained ALBERT model).")
flags.DEFINE_bool(
"use_pooled_output", True, "Whether to use the CLS token outputs")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 512,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
battches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
class XnliProcessor(DataProcessor):
"""Processor for the XNLI data set."""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
lines = self._read_tsv(
os.path.join(data_dir, "multinli",
"multinli.train.%s.tsv" % self.language))
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "train-%d" % (i)
text_a = tokenization.convert_to_unicode(line[0])
text_b = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[2])
if label == tokenization.convert_to_unicode("contradictory"):
label = tokenization.convert_to_unicode("contradiction")
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def get_dev_examples(self, data_dir):
"""See base class."""
lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "dev-%d" % (i)
language = tokenization.convert_to_unicode(line[0])
if language != tokenization.convert_to_unicode(self.language):
continue
text_a = tokenization.convert_to_unicode(line[6])
text_b = tokenization.convert_to_unicode(line[7])
label = tokenization.convert_to_unicode(line[1])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
class MnliProcessor(DataProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
"dev_matched")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
# Note(mingdachen): We will rely on this guid for GLUE submission.
guid = tokenization.preprocess_text(line[0], lower=FLAGS.do_lower_case)
text_a = tokenization.preprocess_text(line[8], lower=FLAGS.do_lower_case)
text_b = tokenization.preprocess_text(line[9], lower=FLAGS.do_lower_case)
if set_type == "test":
label = "contradiction"
else:
label = tokenization.preprocess_text(line[-1])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class LCQMCPairClassificationProcessor(DataProcessor):
"""Processor for the internal data set. sentence pair classification"""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
# dev_0827.tsv
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.txt")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.txt")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
print("length of lines:",len(lines))
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
try:
label = tokenization.convert_to_unicode(line[2])
text_a = tokenization.convert_to_unicode(line[0])
text_b = tokenization.convert_to_unicode(line[1])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
except Exception:
print('###error.i:', i, line)
return examples
class MrpcProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = tokenization.preprocess_text(line[3], lower=FLAGS.do_lower_case)
text_b = tokenization.preprocess_text(line[4], lower=FLAGS.do_lower_case)
if set_type == "test":
guid = line[0]
label = "0"
else:
label = tokenization.preprocess_text(line[0])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class ColaProcessor(DataProcessor):
"""Processor for the CoLA data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
# Only the test set has a header
if set_type == "test" and i == 0:
continue
guid = "%s-%s" % (set_type, i)
if set_type == "test":
guid = line[0]
text_a = tokenization.preprocess_text(
line[1], lower=FLAGS.do_lower_case)
label = "0"
else:
text_a = tokenization.preprocess_text(
line[3], lower=FLAGS.do_lower_case)
label = tokenization.preprocess_text(line[1])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in ALBERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
def file_based_input_fn_builder(input_file, seq_length, is_training,
drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
"is_real_example": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def create_model(albert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.AlbertModel(
config=albert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
if FLAGS.use_pooled_output:
tf.logging.info("using pooled output")
output_layer = model.get_pooled_output()
else:
tf.logging.info("using meaned output")
output_layer = tf.reduce_mean(model.get_sequence_output(), axis=1)
hidden_size = output_layer.shape[-1].value
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, probabilities, predictions)
def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, probabilities, predictions) = \
create_model(albert_config, is_training, input_ids, input_mask,
segment_ids, label_ids, num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:
def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, predictions, is_real_example):
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}
eval_metrics = (metric_fn,
[per_example_loss, label_ids,
predictions, is_real_example])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
predictions={"probabilities": probabilities,
"predictions": predictions},
scaffold_fn=scaffold_fn)
return output_spec
return model_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
num_examples = len(features)
# This is for demo purposes and does NOT scale to large data sets. We do
# not use Dataset.from_generator() because that uses tf.py_func which is
# not TPU compatible. The right way to load data is with TFRecordReader.
d = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
return d
return input_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
processors = {
"cola": ColaProcessor,
"mnli": MnliProcessor,
"mrpc": MrpcProcessor,
"xnli": XnliProcessor,
"lcqmc_pair": LCQMCPairClassificationProcessor
}
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError(
"At least one of `do_train`, `do_eval` or `do_predict' must be True.")
albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file)
if FLAGS.max_seq_length > albert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the ALBERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, albert_config.max_position_embeddings))
tf.gfile.MakeDirs(FLAGS.output_dir)
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case,
spm_model_file=FLAGS.spm_model_file)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
train_examples = None
num_train_steps = None
num_warmup_steps = None
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir)
num_train_steps = int(
len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
model_fn = model_fn_builder(
albert_config=albert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size,
predict_batch_size=FLAGS.predict_batch_size)
if FLAGS.do_train:
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
file_based_convert_examples_to_features(
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=train_file,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
if FLAGS.do_eval:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
num_actual_eval_examples = len(eval_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on. These do NOT count towards the metric (all tf.metrics
# support a per-instance weight, and these get a weight of 0.0).
while len(eval_examples) % FLAGS.eval_batch_size != 0:
eval_examples.append(PaddingInputExample())
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(eval_examples), num_actual_eval_examples,
len(eval_examples) - num_actual_eval_examples)
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
# This tells the estimator to run through the entire set.
eval_steps = None
# However, if running eval on the TPU, you will need to specify the
# number of steps.
if FLAGS.use_tpu:
assert len(eval_examples) % FLAGS.eval_batch_size == 0
eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
eval_drop_remainder = True if FLAGS.use_tpu else False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
#######################################################################################################################
# evaluate all checkpoints; you can use the checkpoint with the best dev accuarcy
steps_and_files = []
filenames = tf.gfile.ListDirectory(FLAGS.output_dir)
for filename in filenames:
if filename.endswith(".index"):
ckpt_name = filename[:-6]
cur_filename = os.path.join(FLAGS.output_dir, ckpt_name)
global_step = int(cur_filename.split("-")[-1])
tf.logging.info("Add {} to eval list.".format(cur_filename))
steps_and_files.append([global_step, cur_filename])
steps_and_files = sorted(steps_and_files, key=lambda x: x[0])
output_eval_file = os.path.join(FLAGS.data_dir, "eval_results_albert_zh.txt")
print("output_eval_file:",output_eval_file)
tf.logging.info("output_eval_file:"+output_eval_file)
with tf.gfile.GFile(output_eval_file, "w") as writer:
for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]):
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps, checkpoint_path=filename)
tf.logging.info("***** Eval results %s *****" % (filename))
writer.write("***** Eval results %s *****\n" % (filename))
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
#######################################################################################################################
# result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
# output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
# with tf.gfile.GFile(output_eval_file, "w") as writer:
# tf.logging.info("***** Eval results *****")
# for key in sorted(result.keys()):
# tf.logging.info(" %s = %s", key, str(result[key]))
# writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
num_actual_predict_examples = len(predict_examples)
if FLAGS.use_tpu:
# TPU requires a fixed batch size for all batches, therefore the number
# of examples must be a multiple of the batch size, or else examples
# will get dropped. So we pad with fake examples which are ignored
# later on.
while len(predict_examples) % FLAGS.predict_batch_size != 0:
predict_examples.append(PaddingInputExample())
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
file_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(predict_examples), num_actual_predict_examples,
len(predict_examples) - num_actual_predict_examples)
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = True if FLAGS.use_tpu else False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
result = estimator.predict(input_fn=predict_input_fn)
output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
output_submit_file = os.path.join(FLAGS.output_dir, "submit_results.tsv")
with tf.gfile.GFile(output_predict_file, "w") as pred_writer,\
tf.gfile.GFile(output_submit_file, "w") as sub_writer:
num_written_lines = 0
tf.logging.info("***** Predict results *****")
for (i, (example, prediction)) in\
enumerate(zip(predict_examples, result)):
probabilities = prediction["probabilities"]
if i >= num_actual_predict_examples:
break
output_line = "\t".join(
str(class_probability)
for class_probability in probabilities) + "\n"
pred_writer.write(output_line)
actual_label = label_list[int(prediction["predictions"])]
sub_writer.write(
six.ensure_str(example.guid) + "\t" + actual_label + "\n")
num_written_lines += 1
assert num_written_lines == num_actual_predict_examples
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("albert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: run_pretraining.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run masked LM/next sentence masked_lm pre-training for BERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import modeling
import optimization
import tensorflow as tf
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string(
"input_file", None,
"Input TF example files (can be a glob or comma separated).")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded. Must match data generation.")
flags.DEFINE_integer(
"max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence. "
"Must match data generation.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.")
flags.DEFINE_integer("num_warmup_steps", 10000, "Number of warmup steps.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
def model_fn_builder(bert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
masked_lm_positions = features["masked_lm_positions"]
masked_lm_ids = features["masked_lm_ids"]
masked_lm_weights = features["masked_lm_weights"]
next_sentence_labels = features["next_sentence_labels"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
(masked_lm_loss,
masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
bert_config, model.get_sequence_output(), model.get_embedding_table(),model.get_embedding_table_2(),
masked_lm_positions, masked_lm_ids, masked_lm_weights)
(next_sentence_loss, next_sentence_example_loss,
next_sentence_log_probs) = get_next_sentence_output(
bert_config, model.get_pooled_output(), next_sentence_labels)
total_loss = masked_lm_loss + next_sentence_loss
tvars = tf.trainable_variables()
initialized_variable_names = {}
print("init_checkpoint:",init_checkpoint)
scaffold_fn = None
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:
def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, next_sentence_example_loss,
next_sentence_log_probs, next_sentence_labels):
"""Computes the loss and accuracy of the model."""
masked_lm_log_probs = tf.reshape(masked_lm_log_probs,[-1, masked_lm_log_probs.shape[-1]])
masked_lm_predictions = tf.argmax(masked_lm_log_probs, axis=-1, output_type=tf.int32)
masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
masked_lm_accuracy = tf.metrics.accuracy(
labels=masked_lm_ids,
predictions=masked_lm_predictions,
weights=masked_lm_weights)
masked_lm_mean_loss = tf.metrics.mean(
values=masked_lm_example_loss, weights=masked_lm_weights)
next_sentence_log_probs = tf.reshape(
next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
next_sentence_predictions = tf.argmax(
next_sentence_log_probs, axis=-1, output_type=tf.int32)
next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
next_sentence_accuracy = tf.metrics.accuracy(
labels=next_sentence_labels, predictions=next_sentence_predictions)
next_sentence_mean_loss = tf.metrics.mean(
values=next_sentence_example_loss)
return {
"masked_lm_accuracy": masked_lm_accuracy,
"masked_lm_loss": masked_lm_mean_loss,
"next_sentence_accuracy": next_sentence_accuracy,
"next_sentence_loss": next_sentence_mean_loss,
}
# next_sentence_example_loss=0.0 TODO
# next_sentence_log_probs=0.0 # TODO
eval_metrics = (metric_fn, [
masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, next_sentence_example_loss,
next_sentence_log_probs, next_sentence_labels
])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
return output_spec
return model_fn
def get_masked_lm_output(bert_config, input_tensor, output_weights,project_weights, positions,
label_ids, label_weights):
"""Get loss and log probs for the masked LM."""
input_tensor = gather_indexes(input_tensor, positions)
with tf.variable_scope("cls/predictions"):
# We apply one more non-linear transformation before the output layer.
# This matrix is not used after pre-training.
with tf.variable_scope("transform"):
input_tensor = tf.layers.dense(
input_tensor,
units=bert_config.hidden_size,
activation=modeling.get_activation(bert_config.hidden_act),
kernel_initializer=modeling.create_initializer(
bert_config.initializer_range))
input_tensor = modeling.layer_norm(input_tensor)
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
output_bias = tf.get_variable(
"output_bias",
shape=[bert_config.vocab_size],
initializer=tf.zeros_initializer())
# logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
# input_tensor=[-1,hidden_size], project_weights=[embedding_size, hidden_size], project_weights_transpose=[hidden_size, embedding_size]--->[-1, embedding_size]
input_project = tf.matmul(input_tensor, project_weights, transpose_b=True)
logits = tf.matmul(input_project, output_weights, transpose_b=True)
# # input_project=[-1, embedding_size], output_weights=[vocab_size, embedding_size], output_weights_transpose=[embedding_size, vocab_size] ---> [-1, vocab_size]
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
label_ids = tf.reshape(label_ids, [-1])
label_weights = tf.reshape(label_weights, [-1])
one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
# The `positions` tensor might be zero-padded (if the sequence is too
# short to have the maximum number of predictions). The `label_weights`
# tensor has a value of 1.0 for every real prediction and 0.0 for the
# padding predictions.
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
numerator = tf.reduce_sum(label_weights * per_example_loss)
denominator = tf.reduce_sum(label_weights) + 1e-5
loss = numerator / denominator
return (loss, per_example_loss, log_probs)
def get_next_sentence_output(bert_config, input_tensor, labels):
"""Get loss and log probs for the next sentence prediction."""
# Simple binary classification. Note that 0 is "next sentence" and 1 is
# "random sentence". This weight matrix is not used after pre-training.
with tf.variable_scope("cls/seq_relationship"):
output_weights = tf.get_variable(
"output_weights",
shape=[2, bert_config.hidden_size],
initializer=modeling.create_initializer(bert_config.initializer_range))
output_bias = tf.get_variable(
"output_bias", shape=[2], initializer=tf.zeros_initializer())
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
labels = tf.reshape(labels, [-1])
one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, log_probs)
def gather_indexes(sequence_tensor, positions):
"""Gathers the vectors at the specific positions over a minibatch."""
sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
batch_size = sequence_shape[0]
seq_length = sequence_shape[1]
width = sequence_shape[2]
flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
flat_positions = tf.reshape(positions + flat_offsets, [-1])
flat_sequence_tensor = tf.reshape(sequence_tensor,
[batch_size * seq_length, width])
output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
return output_tensor
def input_fn_builder(input_files,
max_seq_length,
max_predictions_per_seq,
is_training,
num_cpu_threads=4):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
name_to_features = {
"input_ids":
tf.FixedLenFeature([max_seq_length], tf.int64),
"input_mask":
tf.FixedLenFeature([max_seq_length], tf.int64),
"segment_ids":
tf.FixedLenFeature([max_seq_length], tf.int64),
"masked_lm_positions":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_ids":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_weights":
tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
"next_sentence_labels":
tf.FixedLenFeature([1], tf.int64),
}
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
if is_training:
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))
# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
# `sloppy` mode means that the interleaving is not exact. This adds
# even more randomness to the training pipeline.
d = d.apply(
tf.contrib.data.parallel_interleave(
tf.data.TFRecordDataset,
sloppy=is_training,
cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)
else:
d = tf.data.TFRecordDataset(input_files)
# Since we evaluate for a fixed number of steps we don't want to encounter
# out-of-range exceptions.
d = d.repeat()
# We must `drop_remainder` on training because the TPU requires fixed
# size dimensions. For eval, we assume we are evaluating on the CPU or GPU
# and we *don't* want to drop the remainder, otherwise we wont cover
# every sample.
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
num_parallel_batches=num_cpu_threads,
drop_remainder=True))
return d
return input_fn
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if not FLAGS.do_train and not FLAGS.do_eval: # 必须是训练或验证的类型
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) # 从json文件中获得配置信息
tf.gfile.MakeDirs(FLAGS.output_dir)
input_files = [] # 输入可以是多个文件,以“逗号隔开”;可以是一个匹配形式的,如“input_x*”
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Input Files ***")
for input_file in input_files:
tf.logging.info(" %s" % input_file)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( # TODO
tpu=FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
print("###tpu_cluster_resolver:",tpu_cluster_resolver,";FLAGS.use_tpu:",FLAGS.use_tpu,";FLAGS.tpu_name:",FLAGS.tpu_name,";FLAGS.tpu_zone:",FLAGS.tpu_zone)
# ###tpu_cluster_resolver: ;FLAGS.use_tpu: True ;FLAGS.tpu_name: grpc://10.240.1.83:8470
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
keep_checkpoint_max=20, # 10
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
model_fn = model_fn_builder(
bert_config=bert_config,
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=FLAGS.num_train_steps,
num_warmup_steps=FLAGS.num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size)
if FLAGS.do_train:
tf.logging.info("***** Running training *****")
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
train_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
if FLAGS.do_eval:
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
eval_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=False)
result = estimator.evaluate(input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: run_pretraining_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""Run masked LM/next sentence masked_lm pre-training for ALBERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
from six.moves import range
import tensorflow as tf
import modeling_google as modeling
import optimization_google as optimization
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"albert_config_file", None,
"The config json file corresponding to the pre-trained ALBERT model. "
"This specifies the model architecture.")
flags.DEFINE_string(
"input_file", None,
"Input TF example files (can be a glob or comma separated).")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_string(
"export_dir", None,
"The output directory where the saved models will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained ALBERT model).")
flags.DEFINE_integer(
"max_seq_length", 512,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded. Must match data generation.")
flags.DEFINE_integer(
"max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence. "
"Must match data generation.")
flags.DEFINE_bool("do_train", True, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_integer("train_batch_size", 4096, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 64, "Total batch size for eval.")
flags.DEFINE_enum("optimizer", "lamb", ["adamw", "lamb"],
"The optimizer for training.")
flags.DEFINE_float("learning_rate", 0.00176, "The initial learning rate.")
flags.DEFINE_float("poly_power", 1.0, "The power of poly decay.")
flags.DEFINE_integer("num_train_steps", 125000, "Number of training steps.")
flags.DEFINE_integer("num_warmup_steps", 3125, "Number of warmup steps.")
flags.DEFINE_integer("start_warmup_step", 0, "The starting step of warmup.")
flags.DEFINE_integer("save_checkpoints_steps", 5000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
flags.DEFINE_bool("init_from_group0", False, "Whether to initialize"
"parameters of other groups from group 0")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
flags.DEFINE_float(
"masked_lm_budget", 0,
"If >0, the ratio of masked ngrams to unmasked ngrams. Default 0,"
"for offline masking")
def model_fn_builder(albert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings, optimizer, poly_power,
start_warmup_step):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
masked_lm_positions = features["masked_lm_positions"]
masked_lm_ids = features["masked_lm_ids"]
masked_lm_weights = features["masked_lm_weights"]
# Note: We keep this feature name `next_sentence_labels` to be compatible
# with the original data created by lanzhzh@. However, in the ALBERT case
# it does represent sentence_order_labels.
sentence_order_labels = features["next_sentence_labels"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
model = modeling.AlbertModel(
config=albert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
(masked_lm_loss, masked_lm_example_loss,
masked_lm_log_probs) = get_masked_lm_output(albert_config,
model.get_sequence_output(),
model.get_embedding_table(),
masked_lm_positions,
masked_lm_ids,
masked_lm_weights)
(sentence_order_loss, sentence_order_example_loss,
sentence_order_log_probs) = get_sentence_order_output(
albert_config, model.get_pooled_output(), sentence_order_labels)
total_loss = masked_lm_loss + sentence_order_loss
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
tf.logging.info("number of hidden group %d to initialize",
albert_config.num_hidden_groups)
num_of_initialize_group = 1
if FLAGS.init_from_group0:
num_of_initialize_group = albert_config.num_hidden_groups
if albert_config.net_structure_type > 0:
num_of_initialize_group = albert_config.num_hidden_layers
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(
tvars, init_checkpoint, num_of_initialize_group)
if use_tpu:
def tpu_scaffold():
for gid in range(num_of_initialize_group):
tf.logging.info("initialize the %dth layer", gid)
tf.logging.info(assignment_map[gid])
tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid])
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
for gid in range(num_of_initialize_group):
tf.logging.info("initialize the %dth layer", gid)
tf.logging.info(assignment_map[gid])
tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid])
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps,
use_tpu, optimizer, poly_power, start_warmup_step)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(*args):
"""Computes the loss and accuracy of the model."""
(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, sentence_order_example_loss,
sentence_order_log_probs, sentence_order_labels) = args[:7]
masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
[-1, masked_lm_log_probs.shape[-1]])
masked_lm_predictions = tf.argmax(
masked_lm_log_probs, axis=-1, output_type=tf.int32)
masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
masked_lm_accuracy = tf.metrics.accuracy(
labels=masked_lm_ids,
predictions=masked_lm_predictions,
weights=masked_lm_weights)
masked_lm_mean_loss = tf.metrics.mean(
values=masked_lm_example_loss, weights=masked_lm_weights)
metrics = {
"masked_lm_accuracy": masked_lm_accuracy,
"masked_lm_loss": masked_lm_mean_loss,
}
sentence_order_log_probs = tf.reshape(
sentence_order_log_probs, [-1, sentence_order_log_probs.shape[-1]])
sentence_order_predictions = tf.argmax(
sentence_order_log_probs, axis=-1, output_type=tf.int32)
sentence_order_labels = tf.reshape(sentence_order_labels, [-1])
sentence_order_accuracy = tf.metrics.accuracy(
labels=sentence_order_labels,
predictions=sentence_order_predictions)
sentence_order_mean_loss = tf.metrics.mean(
values=sentence_order_example_loss)
metrics.update({
"sentence_order_accuracy": sentence_order_accuracy,
"sentence_order_loss": sentence_order_mean_loss
})
return metrics
metric_values = [
masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, sentence_order_example_loss,
sentence_order_log_probs, sentence_order_labels
]
eval_metrics = (metric_fn, metric_values)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
return output_spec
return model_fn
def get_masked_lm_output(albert_config, input_tensor, output_weights, positions,
label_ids, label_weights):
"""Get loss and log probs for the masked LM."""
input_tensor = gather_indexes(input_tensor, positions)
with tf.variable_scope("cls/predictions"):
# We apply one more non-linear transformation before the output layer.
# This matrix is not used after pre-training.
with tf.variable_scope("transform"):
input_tensor = tf.layers.dense(
input_tensor,
units=albert_config.embedding_size,
activation=modeling.get_activation(albert_config.hidden_act),
kernel_initializer=modeling.create_initializer(
albert_config.initializer_range))
input_tensor = modeling.layer_norm(input_tensor)
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
output_bias = tf.get_variable(
"output_bias",
shape=[albert_config.vocab_size],
initializer=tf.zeros_initializer())
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
label_ids = tf.reshape(label_ids, [-1])
label_weights = tf.reshape(label_weights, [-1])
one_hot_labels = tf.one_hot(
label_ids, depth=albert_config.vocab_size, dtype=tf.float32)
# The `positions` tensor might be zero-padded (if the sequence is too
# short to have the maximum number of predictions). The `label_weights`
# tensor has a value of 1.0 for every real prediction and 0.0 for the
# padding predictions.
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
numerator = tf.reduce_sum(label_weights * per_example_loss)
denominator = tf.reduce_sum(label_weights) + 1e-5
loss = numerator / denominator
return (loss, per_example_loss, log_probs)
def get_sentence_order_output(albert_config, input_tensor, labels):
"""Get loss and log probs for the next sentence prediction."""
# Simple binary classification. Note that 0 is "next sentence" and 1 is
# "random sentence". This weight matrix is not used after pre-training.
with tf.variable_scope("cls/seq_relationship"):
output_weights = tf.get_variable(
"output_weights",
shape=[2, albert_config.hidden_size],
initializer=modeling.create_initializer(
albert_config.initializer_range))
output_bias = tf.get_variable(
"output_bias", shape=[2], initializer=tf.zeros_initializer())
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
labels = tf.reshape(labels, [-1])
one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, log_probs)
def gather_indexes(sequence_tensor, positions):
"""Gathers the vectors at the specific positions over a minibatch."""
sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
batch_size = sequence_shape[0]
seq_length = sequence_shape[1]
width = sequence_shape[2]
flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
flat_positions = tf.reshape(positions + flat_offsets, [-1])
flat_sequence_tensor = tf.reshape(sequence_tensor,
[batch_size * seq_length, width])
output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
return output_tensor
def input_fn_builder(input_files,
max_seq_length,
max_predictions_per_seq,
is_training,
num_cpu_threads=4):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
name_to_features = {
"input_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([max_seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
# Note: We keep this feature name `next_sentence_labels` to be
# compatible with the original data created by lanzhzh@. However, in
# the ALBERT case it does represent sentence_order_labels.
"next_sentence_labels": tf.FixedLenFeature([1], tf.int64),
}
if FLAGS.masked_lm_budget:
name_to_features.update({
"token_boundary":
tf.FixedLenFeature([max_seq_length], tf.int64)})
else:
name_to_features.update({
"masked_lm_positions":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_ids":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_weights":
tf.FixedLenFeature([max_predictions_per_seq], tf.float32)})
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
if is_training:
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))
# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
# `sloppy` mode means that the interleaving is not exact. This adds
# even more randomness to the training pipeline.
d = d.apply(
tf.contrib.data.parallel_interleave(
tf.data.TFRecordDataset,
sloppy=is_training,
cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)
else:
d = tf.data.TFRecordDataset(input_files)
# Since we evaluate for a fixed number of steps we don't want to encounter
# out-of-range exceptions.
d = d.repeat()
# We must `drop_remainder` on training because the TPU requires fixed
# size dimensions. For eval, we assume we are evaluating on the CPU or GPU
# and we *don't* want to drop the remainder, otherwise we wont cover
# every sample.
d = d.apply(
tf.data.experimental.map_and_batch_with_legacy_function(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
num_parallel_batches=num_cpu_threads,
drop_remainder=True))
tf.logging.info(d)
return d
return input_fn
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if not FLAGS.do_train and not FLAGS.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file)
tf.gfile.MakeDirs(FLAGS.output_dir)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Input Files ***")
for input_file in input_files:
tf.logging.info(" %s" % input_file)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
model_fn = model_fn_builder(
albert_config=albert_config,
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=FLAGS.num_train_steps,
num_warmup_steps=FLAGS.num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu,
optimizer=FLAGS.optimizer,
poly_power=FLAGS.poly_power,
start_warmup_step=FLAGS.start_warmup_step)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size)
if FLAGS.do_train:
tf.logging.info("***** Running training *****")
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
train_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
if FLAGS.do_eval:
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
global_step = -1
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
writer = tf.gfile.GFile(output_eval_file, "w")
tf.gfile.MakeDirs(FLAGS.export_dir)
eval_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=False)
while global_step < FLAGS.num_train_steps:
if estimator.latest_checkpoint() is None:
tf.logging.info("No checkpoint found yet. Sleeping.")
time.sleep(1)
else:
result = estimator.evaluate(
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
global_step = result["global_step"]
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("albert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: run_pretraining_google_fast.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
"""Run masked LM/next sentence masked_lm pre-training for ALBERT."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
from six.moves import range
import tensorflow as tf
import modeling_google_fast as modeling
import optimization_google as optimization
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"albert_config_file", None,
"The config json file corresponding to the pre-trained ALBERT model. "
"This specifies the model architecture.")
flags.DEFINE_string(
"input_file", None,
"Input TF example files (can be a glob or comma separated).")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_string(
"export_dir", None,
"The output directory where the saved models will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained ALBERT model).")
flags.DEFINE_integer(
"max_seq_length", 512,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded. Must match data generation.")
flags.DEFINE_integer(
"max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence. "
"Must match data generation.")
flags.DEFINE_bool("do_train", True, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_integer("train_batch_size", 4096, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 64, "Total batch size for eval.")
flags.DEFINE_enum("optimizer", "lamb", ["adamw", "lamb"],
"The optimizer for training.")
flags.DEFINE_float("learning_rate", 0.00176, "The initial learning rate.")
flags.DEFINE_float("poly_power", 1.0, "The power of poly decay.")
flags.DEFINE_integer("num_train_steps", 125000, "Number of training steps.")
flags.DEFINE_integer("num_warmup_steps", 3125, "Number of warmup steps.")
flags.DEFINE_integer("start_warmup_step", 0, "The starting step of warmup.")
flags.DEFINE_integer("save_checkpoints_steps", 5000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.")
flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
flags.DEFINE_bool("init_from_group0", False, "Whether to initialize"
"parameters of other groups from group 0")
tf.flags.DEFINE_string(
"tpu_name", None,
"The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
"url.")
tf.flags.DEFINE_string(
"tpu_zone", None,
"[Optional] GCE zone where the Cloud TPU is located in. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string(
"gcp_project", None,
"[Optional] Project name for the Cloud TPU-enabled project. If not "
"specified, we will attempt to automatically detect the GCE project from "
"metadata.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_integer(
"num_tpu_cores", 8,
"Only used if `use_tpu` is True. Total number of TPU cores to use.")
flags.DEFINE_float(
"masked_lm_budget", 0,
"If >0, the ratio of masked ngrams to unmasked ngrams. Default 0,"
"for offline masking")
def model_fn_builder(albert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings, optimizer, poly_power,
start_warmup_step):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
masked_lm_positions = features["masked_lm_positions"]
masked_lm_ids = features["masked_lm_ids"]
masked_lm_weights = features["masked_lm_weights"]
# Note: We keep this feature name `next_sentence_labels` to be compatible
# with the original data created by lanzhzh@. However, in the ALBERT case
# it does represent sentence_order_labels.
sentence_order_labels = features["next_sentence_labels"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
model = modeling.AlbertModel(
config=albert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
(masked_lm_loss, masked_lm_example_loss,
masked_lm_log_probs) = get_masked_lm_output(albert_config,
model.get_sequence_output(),
model.get_embedding_table(),
masked_lm_positions,
masked_lm_ids,
masked_lm_weights)
(sentence_order_loss, sentence_order_example_loss,
sentence_order_log_probs) = get_sentence_order_output(
albert_config, model.get_pooled_output(), sentence_order_labels)
total_loss = masked_lm_loss + sentence_order_loss
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
tf.logging.info("number of hidden group %d to initialize",
albert_config.num_hidden_groups)
num_of_initialize_group = 1
if FLAGS.init_from_group0:
num_of_initialize_group = albert_config.num_hidden_groups
if albert_config.net_structure_type > 0:
num_of_initialize_group = albert_config.num_hidden_layers
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(
tvars, init_checkpoint, num_of_initialize_group)
if use_tpu:
def tpu_scaffold():
for gid in range(num_of_initialize_group):
tf.logging.info("initialize the %dth layer", gid)
tf.logging.info(assignment_map[gid])
tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid])
return tf.train.Scaffold()
scaffold_fn = tpu_scaffold
else:
for gid in range(num_of_initialize_group):
tf.logging.info("initialize the %dth layer", gid)
tf.logging.info(assignment_map[gid])
tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid])
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps,
use_tpu, optimizer, poly_power, start_warmup_step)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(*args):
"""Computes the loss and accuracy of the model."""
(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, sentence_order_example_loss,
sentence_order_log_probs, sentence_order_labels) = args[:7]
masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
[-1, masked_lm_log_probs.shape[-1]])
masked_lm_predictions = tf.argmax(
masked_lm_log_probs, axis=-1, output_type=tf.int32)
masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
masked_lm_accuracy = tf.metrics.accuracy(
labels=masked_lm_ids,
predictions=masked_lm_predictions,
weights=masked_lm_weights)
masked_lm_mean_loss = tf.metrics.mean(
values=masked_lm_example_loss, weights=masked_lm_weights)
metrics = {
"masked_lm_accuracy": masked_lm_accuracy,
"masked_lm_loss": masked_lm_mean_loss,
}
sentence_order_log_probs = tf.reshape(
sentence_order_log_probs, [-1, sentence_order_log_probs.shape[-1]])
sentence_order_predictions = tf.argmax(
sentence_order_log_probs, axis=-1, output_type=tf.int32)
sentence_order_labels = tf.reshape(sentence_order_labels, [-1])
sentence_order_accuracy = tf.metrics.accuracy(
labels=sentence_order_labels,
predictions=sentence_order_predictions)
sentence_order_mean_loss = tf.metrics.mean(
values=sentence_order_example_loss)
metrics.update({
"sentence_order_accuracy": sentence_order_accuracy,
"sentence_order_loss": sentence_order_mean_loss
})
return metrics
metric_values = [
masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, sentence_order_example_loss,
sentence_order_log_probs, sentence_order_labels
]
eval_metrics = (metric_fn, metric_values)
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
return output_spec
return model_fn
def get_masked_lm_output(albert_config, input_tensor, output_weights, positions,
label_ids, label_weights):
"""Get loss and log probs for the masked LM."""
input_tensor = gather_indexes(input_tensor, positions)
with tf.variable_scope("cls/predictions"):
# We apply one more non-linear transformation before the output layer.
# This matrix is not used after pre-training.
with tf.variable_scope("transform"):
input_tensor = tf.layers.dense(
input_tensor,
units=albert_config.embedding_size,
activation=modeling.get_activation(albert_config.hidden_act),
kernel_initializer=modeling.create_initializer(
albert_config.initializer_range))
input_tensor = modeling.layer_norm(input_tensor)
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
output_bias = tf.get_variable(
"output_bias",
shape=[albert_config.vocab_size],
initializer=tf.zeros_initializer())
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
label_ids = tf.reshape(label_ids, [-1])
label_weights = tf.reshape(label_weights, [-1])
one_hot_labels = tf.one_hot(
label_ids, depth=albert_config.vocab_size, dtype=tf.float32)
# The `positions` tensor might be zero-padded (if the sequence is too
# short to have the maximum number of predictions). The `label_weights`
# tensor has a value of 1.0 for every real prediction and 0.0 for the
# padding predictions.
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
numerator = tf.reduce_sum(label_weights * per_example_loss)
denominator = tf.reduce_sum(label_weights) + 1e-5
loss = numerator / denominator
return (loss, per_example_loss, log_probs)
def get_sentence_order_output(albert_config, input_tensor, labels):
"""Get loss and log probs for the next sentence prediction."""
# Simple binary classification. Note that 0 is "next sentence" and 1 is
# "random sentence". This weight matrix is not used after pre-training.
with tf.variable_scope("cls/seq_relationship"):
output_weights = tf.get_variable(
"output_weights",
shape=[2, albert_config.hidden_size],
initializer=modeling.create_initializer(
albert_config.initializer_range))
output_bias = tf.get_variable(
"output_bias", shape=[2], initializer=tf.zeros_initializer())
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
labels = tf.reshape(labels, [-1])
one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, log_probs)
def gather_indexes(sequence_tensor, positions):
"""Gathers the vectors at the specific positions over a minibatch."""
sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
batch_size = sequence_shape[0]
seq_length = sequence_shape[1]
width = sequence_shape[2]
flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
flat_positions = tf.reshape(positions + flat_offsets, [-1])
flat_sequence_tensor = tf.reshape(sequence_tensor,
[batch_size * seq_length, width])
output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
return output_tensor
def input_fn_builder(input_files,
max_seq_length,
max_predictions_per_seq,
is_training,
num_cpu_threads=4):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
name_to_features = {
"input_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([max_seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
# Note: We keep this feature name `next_sentence_labels` to be
# compatible with the original data created by lanzhzh@. However, in
# the ALBERT case it does represent sentence_order_labels.
"next_sentence_labels": tf.FixedLenFeature([1], tf.int64),
}
if FLAGS.masked_lm_budget:
name_to_features.update({
"token_boundary":
tf.FixedLenFeature([max_seq_length], tf.int64)})
else:
name_to_features.update({
"masked_lm_positions":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_ids":
tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
"masked_lm_weights":
tf.FixedLenFeature([max_predictions_per_seq], tf.float32)})
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
if is_training:
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))
# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
# `sloppy` mode means that the interleaving is not exact. This adds
# even more randomness to the training pipeline.
d = d.apply(
tf.contrib.data.parallel_interleave(
tf.data.TFRecordDataset,
sloppy=is_training,
cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)
else:
d = tf.data.TFRecordDataset(input_files)
# Since we evaluate for a fixed number of steps we don't want to encounter
# out-of-range exceptions.
d = d.repeat()
# We must `drop_remainder` on training because the TPU requires fixed
# size dimensions. For eval, we assume we are evaluating on the CPU or GPU
# and we *don't* want to drop the remainder, otherwise we wont cover
# every sample.
d = d.apply(
tf.data.experimental.map_and_batch_with_legacy_function(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
num_parallel_batches=num_cpu_threads,
drop_remainder=True))
tf.logging.info(d)
return d
return input_fn
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if not FLAGS.do_train and not FLAGS.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file)
tf.gfile.MakeDirs(FLAGS.output_dir)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Input Files ***")
for input_file in input_files:
tf.logging.info(" %s" % input_file)
tpu_cluster_resolver = None
if FLAGS.use_tpu and FLAGS.tpu_name:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
master=FLAGS.master,
model_dir=FLAGS.output_dir,
save_checkpoints_steps=FLAGS.save_checkpoints_steps,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=FLAGS.iterations_per_loop,
num_shards=FLAGS.num_tpu_cores,
per_host_input_for_training=is_per_host))
model_fn = model_fn_builder(
albert_config=albert_config,
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=FLAGS.num_train_steps,
num_warmup_steps=FLAGS.num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu,
optimizer=FLAGS.optimizer,
poly_power=FLAGS.poly_power,
start_warmup_step=FLAGS.start_warmup_step)
# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=FLAGS.use_tpu,
model_fn=model_fn,
config=run_config,
train_batch_size=FLAGS.train_batch_size,
eval_batch_size=FLAGS.eval_batch_size)
if FLAGS.do_train:
tf.logging.info("***** Running training *****")
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
train_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=True)
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
if FLAGS.do_eval:
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
global_step = -1
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
writer = tf.gfile.GFile(output_eval_file, "w")
tf.gfile.MakeDirs(FLAGS.export_dir)
eval_input_fn = input_fn_builder(
input_files=input_files,
max_seq_length=FLAGS.max_seq_length,
max_predictions_per_seq=FLAGS.max_predictions_per_seq,
is_training=False)
while global_step < FLAGS.num_train_steps:
if estimator.latest_checkpoint() is None:
tf.logging.info("No checkpoint found yet. Sleeping.")
time.sleep(1)
else:
result = estimator.evaluate(
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
global_step = result["global_step"]
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if __name__ == "__main__":
flags.mark_flag_as_required("input_file")
flags.mark_flag_as_required("albert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()
================================================
FILE: similarity.py
================================================
"""
进行文本相似度预测的示例。可以直接运行进行预测。
参考了项目:https://github.com/chdd/bert-utils
"""
import tensorflow as tf
import args
import tokenization
import modeling
from run_classifier import InputFeatures, InputExample, DataProcessor, create_model, convert_examples_to_features
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'
class SimProcessor(DataProcessor):
def get_sentence_examples(self, questions):
examples = []
for index, data in enumerate(questions):
guid = 'test-%d' % index
text_a = tokenization.convert_to_unicode(str(data[0]))
text_b = tokenization.convert_to_unicode(str(data[1]))
label = str(0)
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def get_labels(self):
return ['0', '1']
"""
模型类,负责载入checkpoint初始化模型
"""
class BertSim:
def __init__(self, batch_size=args.batch_size):
self.mode = None
self.max_seq_length = args.max_seq_len
self.tokenizer = tokenization.FullTokenizer(vocab_file=args.vocab_file, do_lower_case=True)
self.batch_size = batch_size
self.estimator = None
self.processor = SimProcessor()
tf.logging.set_verbosity(tf.logging.INFO)
#载入estimator,构造模型
def start_model(self):
self.estimator = self.get_estimator()
def model_fn_builder(self, bert_config, num_labels, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps,
use_one_hot_embeddings):
"""Returns `model_fn` closurimport_tfe for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
from tensorflow.python.estimator.model_fn import EstimatorSpec
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
if init_checkpoint:
(assignment_map, initialized_variable_names) \
= modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = EstimatorSpec(mode=mode, predictions=probabilities)
return output_spec
return model_fn
def get_estimator(self):
from tensorflow.python.estimator.estimator import Estimator
from tensorflow.python.estimator.run_config import RunConfig
bert_config = modeling.BertConfig.from_json_file(args.config_name)
label_list = self.processor.get_labels()
if self.mode == tf.estimator.ModeKeys.TRAIN:
init_checkpoint = args.ckpt_name
else:
init_checkpoint = args.output_dir
model_fn = self.model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=init_checkpoint,
learning_rate=args.learning_rate,
num_train_steps=None,
num_warmup_steps=None,
use_one_hot_embeddings=False)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = args.gpu_memory_fraction
config.log_device_placement = False
return Estimator(model_fn=model_fn, config=RunConfig(session_config=config), model_dir=args.output_dir,
params={'batch_size': self.batch_size})
def predict_sentences(self,sentences):
results= self.estimator.predict(input_fn=input_fn_builder(self,sentences), yield_single_examples=False)
#打印预测结果
for i in results:
print(i)
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def convert_single_example(self, ex_index, example, label_list, max_seq_length, tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
def input_fn_builder(bertSim,sentences):
def predict_input_fn():
return (tf.data.Dataset.from_generator(
generate_from_input,
output_types={
'input_ids': tf.int32,
'input_mask': tf.int32,
'segment_ids': tf.int32,
'label_ids': tf.int32},
output_shapes={
'input_ids': (None, bertSim.max_seq_length),
'input_mask': (None, bertSim.max_seq_length),
'segment_ids': (None, bertSim.max_seq_length),
'label_ids': (1,)}).prefetch(10))
def generate_from_input():
processor = bertSim.processor
predict_examples = processor.get_sentence_examples(sentences)
features = convert_examples_to_features(predict_examples, processor.get_labels(), args.max_seq_len,
bertSim.tokenizer)
yield {
'input_ids': [f.input_ids for f in features],
'input_mask': [f.input_mask for f in features],
'segment_ids': [f.segment_ids for f in features],
'label_ids': [f.label_id for f in features]
}
return predict_input_fn
if __name__ == '__main__':
sim = BertSim()
sim.start_model()
sim.predict_sentences([("我喜欢妈妈做的汤", "妈妈做的汤我很喜欢喝")])
================================================
FILE: test_changes.py
================================================
# coding=utf-8
import tensorflow as tf
from modeling import embedding_lookup_factorized,transformer_model
import os
"""
测试albert主要的改进点:词嵌入的因式分解、层间参数共享、段落间连贯性
test main change of albert from bert
"""
batch_size = 2048
sequence_length = 512
vocab_size = 30000
hidden_size = 1024
num_attention_heads = int(hidden_size / 64)
def get_total_parameters():
"""
get total parameters of a graph
:return:
"""
total_parameters = 0
for variable in tf.trainable_variables():
# shape is an array of tf.Dimension
shape = variable.get_shape()
# print(shape)
# print(len(shape))
variable_parameters = 1
for dim in shape:
# print(dim)
variable_parameters *= dim.value
# print(variable_parameters)
total_parameters += variable_parameters
return total_parameters
def test_factorized_embedding():
"""
test of Factorized embedding parameterization
:return:
"""
input_ids=tf.zeros((batch_size, sequence_length),dtype=tf.int32)
output, embedding_table, embedding_table_2=embedding_lookup_factorized(input_ids,vocab_size,hidden_size)
print("output:",output)
def test_share_parameters():
"""
test of share parameters across all layers: how many parameter after share parameter across layers of transformer.
:return:
"""
def total_parameters_transformer(share_parameter_across_layers):
input_tensor=tf.zeros((batch_size, sequence_length, hidden_size),dtype=tf.float32)
print("transformer_model. input:",input_tensor)
transformer_result=transformer_model(input_tensor,hidden_size=hidden_size,num_attention_heads=num_attention_heads,share_parameter_across_layers=share_parameter_across_layers)
print("transformer_result:",transformer_result)
total_parameters=get_total_parameters()
print('total_parameters(not share):',total_parameters)
share_parameter_across_layers=False
total_parameters_transformer(share_parameter_across_layers) # total parameters, not share: 125,976,576 = 125 million
tf.reset_default_graph() # Clears the default graph stack and resets the global default graph
share_parameter_across_layers=True
total_parameters_transformer(share_parameter_across_layers) # total parameters, share: 10,498,048 = 10.5 million
def test_sentence_order_prediction():
"""
sentence order prediction.
check method of create_instances_from_document_albert from create_pretrining_data.py
:return:
"""
# 添加运行权限
os.system("chmod +x create_pretrain_data.sh")
os.system("./create_pretrain_data.sh")
# 1.test of Factorized embedding parameterization
#test_factorized_embedding()
# 2. test of share parameters across all layers: how many parameter after share parameter across layers of transformer.
# before share parameter: 125,976,576; after share parameter:
#test_share_parameters()
# 3. test of sentence order prediction(SOP)
test_sentence_order_prediction()
================================================
FILE: tokenization.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import re
import unicodedata
import six
import tensorflow as tf
def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
"""Checks whether the casing config is consistent with the checkpoint name."""
# The casing has to be passed in by the user and there is no explicit check
# as to whether it matches the checkpoint. The casing information probably
# should have been stored in the bert_config.json file, but it's not, so
# we have to heuristically detect it to validate.
if not init_checkpoint:
return
m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
if m is None:
return
model_name = m.group(1)
lower_models = [
"uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
"multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
]
cased_models = [
"cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
"multi_cased_L-12_H-768_A-12"
]
is_bad_config = False
if model_name in lower_models and not do_lower_case:
is_bad_config = True
actual_flag = "False"
case_name = "lowercased"
opposite_flag = "True"
if model_name in cased_models and do_lower_case:
is_bad_config = True
actual_flag = "True"
case_name = "cased"
opposite_flag = "False"
if is_bad_config:
raise ValueError(
"You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
"However, `%s` seems to be a %s model, so you "
"should pass in `--do_lower_case=%s` so that the fine-tuning matches "
"how the model was pre-training. If this error is wrong, please "
"just comment out this check." % (actual_flag, init_checkpoint,
model_name, case_name, opposite_flag))
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
index = 0
with tf.gfile.GFile(vocab_file, "r") as reader:
while True:
token = convert_to_unicode(reader.readline())
if not token:
break
token = token.strip()
vocab[token] = index
index += 1
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
#print("items:",items) #['[CLS]', '日', '##期', ',', '但', '被', '##告', '金', '##东', '##福', '载', '##明', '[MASK]', 'U', '##N', '##K', ']', '保', '##证', '本', '##月', '1', '##4', '[MASK]', '到', '##位', ',', '2', '##0', '##1', '##5', '年', '6', '[MASK]', '1', '##1', '日', '[', 'U', '##N', '##K', ']', ',', '原', '##告', '[MASK]', '认', '##可', '于', '2', '##0', '##1', '##5', '[MASK]', '6', '月', '[MASK]', '[MASK]', '日', '##向', '被', '##告', '主', '##张', '权', '##利', '。', '而', '[MASK]', '[MASK]', '自', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '年', '6', '月', '1', '##1', '日', '[SEP]', '原', '##告', '于', '2', '##0', '##1', '##6', '[MASK]', '6', '[MASK]', '2', '##4', '日', '起', '##诉', ',', '主', '##张', '保', '##证', '责', '##任', ',', '已', '超', '##过', '保', '##证', '期', '##限', '[MASK]', '保', '##证', '人', '依', '##法', '不', '##再', '承', '##担', '保', '##证', '[MASK]', '[MASK]', '[MASK]', '[SEP]']
for i,item in enumerate(items):
#print(i,"item:",item) # ##期
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically contorl characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat in ("Cc", "Cf"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
================================================
FILE: tokenization_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Lint as: python2, python3
# coding=utf-8
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import re
import unicodedata
import six
from six.moves import range
import tensorflow as tf
import sentencepiece as spm
SPIECE_UNDERLINE = u"▁".encode("utf-8")
def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
"""Checks whether the casing config is consistent with the checkpoint name."""
# The casing has to be passed in by the user and there is no explicit check
# as to whether it matches the checkpoint. The casing information probably
# should have been stored in the bert_config.json file, but it's not, so
# we have to heuristically detect it to validate.
if not init_checkpoint:
return
m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt",
six.ensure_str(init_checkpoint))
if m is None:
return
model_name = m.group(1)
lower_models = [
"uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
"multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
]
cased_models = [
"cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
"multi_cased_L-12_H-768_A-12"
]
is_bad_config = False
if model_name in lower_models and not do_lower_case:
is_bad_config = True
actual_flag = "False"
case_name = "lowercased"
opposite_flag = "True"
if model_name in cased_models and do_lower_case:
is_bad_config = True
actual_flag = "True"
case_name = "cased"
opposite_flag = "False"
if is_bad_config:
raise ValueError(
"You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
"However, `%s` seems to be a %s model, so you "
"should pass in `--do_lower_case=%s` so that the fine-tuning matches "
"how the model was pre-training. If this error is wrong, please "
"just comment out this check." % (actual_flag, init_checkpoint,
model_name, case_name, opposite_flag))
def preprocess_text(inputs, remove_space=True, lower=False):
"""preprocess data by removing extra space and normalize data."""
outputs = inputs
if remove_space:
outputs = " ".join(inputs.strip().split())
if six.PY2 and isinstance(outputs, str):
try:
outputs = six.ensure_text(outputs, "utf-8")
except UnicodeDecodeError:
outputs = six.ensure_text(outputs, "latin-1")
outputs = unicodedata.normalize("NFKD", outputs)
outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
if lower:
outputs = outputs.lower()
return outputs
def encode_pieces(sp_model, text, return_unicode=True, sample=False):
"""turn sentences into word pieces."""
if six.PY2 and isinstance(text, six.text_type):
text = six.ensure_binary(text, "utf-8")
if not sample:
pieces = sp_model.EncodeAsPieces(text)
else:
pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
piece = printable_text(piece)
if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
cur_pieces = sp_model.EncodeAsPieces(
six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
if len(cur_pieces[0]) == 1:
cur_pieces = cur_pieces[1:]
else:
cur_pieces[0] = cur_pieces[0][1:]
cur_pieces.append(piece[-1])
new_pieces.extend(cur_pieces)
else:
new_pieces.append(piece)
# note(zhiliny): convert back to unicode for py2
if six.PY2 and return_unicode:
ret_pieces = []
for piece in new_pieces:
if isinstance(piece, str):
piece = six.ensure_text(piece, "utf-8")
ret_pieces.append(piece)
new_pieces = ret_pieces
return new_pieces
def encode_ids(sp_model, text, sample=False):
pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
ids = [sp_model.PieceToId(piece) for piece in pieces]
return ids
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return six.ensure_text(text, "utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return six.ensure_text(text, "utf-8", "ignore")
elif isinstance(text, six.text_type):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return six.ensure_text(text, "utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, six.text_type):
return six.ensure_binary(text, "utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with tf.gfile.GFile(vocab_file, "r") as reader:
while True:
token = convert_to_unicode(reader.readline())
if not token:
break
token = token.strip() # previous: token.strip().split()[0]
if token not in vocab:
vocab[token] = len(vocab)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
self.vocab = None
self.sp_model = None
print("spm_model_file:",spm_model_file,";vocab_file:",vocab_file)
if spm_model_file:
print("#Use spm_model_file")
self.sp_model = spm.SentencePieceProcessor()
tf.logging.info("loading sentence piece model")
self.sp_model.Load(spm_model_file)
# Note(mingdachen): For the purpose of consisent API, we are
# generating a vocabulary for the sentence piece tokenizer.
self.vocab = {self.sp_model.IdToPiece(i): i for i
in range(self.sp_model.GetPieceSize())}
else:
print("#Use vocab_file")
self.vocab = load_vocab(vocab_file)
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
def tokenize(self, text):
if self.sp_model:
split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
else:
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
if self.sp_model:
tf.logging.info("using sentence piece tokenzier.")
return [self.sp_model.PieceToId(
printable_text(token)) for token in tokens]
else:
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
if self.sp_model:
tf.logging.info("using sentence piece tokenzier.")
return [self.sp_model.IdToPiece(id_) for id_ in ids]
else:
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + six.ensure_str(substr)
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat in ("Cc", "Cf"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False