Repository: brightmart/albert_zh
Branch: master
Commit: 52149e82faf3
Files: 40
Total size: 620.3 KB

Directory structure:
gitextract_fwt0rbxl/
├── README.md
├── albert_config/
│   ├── albert_config_base.json
│   ├── albert_config_base_google_fast.json
│   ├── albert_config_large.json
│   ├── albert_config_small_google.json
│   ├── albert_config_tiny.json
│   ├── albert_config_tiny_google.json
│   ├── albert_config_tiny_google_fast.json
│   ├── albert_config_xlarge.json
│   ├── albert_config_xxlarge.json
│   ├── bert_config.json
│   └── vocab.txt
├── args.py
├── bert_utils.py
├── classifier_utils.py
├── create_pretrain_data.sh
├── create_pretraining_data.py
├── create_pretraining_data_google.py
├── data/
│   └── news_zh_1.txt
├── lamb_optimizer_google.py
├── modeling.py
├── modeling_google.py
├── modeling_google_fast.py
├── optimization.py
├── optimization_finetuning.py
├── optimization_google.py
├── resources/
│   ├── create_pretraining_data_roberta.py
│   └── shell_scripts/
│       └── create_pretrain_data_batch_webtext.sh
├── run_classifier.py
├── run_classifier_clue.py
├── run_classifier_clue.sh
├── run_classifier_lcqmc.sh
├── run_classifier_sp_google.py
├── run_pretraining.py
├── run_pretraining_google.py
├── run_pretraining_google_fast.py
├── similarity.py
├── test_changes.py
├── tokenization.py
└── tokenization_google.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# albert_zh

An implementation of A Lite BERT for Self-Supervised Learning of Language Representations with TensorFlow.

ALBERT is based on BERT, with some improvements. It achieves state-of-the-art performance on the main benchmarks with 30% fewer parameters.

albert_base_zh has only about ten percent of the parameters of the original BERT model, and its main accuracy is retained.
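
Different versions of the ALBERT pre-trained models for Chinese, including TensorFlow, PyTorch and Keras, are available now.

ALBERT models pre-trained on a massive Chinese corpus: fewer parameters, better results. See the article "Pre-trained small models can also take on 13 NLP tasks: ALBERT's three changes top the GLUE benchmark".

clueai toolkit: build a custom NLP API in three lines of code and three minutes (zero-shot learning)

For one-command runs of 10 datasets and 9 baseline models, with a detailed comparison of model results across tasks, see the CLUE benchmark.

Run the CLUE Chinese tasks with one command: 6 Chinese classification or sentence-pair tasks (new)
---------------------------------------------------------------------
Usage:

1. Clone the project

       git clone https://github.com/brightmart/albert_zh.git

2. Run the one-command script (GPU): it automatically downloads the model and all task data and starts running.

       bash run_classifier_clue.sh

   Running this script automatically downloads all task data, finds the best model for every task, and then runs the test set to produce the submission results.

Download Pre-trained Models of Chinese
-----------------------------------------------
1. albert_tiny_zh, albert_tiny_zh (trained longer, 2 billion training examples seen in total); file size 16M, 4M parameters

   Training and inference are about 10 times faster, accuracy is largely retained, and the model is 1/25 the size of BERT; it reaches 85.4% on the test set of the semantic similarity dataset LCQMC, only 1.5 points below bert_base.

   LCQMC training used the following parameters: --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5

   albert_tiny uses the same large-scale Chinese corpus, with only 4 layers and much smaller vector dimensions such as the hidden size; try the following learning rates for better results: {2e-5, 6e-5, 1e-4}

   [Use cases] Relatively simple tasks or tasks with strict latency requirements, such as sentence-pair tasks like semantic similarity, and classification tasks; for harder tasks such as reading comprehension, use one of the larger models.

   For example, you can deploy on mobile devices with [Tensorflow Lite](https://www.tensorflow.org/lite); this is covered [later](#use_tflite) in this document, including how to convert the model to Tensorflow Lite format and how to benchmark it.

   Run albert_tiny_zh with one command (Linux, LCQMC task):

       1) git clone https://github.com/brightmart/albert_zh
       2) cd albert_zh
       3) bash run_classifier_lcqmc.sh

   1.1. albert_tiny_google_zh (1 billion training examples seen in total, Google version); model size 16M, performance on par with albert_tiny_zh

   1.2. albert_small_google_zh (1 billion training examples seen in total, Google version); 4 times faster than bert_base; only 0.9 points below BERT on the LCQMC test set; model size 18.5M after removing the Adam variables; for usage, see the section "Fine-tuning on Downstream Task"

2. albert_large_zh: parameter count, 24 layers, file size 64M

   Parameter count and model size are one sixth of bert_base; on the test set of LCQMC, a similarity dataset of colloquial descriptions, it is 0.2 points above bert_base.

3. albert_base_zh (trained on an additional 150 million instances, i.e. 36k steps * batch_size 4096); albert_base_zh (small-model trial version); 12M parameters, 12 layers, size 40M

   Parameter count is one tenth of bert_base, and so is the model size; on the LCQMC test set it is about 0.6 to 1 point below bert_base; compared with no pre-training, albert_base gains 14 points.

4. albert_xlarge_zh_177k; albert_xlarge_zh_183k (try this one first): parameter count, 24 layers, file size 230M

   Parameter count and model size are one half of bert_base; a large GPU is required; a full comparison will be added later; batch_size should not be too small, or accuracy may suffer.

### Quick Loading

Built on [Huggingface-Transformers 2.2.2](https://github.com/huggingface/transformers), the models above can be loaded easily.

```
# Requires the transformers package
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
```

The values of `MODEL_NAME` are listed below:

| Model | MODEL_NAME |
| - | - |
| albert_tiny_google_zh | voidful/albert_chinese_tiny |
| albert_small_google_zh | voidful/albert_chinese_small |
| albert_base_zh (from google) | voidful/albert_chinese_base |
| albert_large_zh (from google) | voidful/albert_chinese_large |
| albert_xlarge_zh (from google) | voidful/albert_chinese_xlarge |
| albert_xxlarge_zh (from google) | voidful/albert_chinese_xxlarge |

More examples of using ALBERT through transformers

Pre-training
-----------------------------------------------

#### Generate tfrecords Files

Run the following command. The project includes a sample text file (data/news_zh_1.txt).

    bash create_pretrain_data.sh

If you have many text files, you can pass arguments to generate multiple tfrecords files.

###### Support English and Other Non-Chinese Languages:
If you are pre-training for English or another non-Chinese language, set the hyperparameter non_chinese to True in create_pretraining_data.py; otherwise, by default, it pre-trains for Chinese using Chinese whole word masking.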
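
To make the whole word masking mentioned above more concrete, here is a minimal sketch of the idea. It is not the code in create_pretraining_data.py: the function name `whole_word_mask`, its parameters, and the assumption that the text has already been split into words by a segmenter are all illustrative. It also leaves out details of BERT-style masking such as the random/keep replacement split and the cap on predictions per sequence. The only point it shows is that when a word is selected, every character of that word is masked together.

```python
import random


def whole_word_mask(words, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: all characters of a selected word are replaced.

    `words` is a list of already-segmented words (assumption: a word
    segmenter has been run beforehand). Returns (tokens, labels), where
    labels[i] holds the original character at masked positions and None
    everywhere else.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    tokens, labels = [], []
    for word in words:
        # One decision per word, applied to each of its characters.
        masked = rng.random() < mask_prob
        for ch in word:
            tokens.append(mask_token if masked else ch)
            labels.append(ch if masked else None)
    return tokens, labels


# Hypothetical usage: a segmented sentence, one entry per word.
words = ["我", "喜欢", "妈妈", "做", "的", "汤"]
tokens, labels = whole_word_mask(words, mask_prob=0.5, seed=1)
```

With character-level masking, "喜" could be masked while "欢" stays visible; with the per-word decision above, the two characters of "喜欢" are always masked or kept together.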

#### Run pre-training on GPU/TPU using the command

GPU (brightmart version, tiny model):

    export BERT_BASE_DIR=./albert_tiny_zh
    nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
    --output_dir=./my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_tiny.json \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

GPU (Google version, small model):

    export BERT_BASE_DIR=./albert_small_zh_google
    nohup python3 run_pretraining_google.py --input_file=./data/tf*.tfrecord --eval_batch_size=64 \
    --output_dir=./my_new_model_path --do_train=True --do_eval=True --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json --export_dir=./my_new_model_path_export \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=20 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt

TPU, add something like this:

    --use_tpu=True --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a

Note: if you train from scratch, you can omit init_checkpoint; if you continue training from an existing model, set the BERT_BASE_DIR path and make sure the values of bert_config_file and init_checkpoint point to the corresponding files; for domain-specific pre-training, depending on the amount of data, you may not need to train for very long.

Environment
-----------------------------------------------
Use Python3 + Tensorflow 1.x

e.g. Tensorflow 1.4 or 1.5

Fine-tuning on Downstream Task
-----------------------------------------------
##### Using TensorFlow:

We use albert_base on the LCQMC task as an example. LCQMC is a corpus of colloquial language; it is used to train and predict the semantic similarity of a pair of sentences.

Download the LCQMC dataset, which contains training, validation and test sets. The training set contains 240,000 pairs of colloquial Chinese sentences, labeled 1 or 0: 1 means the two sentences are semantically similar, 0 means they are not.

Fine-tune on the LCQMC dataset by running the following commands:

1. Clone this project:

       git clone https://github.com/brightmart/albert_zh.git

2. Fine-tune by running the following command.

   brightmart version, tiny model:

       export BERT_BASE_DIR=./albert_tiny_zh
       export TEXT_DIR=./lcqmc
       nohup python3 run_classifier.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
       --bert_config_file=./albert_config/albert_config_tiny.json --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5 \
       --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

   Google version, small model:

       export BERT_BASE_DIR=./albert_small_zh
       export TEXT_DIR=./lcqmc
       nohup python3 run_classifier_sp_google.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
       --albert_config_file=./$BERT_BASE_DIR/albert_config_small_google.json --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5 \
       --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

Notice:

1) You need to download the pre-trained Chinese ALBERT model and place it in the current project, assuming the directory is named albert_tiny_zh; you also need to download the LCQMC dataset and place it in the current project, assuming the dataset directory is named lcqmc.

2) For fine-tuning, you can try adding a small amount of dropout (e.g. 0.1) by changing the parameters attention_probs_dropout_prob and hidden_dropout_prob in albert_config_xxx.json. By default, dropout is set to zero.

3) You can try different learning rates {2e-5, 6e-5, 1e-4} for better performance.

Updates
-----------------------------------------------
**\*\*\*\*\* 2019-11-03: add Google versions of albert_small and albert_tiny; add a method to deploy albert_tiny to mobile devices with only 0.1 second inference time for sequence length 128, 60M memory \*\*\*\*\***

**\*\*\*\*\* 2019-10-30: add a simple guide about converting the model to Tensorflow Lite for edge deployment \*\*\*\*\***

**\*\*\*\*\* 2019-10-15: albert_tiny_zh, 10 times faster than bert_base for training and inference, accuracy retained \*\*\*\*\***

**\*\*\*\*\* 2019-10-07: more models of albert \*\*\*\*\***

Added albert_xlarge_zh and albert_base_zh_additional_steps (trained with more instances).

**\*\*\*\*\* 2019-10-04: PyTorch and Keras versions of albert are supported \*\*\*\*\***

a. Convert to the PyTorch version and run your tasks through albert_pytorch

b. Load the pre-trained model with Keras in one line of code through bert4keras

c. Use albert with TensorFlow 2.0: use or load the pre-trained model with tf2.0 through bert-for-tf2

Releasing albert_xlarge on 6th Oct

**\*\*\*\*\* 2019-10-02: albert_large_zh, albert_base_zh \*\*\*\*\***

Released albert_base_zh with only 10% of the parameters of bert_base, a small model (40M) whose training can be very fast.

Released albert_large_zh with only 16% of the parameters of bert_base (64M).

**\*\*\*\*\* 2019-09-28: code and test functions \*\*\*\*\***

Added code and test functions for the three main changes of ALBERT from BERT.

Introduction of ALBERT
-----------------------------------------------
ALBERT is an improved version of BERT. Unlike other recent state-of-the-art models, this time it is a smaller pre-trained model, with better results and fewer parameters.

Three main changes of ALBERT from BERT:

1) Factorized embedding parameterization

     O(V * H) to O(V * E + E * H)

   Taking ALBERT_xxlarge as an example, with V=30000, H=4096, E=128: the original parameter count is V * H = 30000 * 4096 = 123 million, and it is now V * E + E * H = 30000*128 + 128*4096 = 3.84 million + 0.52 million = 4.36 million. The embedding-related parameter count before the change is about 28 times the count after it.

2) Cross-layer parameter sharing

   Parameter sharing significantly reduces the number of parameters. Sharing can be applied to the feed-forward layers and to the attention layers; sharing the attention-layer parameters hurts performance less.

3) Inter-sentence coherence loss

   A sentence-order task is used. Positive examples are two consecutive text segments from the same document; negative examples are the same two consecutive segments with their order swapped.

   This avoids the original NSP task, which implicitly includes overly easy sub-tasks such as topic prediction.

     We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss
     based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic
     prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the
     same technique as BERT (two consecutive segments from the same document), and as negative examples the same two
     consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about
     discourse-level coherence properties.

Other changes:

1) Remove dropout to enlarge the capacity of the model.

   Even after training for 1 million steps, the largest model still did not overfit the training data. This suggests the model capacity could be larger, so dropout was removed (dropout can be thought of as randomly removing part of the network, which also makes the network somewhat smaller).

     We also note that, even after training for 1M steps, our largest models still do not overfit to their training data.
     As a result, we decide to remove dropout to further increase our model capacity.

   For the other model sizes, our implementation still keeps the original dropout rate, to prevent the model from overfitting the training data.

2) Use LAMB as the optimizer, to speed up training with a big batch size.

   A large batch_size (4096) was used for training. The LAMB optimizer makes it possible to train with very large batch sizes, up to about 60,000.

3) Use n-grams (uni-gram, bi-gram, tri-gram) for the masked language model.

   N-grams are used with different probabilities: uni-gram is the most likely, bi-gram next, and tri-gram the least likely.

   This project currently uses whole word masking for Chinese; a comparison with n-gram masking will be added later. N-gram masking comes from SpanBERT.

Training Data & Configuration
-----------------------------------------------
30 GB of Chinese text, more than 10 billion Chinese characters, drawn from several encyclopedias, news sources and interactive communities.

The pre-training sequence_length is set to 512 and the batch_size to 4096; 350 million training instances were generated. Each model is trained for 125k steps by default; albert_xxlarge will be trained longer.

For comparison, roberta_zh pre-training generated 250 million training instances with a sequence length of 256. Because albert_zh pre-training generates more training instances and uses a longer sequence length, we expect albert_zh to perform better than roberta_zh and to handle longer texts better.

Training uses a TPU v3 Pod; we use v3-256, which consists of 32 v3-8 units. Each v3-8 machine has 128 GB of memory.

Performance and Comparison (English)
-----------------------------------------------

Performance on Chinese datasets
-----------------------------------------------

### Question Matching Task: LCQMC (Sentence Pair Matching)

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 89.4(88.4) | 86.9(86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm |89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | - |- |
| RoBERTa-zh-base | 88.7 | 87.0 |
| RoBERTa-zh-Large | ***89.9(89.6)*** | 87.2(86.7) |
| RoBERTa-zh-Large(20w_steps) | 89.7| 87.0 |
| ALBERT-zh-tiny | -- | 85.4 |
| ALBERT-zh-small | -- | 86.0 |
| ALBERT-zh-small(Pytorch) | -- | 86.8 |
| ALBERT-zh-base-additional-36k-steps | 87.8 | 86.3 |
| ALBERT-zh-base | 87.2 | 86.3 |
| ALBERT-large | 88.7 | 87.1 |
| ALBERT-xlarge | 87.3 | ***87.7*** |

Note: ALBERT-xlarge was run only once; results may still improve.

### Natural Language Inference: XNLI of Chinese Version

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| XLNet | 79.2 | 78.7 |
| RoBERTa-zh-base | 79.8 |78.8 |
| RoBERTa-zh-Large | 80.2 (80.0) | 79.9 (79.5) |
| ALBERT-base | 77.0 | 77.1 |
| ALBERT-large | 78.0 | 77.5 |
| ALBERT-xlarge | ? | ? |

Note: BERT-wwm-ext results come from here; XLNet results come from here; RoBERTa-zh-base refers to the 12-layer Chinese RoBERTa model.

### Reading Comprehension Task: CMRC 2018

### Masked Language Model Accuracy, Sentence-Order Prediction Accuracy & Training Time

| Model | MLM eval acc | SOP eval acc | Training(Hours) | Loss eval |
| :------- | :---------: | :---------: | :---------: |:---------: |
| albert_zh_base | 79.1% | 99.0% | 6h | 1.01|
| albert_zh_large | 80.9% | 98.6% | 22.5h | 0.93|
| albert_zh_xlarge | ? | ? | 53h (estimated) | ? |
| albert_zh_xxlarge | ? | ? | 106h (estimated) | ? |

Note: ? will be replaced soon.

Configuration of Models
-----------------------------------------------

Implementation and Code Testing
-----------------------------------------------
Run the following command to test the main changes, including but not limited to factorized embedding parameterization, cross-layer parameter sharing, and the inter-sentence coherence task.

    python test_changes.py

##### Deploying on mobile devices with TensorFlow Lite (TFLite):

Here we mainly cover TFLite model conversion and benchmarking. After converting to a TFLite model, for how to use it on mobile devices, see the [complete Android/iOS app development examples and tutorials](https://www.tensorflow.org/lite/examples) provided by TFLite. That page currently includes two Android examples: [text classification](https://github.com/tensorflow/examples/blob/master/lite/examples/text_classification/android) and [question answering](https://github.com/tensorflow/examples/blob/master/lite/examples/bert_qa/android).

Below we use albert_tiny_zh as an example to walk through TFLite model conversion and benchmarking:

1. Freeze graph from the checkpoint

Make sure a TensorFlow 1.x version >= 1.14 is installed to use the freeze_graph tool, as it is removed from the 2.x distribution.

    pip install tensorflow==1.15

    freeze_graph --input_checkpoint=./albert_model.ckpt \
      --output_graph=/tmp/albert_tiny_zh.pb \
      --output_node_names=cls/predictions/truediv \
      --checkpoint_version=1 --input_meta_graph=./albert_model.ckpt.meta --input_binary=true

2. Convert to TFLite format

We are going to use the new experimental tf->tflite converter that's distributed with the Tensorflow nightly build.

    pip install tf-nightly

    tflite_convert --graph_def_file=/tmp/albert_tiny_zh.pb \
      --input_arrays='input_ids,input_mask,segment_ids,masked_lm_positions,masked_lm_ids,masked_lm_weights' \
      --output_arrays='cls/predictions/truediv' \
      --input_shapes=1,128:1,128:128:1,128:1,128:1,128 \
      --output_file=/tmp/albert_tiny_zh.tflite \
      --enable_v1_converter --experimental_new_converter

3. Benchmark the performance of the TFLite model

See [here](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark) for details about the performance benchmark tools in TFLite. For example, after building the benchmark tool binary for an Android phone, do the following to get an idea of how the TFLite model performs on the phone:

    adb push /tmp/albert_tiny_zh.tflite /data/local/tmp/
    adb shell /data/local/tmp/benchmark_model_performance_options --graph=/data/local/tmp/albert_tiny_zh.tflite --perf_options_list=cpu

On an Android phone with Qualcomm's SD845 SoC, via the above benchmark tool, as of 2019/11/01, the inference latency is ~120ms with this converted TFLite model using 4 threads on CPU, and the memory usage is ~60MB for the model during inference. Note that the performance will improve further with future TFLite implementation optimizations.
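
The `--input_arrays` and `--input_shapes` flags in step 2 have to line up one to one: the names are comma-separated, and the shapes are a colon-separated list of comma-separated integers. As a quick sanity check before running the converter, the small sketch below pairs each name with its shape. The helper name `pair_inputs_with_shapes` is hypothetical and not part of TensorFlow or of this repository; it only parses the two strings.

```python
def pair_inputs_with_shapes(input_arrays, input_shapes):
    """Map each input tensor name to the shape declared for it.

    input_arrays: comma-separated tensor names.
    input_shapes: colon-separated shapes, each a comma-separated int list.
    """
    names = [name.strip() for name in input_arrays.split(",")]
    shapes = [[int(dim) for dim in shape.split(",")]
              for shape in input_shapes.split(":")]
    if len(names) != len(shapes):
        raise ValueError(
            "got %d input names but %d shapes" % (len(names), len(shapes)))
    return dict(zip(names, shapes))


# The exact flag values from the tflite_convert command above.
pairs = pair_inputs_with_shapes(
    "input_ids,input_mask,segment_ids,masked_lm_positions,"
    "masked_lm_ids,masked_lm_weights",
    "1,128:1,128:128:1,128:1,128:1,128",
)
for name, shape in pairs.items():
    print(name, shape)
```

Printing the pairs makes it easy to see which shape each input receives; for instance, with the values above `segment_ids` is declared as `128` while the other inputs are `1,128`.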

##### Using the PyTorch version:

Download the pre-trained model and convert it to PyTorch using:

    python convert_albert_tf_checkpoint_to_pytorch.py

using albert_pytorch

##### Loading with Keras:

bert4keras supports ALBERT and can load the albert_zh weights; just pass albert=True to the load_pretrained_model function.

Load the pre-trained model with bert4keras

##### Loading with tf2.0:

bert-for-tf2

Use Case-Text Similarity Based on User Input
-------------------------------------------------
Description: this example shows how to load the trained model and judge the similarity of short texts based on user input. The code can be flexibly extended into a backend service, or adapted for other examples such as text classification.

Code involved: similarity.py, args.py

Steps:

1. Train this model on text similarity and save the model files to the corresponding directory.

2. Adjust the parameters in args.py to your setup. The parameters are described below:

```python
# Model directory, where the ckpt files are stored
model_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
# Config file: the model's JSON configuration
config_name = os.path.join(file_path, 'albert_config/albert_config_tiny.json')
# Name of the ckpt file
ckpt_name = os.path.join(model_dir, 'model.ckpt')
# Output directory: where the model is written during training
output_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
# Path to the vocab file
vocab_file = os.path.join(file_path, 'albert_config/vocab.txt')
# Data directory: where the training dataset is stored
data_dir = os.path.join(file_path, 'data/')
```

The file structure in this example is:

    |__args.py
    |__similarity.py
    |__data
    |__albert_config
    |__albert_lcqmc_checkpoints
    |__lcqmc

3. Change the input sentences

Open similarity.py; the code at the very bottom is:

```python
if __name__ == '__main__':
    sim = BertSim()
    sim.start_model()
    sim.predict_sentences([("我喜欢妈妈做的汤", "妈妈做的汤我很喜欢喝")])
```

Here sim.start_model() loads the model, and sim.predict_sentences takes a list of tuples, each containing the two sentences whose similarity is to be judged.

4. Run the Python file: similarity.py

Trade-off Between Batch Size and Sequence Length (12 GB GPU Memory)
-------------------------------------------------

System | Seq Length | Max Batch Size
------------ | ---------- | --------------
`albert-base` | 64 | 64
... | 128 | 32
... | 256 | 16
... | 320 | 14
... | 384 | 12
... | 512 | 6
`albert-large` | 64 | 12
... | 128 | 6
... | 256 | 2
... | 320 | 1
... | 384 | 0
... | 512 | 0
`albert-xlarge` | - | -

Training Loss of xlarge of albert_zh
-------------------------------------------------

Parameters of albert_xlarge
-------------------------------------------------

#### QQ group for technical exchange and questions: 836811304

If you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com.

How to use the PyTorch version of albert is not clear yet; if you know how to do that, just email us or open an issue.

You can also send a pull request to report your performance on your task, or to add methods for loading the models in PyTorch and so on.

If you have ideas for producing the best-performing pre-trained Chinese model, please also let me know.

##### Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Cite Us
-----------------------------------------------
Bright Liang Xu, albert_zh, (2019), GitHub repository, https://github.com/brightmart/albert_zh

Reference
-----------------------------------------------
1. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3. SpanBERT: Improving Pre-training by Representing and Predicting Spans

4. RoBERTa: A Robustly Optimized BERT Pretraining Approach

5. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (LAMB)

6. LAMB Optimizer, TensorFlow version

7. Pre-trained small models can also take on 13 NLP tasks: ALBERT's three changes top the GLUE benchmark

8. albert_pytorch

9. load albert with keras

10. load albert with tf2.0

11. repo of albert from google

12. chineseGLUE: benchmark evaluation for Chinese tasks, with multiple publicly available tasks, baseline models, extensive evaluation and performance comparison

================================================
FILE: albert_config/albert_config_base.json
================================================
{ "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "embedding_size": 128, "initializer_range": 0.02, "intermediate_size": 3072 , "max_position_embeddings": 512,
"num_attention_heads": 12, "num_hidden_layers": 12, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128, "ln_type":"postln" } ================================================ FILE: albert_config/albert_config_base_google_fast.json ================================================ { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "embedding_size": 128, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "num_hidden_groups": 12, "net_structure_type": 0, "gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 21128 } ================================================ FILE: albert_config/albert_config_large.json ================================================ { "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 1024, "embedding_size": 128, "initializer_range": 0.02, "intermediate_size": 4096, "max_position_embeddings": 512, "num_attention_heads": 16, "num_hidden_layers": 24, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128, "ln_type":"postln" } ================================================ FILE: albert_config/albert_config_small_google.json ================================================ { "attention_probs_dropout_prob": 0.0, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "embedding_size": 128, "hidden_size": 384, "initializer_range": 0.02, "intermediate_size": 1536, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 6, "num_hidden_groups": 1, "net_structure_type": 0, 
"gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 21128 } ================================================ FILE: albert_config/albert_config_tiny.json ================================================ { "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 312, "embedding_size": 128, "initializer_range": 0.02, "intermediate_size": 1248 , "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 4, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128, "ln_type":"postln" } ================================================ FILE: albert_config/albert_config_tiny_google.json ================================================ { "attention_probs_dropout_prob": 0.0, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "embedding_size": 128, "hidden_size": 312, "initializer_range": 0.02, "intermediate_size": 1248, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 4, "num_hidden_groups": 1, "net_structure_type": 0, "gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 21128 } ================================================ FILE: albert_config/albert_config_tiny_google_fast.json ================================================ { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "embedding_size": 128, "hidden_size": 336, "initializer_range": 0.02, "intermediate_size": 1344, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 4, "num_hidden_groups": 12, "net_structure_type": 0, "gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 21128 } 
================================================ FILE: albert_config/albert_config_xlarge.json ================================================ { "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 2048, "embedding_size": 128, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 512, "num_attention_heads": 32, "num_hidden_layers": 24, "pooler_fc_size": 1024, "pooler_num_attention_heads": 64, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128, "ln_type":"postln" } ================================================ FILE: albert_config/albert_config_xxlarge.json ================================================ { "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 4096, "embedding_size": 128, "initializer_range": 0.02, "intermediate_size": 16384, "max_position_embeddings": 512, "num_attention_heads": 64, "num_hidden_layers": 12, "pooler_fc_size": 1024, "pooler_num_attention_heads": 64, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128, "ln_type":"preln" } ================================================ FILE: albert_config/bert_config.json ================================================ { "attention_probs_dropout_prob": 0.0, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "type_vocab_size": 2, "vocab_size": 21128 } ================================================ FILE: 
albert_config/vocab.txt ================================================ [PAD] [unused1] [unused2] [unused3] [unused4] [unused5] [unused6] [unused7] [unused8] [unused9] [unused10] [unused11] [unused12] [unused13] [unused14] [unused15] [unused16] [unused17] [unused18] [unused19] [unused20] [unused21] [unused22] [unused23] [unused24] [unused25] [unused26] [unused27] [unused28] [unused29] [unused30] [unused31] [unused32] [unused33] [unused34] [unused35] [unused36] [unused37] [unused38] [unused39] [unused40] [unused41] [unused42] [unused43] [unused44] [unused45] [unused46] [unused47] [unused48] [unused49] [unused50] [unused51] [unused52] [unused53] [unused54] [unused55] [unused56] [unused57] [unused58] [unused59] [unused60] [unused61] [unused62] [unused63] [unused64] [unused65] [unused66] [unused67] [unused68] [unused69] [unused70] [unused71] [unused72] [unused73] [unused74] [unused75] [unused76] [unused77] [unused78] [unused79] [unused80] [unused81] [unused82] [unused83] [unused84] [unused85] [unused86] [unused87] [unused88] [unused89] [unused90] [unused91] [unused92] [unused93] [unused94] [unused95] [unused96] [unused97] [unused98] [unused99] [UNK] [CLS] [SEP] [MASK] ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ £ ¤ ¥ § © « ® ° ± ² ³ µ · ¹ º » ¼ × ß æ ÷ ø đ ŋ ɔ ə ɡ ʰ ˇ ˈ ˊ ˋ ˍ ː ˙ ˚ ˢ α β γ δ ε η θ ι κ λ μ ν ο π ρ ς σ τ υ φ χ ψ ω а б в г д е ж з и к л м н о п р с т у ф х ц ч ш ы ь я і ا ب ة ت د ر س ع ل م ن ه و ي ۩ ก ง น ม ย ร อ า เ ๑ ་ ღ ᄀ ᄁ ᄂ ᄃ ᄅ ᄆ ᄇ ᄈ ᄉ ᄋ ᄌ ᄎ ᄏ ᄐ ᄑ ᄒ ᅡ ᅢ ᅣ ᅥ ᅦ ᅧ ᅨ ᅩ ᅪ ᅬ ᅭ ᅮ ᅯ ᅲ ᅳ ᅴ ᅵ ᆨ ᆫ ᆯ ᆷ ᆸ ᆺ ᆻ ᆼ ᗜ ᵃ ᵉ ᵍ ᵏ ᵐ ᵒ ᵘ ‖ „ † • ‥ ‧ 
 ‰ ′ ″ ‹ › ※ ‿ ⁄ ⁱ ⁺ ⁿ ₁ ₂ ₃ ₄ € ℃ № ™ ⅰ ⅱ ⅲ ⅳ ⅴ ← ↑ → ↓ ↔ ↗ ↘ ⇒ ∀ − ∕ ∙ √ ∞ ∟ ∠ ∣ ∥ ∩ ∮ ∶ ∼ ∽ ≈ ≒ ≡ ≤ ≥ ≦ ≧ ≪ ≫ ⊙ ⋅ ⋈ ⋯ ⌒ ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑴ ⑵ ⑶ ⑷ ⑸ ⒈ ⒉ ⒊ ⒋ ⓒ ⓔ ⓘ ─ ━ │ ┃ ┅ ┆ ┊ ┌ └ ├ ┣ ═ ║ ╚ ╞ ╠ ╭ ╮ ╯ ╰ ╱ ╳ ▂ ▃ ▅ ▇ █ ▉ ▋ ▌ ▍ ▎ ■ □ ▪ ▫ ▬ ▲ △ ▶ ► ▼ ▽ ◆ ◇ ○ ◎ ● ◕ ◠ ◢ ◤ ☀ ★ ☆ ☕ ☞ ☺ ☼ ♀ ♂ ♠ ♡ ♣ ♥ ♦ ♪ ♫ ♬ ✈ ✔ ✕ ✖ ✦ ✨ ✪ ✰ ✿ ❀ ❤ ➜ ➤ ⦿ 、 。 〃 々 〇 〈 〉 《 》 「 」 『 』 【 】 〓 〔 〕 〖 〗 〜 〝 〞 ぁ あ ぃ い う ぇ え お か き く け こ さ し す せ そ た ち っ つ て と な に ぬ ね の は ひ ふ へ ほ ま み む め も ゃ や ゅ ゆ ょ よ ら り る れ ろ わ を ん ゜ ゝ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ キ ク ケ コ サ シ ス セ ソ タ チ ッ ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ャ ヤ ュ ユ ョ ヨ ラ リ ル レ ロ ワ ヲ ン ヶ ・ ー ヽ ㄅ ㄆ ㄇ ㄉ ㄋ ㄌ ㄍ ㄎ ㄏ ㄒ ㄚ ㄛ ㄞ ㄟ ㄢ ㄤ ㄥ ㄧ ㄨ ㆍ ㈦ ㊣ ㎡ 㗎 一 丁 七 万 丈 三 上 下 不 与 丐 丑 专 且 丕 世 丘 丙 业 丛 东 丝 丞 丟 両 丢 两 严 並 丧 丨 个 丫 中 丰 串 临 丶 丸 丹 为 主 丼 丽 举 丿 乂 乃 久 么 义 之 乌 乍 乎 乏 乐 乒 乓 乔 乖 乗 乘 乙 乜 九 乞 也 习 乡 书 乩 买 乱 乳 乾 亀 亂 了 予 争 事 二 于 亏 云 互 五 井 亘 亙 亚 些 亜 亞 亟 亡 亢 交 亥 亦 产 亨 亩 享 京 亭 亮 亲 亳 亵 人 亿 什 仁 仃 仄 仅 仆 仇 今 介 仍 从 仏 仑 仓 仔 仕 他 仗 付 仙 仝 仞 仟 代 令 以 仨 仪 们 仮 仰 仲 件 价 任 份 仿 企 伉 伊 伍 伎 伏 伐 休 伕 众 优 伙 会 伝 伞 伟 传 伢 伤 伦 伪 伫 伯 估 伴 伶 伸 伺 似 伽 佃 但 佇 佈 位 低 住 佐 佑 体 佔 何 佗 佘 余 佚 佛 作 佝 佞 佟 你 佢 佣 佤 佥 佩 佬 佯 佰 佳 併 佶 佻 佼 使 侃 侄 來 侈 例 侍 侏 侑 侖 侗 供 依 侠 価 侣 侥 侦 侧 侨 侬 侮 侯 侵 侶 侷 便 係 促 俄 俊 俎 俏 俐 俑 俗 俘 俚 保 俞 俟 俠 信 俨 俩 俪 俬 俭 修 俯 俱 俳 俸 俺 俾 倆 倉 個 倌 倍 倏 們 倒 倔 倖 倘 候 倚 倜 借 倡 値 倦 倩 倪 倫 倬 倭 倶 债 值 倾 偃 假 偈 偉 偌 偎 偏 偕 做 停 健 側 偵 偶 偷 偻 偽 偿 傀 傅 傍 傑 傘 備 傚 傢 傣 傥 储 傩 催 傭 傲 傳 債 傷 傻 傾 僅 働 像 僑 僕 僖 僚 僥 僧 僭 僮 僱 僵 價 僻 儀 儂 億 儆 儉 儋 儒 儕 儘 償 儡 優 儲 儷 儼 儿 兀 允 元 兄 充 兆 兇 先 光 克 兌 免 児 兑 兒 兔 兖 党 兜 兢 入 內 全 兩 八 公 六 兮 兰 共 兲 关 兴 兵 其 具 典 兹 养 兼 兽 冀 内 円 冇 冈 冉 冊 册 再 冏 冒 冕 冗 写 军 农 冠 冢 冤 冥 冨 冪 冬 冯 冰 冲 决 况 冶 冷 冻 冼 冽 冾 净 凄 准 凇 凈 凉 凋 凌 凍 减 凑 凛 凜 凝 几 凡 凤 処 凪 凭 凯 凰 凱 凳 凶 凸 凹 出 击 函 凿 刀 刁 刃 分 切 刈 刊 刍 刎 刑 划 列 刘 则 刚 创 初 删 判 別 刨 利 刪 别 刮 到 制 刷 券 刹 刺 刻 刽 剁 剂 剃 則 剉 削 剋 剌 前 剎 剐 剑 剔 剖 剛 剜 剝 剣 剤 剥 剧 剩 剪 副 割 創 剷 剽 剿 劃 劇 劈 劉 劊 劍 劏 劑 力 劝 办 功 加 务 劣 动 助 努 劫 劭 励 劲 劳 労 劵 効 劾 势 勁 勃 勇 勉 勋 勐 勒 動 勖 勘 務 勛 勝 勞 募 勢 勤 勧 勳 勵 勸 勺 勻 勾 勿 匀 包 匆 匈 匍 匐 匕 化 北 匙 匝 匠 匡 匣 匪 匮 匯 匱 匹 区 医 匾 匿 區 十 千 卅 升 午 卉 半 卍 华 协 卑 卒 卓 協 单 卖 南 単 博 卜 卞 卟 占 卡 卢 卤 卦 卧 卫 卮 卯 印 危 即 却 卵 卷 卸 卻 卿 厂 厄 厅 历 厉 压 厌 厕 厘 厚 厝 原 厢 厥 厦 厨 厩 厭 厮 厲 厳 
去 县 叁 参 參 又 叉 及 友 双 反 収 发 叔 取 受 变 叙 叛 叟 叠 叡 叢 口 古 句 另 叨 叩 只 叫 召 叭 叮 可 台 叱 史 右 叵 叶 号 司 叹 叻 叼 叽 吁 吃 各 吆 合 吉 吊 吋 同 名 后 吏 吐 向 吒 吓 吕 吖 吗 君 吝 吞 吟 吠 吡 否 吧 吨 吩 含 听 吭 吮 启 吱 吳 吴 吵 吶 吸 吹 吻 吼 吽 吾 呀 呂 呃 呆 呈 告 呋 呎 呐 呓 呕 呗 员 呛 呜 呢 呤 呦 周 呱 呲 味 呵 呷 呸 呻 呼 命 咀 咁 咂 咄 咆 咋 和 咎 咏 咐 咒 咔 咕 咖 咗 咘 咙 咚 咛 咣 咤 咦 咧 咨 咩 咪 咫 咬 咭 咯 咱 咲 咳 咸 咻 咽 咿 哀 品 哂 哄 哆 哇 哈 哉 哋 哌 响 哎 哏 哐 哑 哒 哔 哗 哟 員 哥 哦 哧 哨 哩 哪 哭 哮 哲 哺 哼 哽 唁 唄 唆 唇 唉 唏 唐 唑 唔 唠 唤 唧 唬 售 唯 唰 唱 唳 唷 唸 唾 啃 啄 商 啉 啊 問 啓 啕 啖 啜 啞 啟 啡 啤 啥 啦 啧 啪 啫 啬 啮 啰 啱 啲 啵 啶 啷 啸 啻 啼 啾 喀 喂 喃 善 喆 喇 喉 喊 喋 喎 喏 喔 喘 喙 喚 喜 喝 喟 喧 喪 喫 喬 單 喰 喱 喲 喳 喵 営 喷 喹 喺 喻 喽 嗅 嗆 嗇 嗎 嗑 嗒 嗓 嗔 嗖 嗚 嗜 嗝 嗟 嗡 嗣 嗤 嗦 嗨 嗪 嗬 嗯 嗰 嗲 嗳 嗶 嗷 嗽 嘀 嘅 嘆 嘈 嘉 嘌 嘍 嘎 嘔 嘖 嘗 嘘 嘚 嘛 嘜 嘞 嘟 嘢 嘣 嘤 嘧 嘩 嘭 嘮 嘯 嘰 嘱 嘲 嘴 嘶 嘸 嘹 嘻 嘿 噁 噌 噎 噓 噔 噗 噙 噜 噠 噢 噤 器 噩 噪 噬 噱 噴 噶 噸 噹 噻 噼 嚀 嚇 嚎 嚏 嚐 嚓 嚕 嚟 嚣 嚥 嚨 嚮 嚴 嚷 嚼 囂 囉 囊 囍 囑 囔 囗 囚 四 囝 回 囟 因 囡 团 団 囤 囧 囪 囫 园 困 囱 囲 図 围 囹 固 国 图 囿 圃 圄 圆 圈 國 圍 圏 園 圓 圖 團 圜 土 圣 圧 在 圩 圭 地 圳 场 圻 圾 址 坂 均 坊 坍 坎 坏 坐 坑 块 坚 坛 坝 坞 坟 坠 坡 坤 坦 坨 坪 坯 坳 坵 坷 垂 垃 垄 型 垒 垚 垛 垠 垢 垣 垦 垩 垫 垭 垮 垵 埂 埃 埋 城 埔 埕 埗 域 埠 埤 埵 執 埸 培 基 埼 堀 堂 堃 堅 堆 堇 堑 堕 堙 堡 堤 堪 堯 堰 報 場 堵 堺 堿 塊 塌 塑 塔 塗 塘 塚 塞 塢 塩 填 塬 塭 塵 塾 墀 境 墅 墉 墊 墒 墓 増 墘 墙 墜 增 墟 墨 墩 墮 墳 墻 墾 壁 壅 壆 壇 壊 壑 壓 壕 壘 壞 壟 壢 壤 壩 士 壬 壮 壯 声 売 壳 壶 壹 壺 壽 处 备 変 复 夏 夔 夕 外 夙 多 夜 够 夠 夢 夥 大 天 太 夫 夭 央 夯 失 头 夷 夸 夹 夺 夾 奂 奄 奇 奈 奉 奋 奎 奏 奐 契 奔 奕 奖 套 奘 奚 奠 奢 奥 奧 奪 奬 奮 女 奴 奶 奸 她 好 如 妃 妄 妆 妇 妈 妊 妍 妒 妓 妖 妘 妙 妝 妞 妣 妤 妥 妨 妩 妪 妮 妲 妳 妹 妻 妾 姆 姉 姊 始 姍 姐 姑 姒 姓 委 姗 姚 姜 姝 姣 姥 姦 姨 姪 姫 姬 姹 姻 姿 威 娃 娄 娅 娆 娇 娉 娑 娓 娘 娛 娜 娟 娠 娣 娥 娩 娱 娲 娴 娶 娼 婀 婁 婆 婉 婊 婕 婚 婢 婦 婧 婪 婭 婴 婵 婶 婷 婺 婿 媒 媚 媛 媞 媧 媲 媳 媽 媾 嫁 嫂 嫉 嫌 嫑 嫔 嫖 嫘 嫚 嫡 嫣 嫦 嫩 嫲 嫵 嫻 嬅 嬉 嬌 嬗 嬛 嬢 嬤 嬪 嬰 嬴 嬷 嬸 嬿 孀 孃 子 孑 孔 孕 孖 字 存 孙 孚 孛 孜 孝 孟 孢 季 孤 学 孩 孪 孫 孬 孰 孱 孳 孵 學 孺 孽 孿 宁 它 宅 宇 守 安 宋 完 宏 宓 宕 宗 官 宙 定 宛 宜 宝 实 実 宠 审 客 宣 室 宥 宦 宪 宫 宮 宰 害 宴 宵 家 宸 容 宽 宾 宿 寂 寄 寅 密 寇 富 寐 寒 寓 寛 寝 寞 察 寡 寢 寥 實 寧 寨 審 寫 寬 寮 寰 寵 寶 寸 对 寺 寻 导 対 寿 封 専 射 将 將 專 尉 尊 尋 對 導 小 少 尔 尕 尖 尘 尚 尝 尤 尧 尬 就 尴 尷 尸 尹 尺 尻 尼 尽 尾 尿 局 屁 层 屄 居 屆 屈 屉 届 屋 屌 屍 屎 屏 屐 屑 展 屜 属 屠 屡 屢 層 履 屬 屯 山 屹 屿 岀 岁 岂 岌 岐 岑 岔 岖 岗 岘 岙 岚 岛 岡 岩 岫 岬 岭 岱 岳 岷 岸 峇 峋 峒 峙 峡 峤 峥 峦 峨 峪 峭 峯 峰 峴 島 峻 峽 崁 崂 崆 崇 崎 崑 崔 崖 崗 崙 崛 崧 崩 崭 崴 崽 嵇 嵊 嵋 嵌 嵐 嵘 嵩 嵬 嵯 嶂 嶄 嶇 嶋 嶙 嶺 嶼 嶽 巅 巍 巒 巔 巖 川 州 巡 巢 工 左 巧 巨 
巩 巫 差 己 已 巳 巴 巷 巻 巽 巾 巿 币 市 布 帅 帆 师 希 帐 帑 帕 帖 帘 帚 帛 帜 帝 帥 带 帧 師 席 帮 帯 帰 帳 帶 帷 常 帼 帽 幀 幂 幄 幅 幌 幔 幕 幟 幡 幢 幣 幫 干 平 年 并 幸 幹 幺 幻 幼 幽 幾 广 庁 広 庄 庆 庇 床 序 庐 库 应 底 庖 店 庙 庚 府 庞 废 庠 度 座 庫 庭 庵 庶 康 庸 庹 庾 廁 廂 廃 廈 廉 廊 廓 廖 廚 廝 廟 廠 廢 廣 廬 廳 延 廷 建 廿 开 弁 异 弃 弄 弈 弊 弋 式 弑 弒 弓 弔 引 弗 弘 弛 弟 张 弥 弦 弧 弩 弭 弯 弱 張 強 弹 强 弼 弾 彅 彆 彈 彌 彎 归 当 录 彗 彙 彝 形 彤 彥 彦 彧 彩 彪 彫 彬 彭 彰 影 彷 役 彻 彼 彿 往 征 径 待 徇 很 徉 徊 律 後 徐 徑 徒 従 徕 得 徘 徙 徜 從 徠 御 徨 復 循 徬 微 徳 徴 徵 德 徹 徼 徽 心 必 忆 忌 忍 忏 忐 忑 忒 忖 志 忘 忙 応 忠 忡 忤 忧 忪 快 忱 念 忻 忽 忿 怀 态 怂 怅 怆 怎 怏 怒 怔 怕 怖 怙 怜 思 怠 怡 急 怦 性 怨 怪 怯 怵 总 怼 恁 恃 恆 恋 恍 恐 恒 恕 恙 恚 恢 恣 恤 恥 恨 恩 恪 恫 恬 恭 息 恰 恳 恵 恶 恸 恺 恻 恼 恿 悄 悅 悉 悌 悍 悔 悖 悚 悟 悠 患 悦 您 悩 悪 悬 悯 悱 悲 悴 悵 悶 悸 悻 悼 悽 情 惆 惇 惊 惋 惑 惕 惘 惚 惜 惟 惠 惡 惦 惧 惨 惩 惫 惬 惭 惮 惯 惰 惱 想 惴 惶 惹 惺 愁 愆 愈 愉 愍 意 愕 愚 愛 愜 感 愣 愤 愧 愫 愷 愿 慄 慈 態 慌 慎 慑 慕 慘 慚 慟 慢 慣 慧 慨 慫 慮 慰 慳 慵 慶 慷 慾 憂 憊 憋 憎 憐 憑 憔 憚 憤 憧 憨 憩 憫 憬 憲 憶 憾 懂 懇 懈 應 懊 懋 懑 懒 懦 懲 懵 懶 懷 懸 懺 懼 懾 懿 戀 戈 戊 戌 戍 戎 戏 成 我 戒 戕 或 战 戚 戛 戟 戡 戦 截 戬 戮 戰 戲 戳 戴 戶 户 戸 戻 戾 房 所 扁 扇 扈 扉 手 才 扎 扑 扒 打 扔 払 托 扛 扣 扦 执 扩 扪 扫 扬 扭 扮 扯 扰 扱 扳 扶 批 扼 找 承 技 抄 抉 把 抑 抒 抓 投 抖 抗 折 抚 抛 抜 択 抟 抠 抡 抢 护 报 抨 披 抬 抱 抵 抹 押 抽 抿 拂 拄 担 拆 拇 拈 拉 拋 拌 拍 拎 拐 拒 拓 拔 拖 拗 拘 拙 拚 招 拜 拟 拡 拢 拣 拥 拦 拧 拨 择 括 拭 拮 拯 拱 拳 拴 拷 拼 拽 拾 拿 持 挂 指 挈 按 挎 挑 挖 挙 挚 挛 挝 挞 挟 挠 挡 挣 挤 挥 挨 挪 挫 振 挲 挹 挺 挽 挾 捂 捅 捆 捉 捋 捌 捍 捎 捏 捐 捕 捞 损 捡 换 捣 捧 捨 捩 据 捱 捲 捶 捷 捺 捻 掀 掂 掃 掇 授 掉 掌 掏 掐 排 掖 掘 掙 掛 掠 採 探 掣 接 控 推 掩 措 掬 掰 掲 掳 掴 掷 掸 掺 揀 揃 揄 揆 揉 揍 描 提 插 揖 揚 換 握 揣 揩 揪 揭 揮 援 揶 揸 揹 揽 搀 搁 搂 搅 損 搏 搐 搓 搔 搖 搗 搜 搞 搡 搪 搬 搭 搵 搶 携 搽 摀 摁 摄 摆 摇 摈 摊 摒 摔 摘 摞 摟 摧 摩 摯 摳 摸 摹 摺 摻 撂 撃 撅 撇 撈 撐 撑 撒 撓 撕 撚 撞 撤 撥 撩 撫 撬 播 撮 撰 撲 撵 撷 撸 撻 撼 撿 擀 擁 擂 擄 擅 擇 擊 擋 操 擎 擒 擔 擘 據 擞 擠 擡 擢 擦 擬 擰 擱 擲 擴 擷 擺 擼 擾 攀 攏 攒 攔 攘 攙 攜 攝 攞 攢 攣 攤 攥 攪 攫 攬 支 收 攸 改 攻 放 政 故 效 敌 敍 敎 敏 救 敕 敖 敗 敘 教 敛 敝 敞 敢 散 敦 敬 数 敲 整 敵 敷 數 斂 斃 文 斋 斌 斎 斐 斑 斓 斗 料 斛 斜 斟 斡 斤 斥 斧 斩 斫 斬 断 斯 新 斷 方 於 施 旁 旃 旅 旋 旌 旎 族 旖 旗 无 既 日 旦 旧 旨 早 旬 旭 旮 旱 时 旷 旺 旻 昀 昂 昆 昇 昉 昊 昌 明 昏 易 昔 昕 昙 星 映 春 昧 昨 昭 是 昱 昴 昵 昶 昼 显 晁 時 晃 晉 晋 晌 晏 晒 晓 晔 晕 晖 晗 晚 晝 晞 晟 晤 晦 晨 晩 普 景 晰 晴 晶 晷 智 晾 暂 暄 暇 暈 暉 暌 暐 暑 暖 暗 暝 暢 暧 暨 暫 暮 暱 暴 暸 暹 曄 曆 曇 曉 曖 曙 曜 曝 曠 曦 曬 曰 曲 曳 更 書 曹 曼 曾 替 最 會 月 有 朋 服 朐 朔 朕 朗 望 朝 期 朦 朧 木 未 末 本 札 朮 术 朱 朴 朵 机 朽 杀 杂 权 杆 杈 杉 李 杏 材 村 杓 杖 杜 杞 束 杠 条 来 杨 
杭 杯 杰 東 杳 杵 杷 杼 松 板 极 构 枇 枉 枋 析 枕 林 枚 果 枝 枢 枣 枪 枫 枭 枯 枰 枱 枳 架 枷 枸 柄 柏 某 柑 柒 染 柔 柘 柚 柜 柞 柠 柢 查 柩 柬 柯 柱 柳 柴 柵 査 柿 栀 栃 栄 栅 标 栈 栉 栋 栎 栏 树 栓 栖 栗 校 栩 株 样 核 根 格 栽 栾 桀 桁 桂 桃 桅 框 案 桉 桌 桎 桐 桑 桓 桔 桜 桠 桡 桢 档 桥 桦 桧 桨 桩 桶 桿 梁 梅 梆 梏 梓 梗 條 梟 梢 梦 梧 梨 梭 梯 械 梳 梵 梶 检 棂 棄 棉 棋 棍 棒 棕 棗 棘 棚 棟 棠 棣 棧 森 棱 棲 棵 棹 棺 椁 椅 椋 植 椎 椒 検 椪 椭 椰 椹 椽 椿 楂 楊 楓 楔 楚 楝 楞 楠 楣 楨 楫 業 楮 極 楷 楸 楹 楼 楽 概 榄 榆 榈 榉 榔 榕 榖 榛 榜 榨 榫 榭 榮 榱 榴 榷 榻 槁 槃 構 槌 槍 槎 槐 槓 様 槛 槟 槤 槭 槲 槳 槻 槽 槿 樁 樂 樊 樑 樓 標 樞 樟 模 樣 権 横 樫 樯 樱 樵 樸 樹 樺 樽 樾 橄 橇 橋 橐 橘 橙 機 橡 橢 橫 橱 橹 橼 檀 檄 檎 檐 檔 檗 檜 檢 檬 檯 檳 檸 檻 櫃 櫚 櫛 櫥 櫸 櫻 欄 權 欒 欖 欠 次 欢 欣 欧 欲 欸 欺 欽 款 歆 歇 歉 歌 歎 歐 歓 歙 歛 歡 止 正 此 步 武 歧 歩 歪 歯 歲 歳 歴 歷 歸 歹 死 歼 殁 殃 殆 殇 殉 殊 残 殒 殓 殖 殘 殞 殡 殤 殭 殯 殲 殴 段 殷 殺 殼 殿 毀 毁 毂 毅 毆 毋 母 毎 每 毒 毓 比 毕 毗 毘 毙 毛 毡 毫 毯 毽 氈 氏 氐 民 氓 气 氖 気 氙 氛 氟 氡 氢 氣 氤 氦 氧 氨 氪 氫 氮 氯 氰 氲 水 氷 永 氹 氾 汀 汁 求 汆 汇 汉 汎 汐 汕 汗 汙 汛 汝 汞 江 池 污 汤 汨 汩 汪 汰 汲 汴 汶 汹 決 汽 汾 沁 沂 沃 沅 沈 沉 沌 沏 沐 沒 沓 沖 沙 沛 沟 没 沢 沣 沥 沦 沧 沪 沫 沭 沮 沱 河 沸 油 治 沼 沽 沾 沿 況 泄 泉 泊 泌 泓 法 泗 泛 泞 泠 泡 波 泣 泥 注 泪 泫 泮 泯 泰 泱 泳 泵 泷 泸 泻 泼 泽 泾 洁 洄 洋 洒 洗 洙 洛 洞 津 洩 洪 洮 洱 洲 洵 洶 洸 洹 活 洼 洽 派 流 浃 浄 浅 浆 浇 浊 测 济 浏 浑 浒 浓 浔 浙 浚 浜 浣 浦 浩 浪 浬 浮 浯 浴 海 浸 涂 涅 涇 消 涉 涌 涎 涓 涔 涕 涙 涛 涝 涞 涟 涠 涡 涣 涤 润 涧 涨 涩 涪 涮 涯 液 涵 涸 涼 涿 淀 淄 淅 淆 淇 淋 淌 淑 淒 淖 淘 淙 淚 淞 淡 淤 淦 淨 淩 淪 淫 淬 淮 深 淳 淵 混 淹 淺 添 淼 清 済 渉 渊 渋 渍 渎 渐 渔 渗 渙 渚 減 渝 渠 渡 渣 渤 渥 渦 温 測 渭 港 渲 渴 游 渺 渾 湃 湄 湊 湍 湖 湘 湛 湟 湧 湫 湮 湯 湳 湾 湿 満 溃 溅 溉 溏 源 準 溜 溝 溟 溢 溥 溧 溪 溫 溯 溱 溴 溶 溺 溼 滁 滂 滄 滅 滇 滋 滌 滑 滓 滔 滕 滙 滚 滝 滞 滟 满 滢 滤 滥 滦 滨 滩 滬 滯 滲 滴 滷 滸 滾 滿 漁 漂 漆 漉 漏 漓 演 漕 漠 漢 漣 漩 漪 漫 漬 漯 漱 漲 漳 漸 漾 漿 潆 潇 潋 潍 潑 潔 潘 潛 潜 潞 潟 潢 潤 潦 潧 潭 潮 潰 潴 潸 潺 潼 澀 澄 澆 澈 澍 澎 澗 澜 澡 澤 澧 澱 澳 澹 激 濁 濂 濃 濑 濒 濕 濘 濛 濟 濠 濡 濤 濫 濬 濮 濯 濱 濺 濾 瀅 瀆 瀉 瀋 瀏 瀑 瀕 瀘 瀚 瀛 瀝 瀞 瀟 瀧 瀨 瀬 瀰 瀾 灌 灏 灑 灘 灝 灞 灣 火 灬 灭 灯 灰 灵 灶 灸 灼 災 灾 灿 炀 炁 炅 炉 炊 炎 炒 炔 炕 炖 炙 炜 炫 炬 炭 炮 炯 炳 炷 炸 点 為 炼 炽 烁 烂 烃 烈 烊 烏 烘 烙 烛 烟 烤 烦 烧 烨 烩 烫 烬 热 烯 烷 烹 烽 焉 焊 焕 焖 焗 焘 焙 焚 焜 無 焦 焯 焰 焱 然 焼 煅 煉 煊 煌 煎 煒 煖 煙 煜 煞 煤 煥 煦 照 煨 煩 煮 煲 煸 煽 熄 熊 熏 熒 熔 熙 熟 熠 熨 熬 熱 熵 熹 熾 燁 燃 燄 燈 燉 燊 燎 燒 燔 燕 燙 燜 營 燥 燦 燧 燭 燮 燴 燻 燼 燿 爆 爍 爐 爛 爪 爬 爭 爰 爱 爲 爵 父 爷 爸 爹 爺 爻 爽 爾 牆 片 版 牌 牍 牒 牙 牛 牝 牟 牠 牡 牢 牦 牧 物 牯 牲 牴 牵 特 牺 牽 犀 犁 犄 犊 犍 犒 犢 犧 犬 犯 状 犷 犸 犹 狀 狂 狄 狈 狎 狐 狒 狗 狙 狞 狠 狡 狩 独 狭 狮 狰 狱 狸 狹 狼 狽 猎 猕 猖 猗 猙 猛 猜 猝 猥 猩 
猪 猫 猬 献 猴 猶 猷 猾 猿 獄 獅 獎 獐 獒 獗 獠 獣 獨 獭 獰 獲 獵 獷 獸 獺 獻 獼 獾 玄 率 玉 王 玑 玖 玛 玟 玠 玥 玩 玫 玮 环 现 玲 玳 玷 玺 玻 珀 珂 珅 珈 珉 珊 珍 珏 珐 珑 珙 珞 珠 珣 珥 珩 珪 班 珮 珲 珺 現 球 琅 理 琇 琉 琊 琍 琏 琐 琛 琢 琥 琦 琨 琪 琬 琮 琰 琲 琳 琴 琵 琶 琺 琼 瑀 瑁 瑄 瑋 瑕 瑗 瑙 瑚 瑛 瑜 瑞 瑟 瑠 瑣 瑤 瑩 瑪 瑯 瑰 瑶 瑾 璀 璁 璃 璇 璉 璋 璎 璐 璜 璞 璟 璧 璨 環 璽 璿 瓊 瓏 瓒 瓜 瓢 瓣 瓤 瓦 瓮 瓯 瓴 瓶 瓷 甄 甌 甕 甘 甙 甚 甜 生 產 産 甥 甦 用 甩 甫 甬 甭 甯 田 由 甲 申 电 男 甸 町 画 甾 畀 畅 界 畏 畑 畔 留 畜 畝 畢 略 畦 番 畫 異 畲 畳 畴 當 畸 畹 畿 疆 疇 疊 疏 疑 疔 疖 疗 疙 疚 疝 疟 疡 疣 疤 疥 疫 疮 疯 疱 疲 疳 疵 疸 疹 疼 疽 疾 痂 病 症 痈 痉 痊 痍 痒 痔 痕 痘 痙 痛 痞 痠 痢 痣 痤 痧 痨 痪 痫 痰 痱 痴 痹 痺 痼 痿 瘀 瘁 瘋 瘍 瘓 瘘 瘙 瘟 瘠 瘡 瘢 瘤 瘦 瘧 瘩 瘪 瘫 瘴 瘸 瘾 療 癇 癌 癒 癖 癜 癞 癡 癢 癣 癥 癫 癬 癮 癱 癲 癸 発 登 發 白 百 皂 的 皆 皇 皈 皋 皎 皑 皓 皖 皙 皚 皮 皰 皱 皴 皺 皿 盂 盃 盅 盆 盈 益 盎 盏 盐 监 盒 盔 盖 盗 盘 盛 盜 盞 盟 盡 監 盤 盥 盧 盪 目 盯 盱 盲 直 相 盹 盼 盾 省 眈 眉 看 県 眙 眞 真 眠 眦 眨 眩 眯 眶 眷 眸 眺 眼 眾 着 睁 睇 睏 睐 睑 睛 睜 睞 睡 睢 督 睥 睦 睨 睪 睫 睬 睹 睽 睾 睿 瞄 瞅 瞇 瞋 瞌 瞎 瞑 瞒 瞓 瞞 瞟 瞠 瞥 瞧 瞩 瞪 瞬 瞭 瞰 瞳 瞻 瞼 瞿 矇 矍 矗 矚 矛 矜 矢 矣 知 矩 矫 短 矮 矯 石 矶 矽 矾 矿 码 砂 砌 砍 砒 研 砖 砗 砚 砝 砣 砥 砧 砭 砰 砲 破 砷 砸 砺 砼 砾 础 硅 硐 硒 硕 硝 硫 硬 确 硯 硼 碁 碇 碉 碌 碍 碎 碑 碓 碗 碘 碚 碛 碟 碣 碧 碩 碰 碱 碳 碴 確 碼 碾 磁 磅 磊 磋 磐 磕 磚 磡 磨 磬 磯 磲 磷 磺 礁 礎 礙 礡 礦 礪 礫 礴 示 礼 社 祀 祁 祂 祇 祈 祉 祎 祐 祕 祖 祗 祚 祛 祜 祝 神 祟 祠 祢 祥 票 祭 祯 祷 祸 祺 祿 禀 禁 禄 禅 禍 禎 福 禛 禦 禧 禪 禮 禱 禹 禺 离 禽 禾 禿 秀 私 秃 秆 秉 秋 种 科 秒 秘 租 秣 秤 秦 秧 秩 秭 积 称 秸 移 秽 稀 稅 程 稍 税 稔 稗 稚 稜 稞 稟 稠 稣 種 稱 稲 稳 稷 稹 稻 稼 稽 稿 穀 穂 穆 穌 積 穎 穗 穢 穩 穫 穴 究 穷 穹 空 穿 突 窃 窄 窈 窍 窑 窒 窓 窕 窖 窗 窘 窜 窝 窟 窠 窥 窦 窨 窩 窪 窮 窯 窺 窿 竄 竅 竇 竊 立 竖 站 竜 竞 竟 章 竣 童 竭 端 競 竹 竺 竽 竿 笃 笆 笈 笋 笏 笑 笔 笙 笛 笞 笠 符 笨 第 笹 笺 笼 筆 等 筊 筋 筍 筏 筐 筑 筒 答 策 筛 筝 筠 筱 筲 筵 筷 筹 签 简 箇 箋 箍 箏 箐 箔 箕 算 箝 管 箩 箫 箭 箱 箴 箸 節 篁 範 篆 篇 築 篑 篓 篙 篝 篠 篡 篤 篩 篪 篮 篱 篷 簇 簌 簍 簡 簦 簧 簪 簫 簷 簸 簽 簾 簿 籁 籃 籌 籍 籐 籟 籠 籤 籬 籮 籲 米 类 籼 籽 粄 粉 粑 粒 粕 粗 粘 粟 粤 粥 粧 粪 粮 粱 粲 粳 粵 粹 粼 粽 精 粿 糅 糊 糍 糕 糖 糗 糙 糜 糞 糟 糠 糧 糬 糯 糰 糸 系 糾 紀 紂 約 紅 紉 紊 紋 納 紐 紓 純 紗 紘 紙 級 紛 紜 素 紡 索 紧 紫 紮 累 細 紳 紹 紺 終 絃 組 絆 経 結 絕 絞 絡 絢 給 絨 絮 統 絲 絳 絵 絶 絹 綁 綏 綑 經 継 続 綜 綠 綢 綦 綫 綬 維 綱 網 綴 綵 綸 綺 綻 綽 綾 綿 緊 緋 総 緑 緒 緘 線 緝 緞 締 緣 編 緩 緬 緯 練 緹 緻 縁 縄 縈 縛 縝 縣 縫 縮 縱 縴 縷 總 績 繁 繃 繆 繇 繋 織 繕 繚 繞 繡 繩 繪 繫 繭 繳 繹 繼 繽 纂 續 纍 纏 纓 纔 纖 纜 纠 红 纣 纤 约 级 纨 纪 纫 纬 纭 纯 纰 纱 纲 纳 纵 纶 纷 纸 纹 纺 纽 纾 线 绀 练 组 绅 细 织 终 绊 绍 绎 经 绑 绒 结 绔 绕 绘 给 绚 绛 络 绝 绞 统 绡 绢 绣 绥 绦 继 绩 绪 绫 续 绮 绯 绰 绳 维 绵 绶 绷 绸 绻 综 绽 
绾 绿 缀 缄 缅 缆 缇 缈 缉 缎 缓 缔 缕 编 缘 缙 缚 缜 缝 缠 缢 缤 缥 缨 缩 缪 缭 缮 缰 缱 缴 缸 缺 缽 罂 罄 罌 罐 网 罔 罕 罗 罚 罡 罢 罩 罪 置 罰 署 罵 罷 罹 羁 羅 羈 羊 羌 美 羔 羚 羞 羟 羡 羣 群 羥 羧 羨 義 羯 羲 羸 羹 羽 羿 翁 翅 翊 翌 翎 習 翔 翘 翟 翠 翡 翦 翩 翰 翱 翳 翹 翻 翼 耀 老 考 耄 者 耆 耋 而 耍 耐 耒 耕 耗 耘 耙 耦 耨 耳 耶 耷 耸 耻 耽 耿 聂 聆 聊 聋 职 聒 联 聖 聘 聚 聞 聪 聯 聰 聲 聳 聴 聶 職 聽 聾 聿 肃 肄 肅 肆 肇 肉 肋 肌 肏 肓 肖 肘 肚 肛 肝 肠 股 肢 肤 肥 肩 肪 肮 肯 肱 育 肴 肺 肽 肾 肿 胀 胁 胃 胄 胆 背 胍 胎 胖 胚 胛 胜 胝 胞 胡 胤 胥 胧 胫 胭 胯 胰 胱 胳 胴 胶 胸 胺 能 脂 脅 脆 脇 脈 脉 脊 脍 脏 脐 脑 脓 脖 脘 脚 脛 脣 脩 脫 脯 脱 脲 脳 脸 脹 脾 腆 腈 腊 腋 腌 腎 腐 腑 腓 腔 腕 腥 腦 腩 腫 腭 腮 腰 腱 腳 腴 腸 腹 腺 腻 腼 腾 腿 膀 膈 膊 膏 膑 膘 膚 膛 膜 膝 膠 膦 膨 膩 膳 膺 膻 膽 膾 膿 臀 臂 臃 臆 臉 臊 臍 臓 臘 臟 臣 臥 臧 臨 自 臬 臭 至 致 臺 臻 臼 臾 舀 舂 舅 舆 與 興 舉 舊 舌 舍 舎 舐 舒 舔 舖 舗 舛 舜 舞 舟 航 舫 般 舰 舱 舵 舶 舷 舸 船 舺 舾 艇 艋 艘 艙 艦 艮 良 艰 艱 色 艳 艷 艹 艺 艾 节 芃 芈 芊 芋 芍 芎 芒 芙 芜 芝 芡 芥 芦 芩 芪 芫 芬 芭 芮 芯 花 芳 芷 芸 芹 芻 芽 芾 苁 苄 苇 苋 苍 苏 苑 苒 苓 苔 苕 苗 苛 苜 苞 苟 苡 苣 若 苦 苫 苯 英 苷 苹 苻 茁 茂 范 茄 茅 茉 茎 茏 茗 茜 茧 茨 茫 茬 茭 茯 茱 茲 茴 茵 茶 茸 茹 茼 荀 荃 荆 草 荊 荏 荐 荒 荔 荖 荘 荚 荞 荟 荠 荡 荣 荤 荥 荧 荨 荪 荫 药 荳 荷 荸 荻 荼 荽 莅 莆 莉 莊 莎 莒 莓 莖 莘 莞 莠 莢 莧 莪 莫 莱 莲 莴 获 莹 莺 莽 莿 菀 菁 菅 菇 菈 菊 菌 菏 菓 菖 菘 菜 菟 菠 菡 菩 華 菱 菲 菸 菽 萁 萃 萄 萊 萋 萌 萍 萎 萘 萝 萤 营 萦 萧 萨 萩 萬 萱 萵 萸 萼 落 葆 葉 著 葚 葛 葡 董 葦 葩 葫 葬 葭 葯 葱 葳 葵 葷 葺 蒂 蒋 蒐 蒔 蒙 蒜 蒞 蒟 蒡 蒨 蒲 蒸 蒹 蒻 蒼 蒿 蓁 蓄 蓆 蓉 蓋 蓑 蓓 蓖 蓝 蓟 蓦 蓬 蓮 蓼 蓿 蔑 蔓 蔔 蔗 蔘 蔚 蔡 蔣 蔥 蔫 蔬 蔭 蔵 蔷 蔺 蔻 蔼 蔽 蕁 蕃 蕈 蕉 蕊 蕎 蕙 蕤 蕨 蕩 蕪 蕭 蕲 蕴 蕻 蕾 薄 薅 薇 薈 薊 薏 薑 薔 薙 薛 薦 薨 薩 薪 薬 薯 薰 薹 藉 藍 藏 藐 藓 藕 藜 藝 藤 藥 藩 藹 藻 藿 蘆 蘇 蘊 蘋 蘑 蘚 蘭 蘸 蘼 蘿 虎 虏 虐 虑 虔 處 虚 虛 虜 虞 號 虢 虧 虫 虬 虱 虹 虻 虽 虾 蚀 蚁 蚂 蚊 蚌 蚓 蚕 蚜 蚝 蚣 蚤 蚩 蚪 蚯 蚱 蚵 蛀 蛆 蛇 蛊 蛋 蛎 蛐 蛔 蛙 蛛 蛟 蛤 蛭 蛮 蛰 蛳 蛹 蛻 蛾 蜀 蜂 蜃 蜆 蜇 蜈 蜊 蜍 蜒 蜓 蜕 蜗 蜘 蜚 蜜 蜡 蜢 蜥 蜱 蜴 蜷 蜻 蜿 蝇 蝈 蝉 蝌 蝎 蝕 蝗 蝙 蝟 蝠 蝦 蝨 蝴 蝶 蝸 蝼 螂 螃 融 螞 螢 螨 螯 螳 螺 蟀 蟄 蟆 蟋 蟎 蟑 蟒 蟠 蟬 蟲 蟹 蟻 蟾 蠅 蠍 蠔 蠕 蠛 蠟 蠡 蠢 蠣 蠱 蠶 蠹 蠻 血 衄 衅 衆 行 衍 術 衔 街 衙 衛 衝 衞 衡 衢 衣 补 表 衩 衫 衬 衮 衰 衲 衷 衹 衾 衿 袁 袂 袄 袅 袈 袋 袍 袒 袖 袜 袞 袤 袪 被 袭 袱 裁 裂 装 裆 裊 裏 裔 裕 裘 裙 補 裝 裟 裡 裤 裨 裱 裳 裴 裸 裹 製 裾 褂 複 褐 褒 褓 褔 褚 褥 褪 褫 褲 褶 褻 襁 襄 襟 襠 襪 襬 襯 襲 西 要 覃 覆 覇 見 規 覓 視 覚 覦 覧 親 覬 観 覷 覺 覽 觀 见 观 规 觅 视 览 觉 觊 觎 觐 觑 角 觞 解 觥 触 觸 言 訂 計 訊 討 訓 訕 訖 託 記 訛 訝 訟 訣 訥 訪 設 許 訳 訴 訶 診 註 証 詆 詐 詔 評 詛 詞 詠 詡 詢 詣 試 詩 詫 詬 詭 詮 詰 話 該 詳 詹 詼 誅 誇 誉 誌 認 誓 誕 誘 語 誠 誡 誣 誤 誥 誦 誨 說 説 読 誰 課 誹 誼 調 諄 談 請 諏 諒 論 諗 諜 諡 諦 諧 諫 諭 諮 諱 諳 諷 諸 諺 諾 謀 謁 謂 謄 謊 謎 謐 謔 謗 謙 講 謝 
謠 謨 謬 謹 謾 譁 證 譎 譏 識 譙 譚 譜 警 譬 譯 議 譲 譴 護 譽 讀 變 讓 讚 讞 计 订 认 讥 讧 讨 让 讪 讫 训 议 讯 记 讲 讳 讴 讶 讷 许 讹 论 讼 讽 设 访 诀 证 诃 评 诅 识 诈 诉 诊 诋 词 诏 译 试 诗 诘 诙 诚 诛 话 诞 诟 诠 诡 询 诣 诤 该 详 诧 诩 诫 诬 语 误 诰 诱 诲 说 诵 诶 请 诸 诺 读 诽 课 诿 谀 谁 调 谄 谅 谆 谈 谊 谋 谌 谍 谎 谏 谐 谑 谒 谓 谔 谕 谗 谘 谙 谚 谛 谜 谟 谢 谣 谤 谥 谦 谧 谨 谩 谪 谬 谭 谯 谱 谲 谴 谶 谷 豁 豆 豇 豈 豉 豊 豌 豎 豐 豔 豚 象 豢 豪 豫 豬 豹 豺 貂 貅 貌 貓 貔 貘 貝 貞 負 財 貢 貧 貨 販 貪 貫 責 貯 貰 貳 貴 貶 買 貸 費 貼 貽 貿 賀 賁 賂 賃 賄 資 賈 賊 賑 賓 賜 賞 賠 賡 賢 賣 賤 賦 質 賬 賭 賴 賺 購 賽 贅 贈 贊 贍 贏 贓 贖 贛 贝 贞 负 贡 财 责 贤 败 账 货 质 贩 贪 贫 贬 购 贮 贯 贰 贱 贲 贴 贵 贷 贸 费 贺 贻 贼 贾 贿 赁 赂 赃 资 赅 赈 赊 赋 赌 赎 赏 赐 赓 赔 赖 赘 赚 赛 赝 赞 赠 赡 赢 赣 赤 赦 赧 赫 赭 走 赳 赴 赵 赶 起 趁 超 越 趋 趕 趙 趟 趣 趨 足 趴 趵 趸 趺 趾 跃 跄 跆 跋 跌 跎 跑 跖 跚 跛 距 跟 跡 跤 跨 跩 跪 路 跳 践 跷 跹 跺 跻 踉 踊 踌 踏 踐 踝 踞 踟 踢 踩 踪 踮 踱 踴 踵 踹 蹂 蹄 蹇 蹈 蹉 蹊 蹋 蹑 蹒 蹙 蹟 蹣 蹤 蹦 蹩 蹬 蹭 蹲 蹴 蹶 蹺 蹼 蹿 躁 躇 躉 躊 躋 躍 躏 躪 身 躬 躯 躲 躺 軀 車 軋 軌 軍 軒 軟 転 軸 軼 軽 軾 較 載 輒 輓 輔 輕 輛 輝 輟 輩 輪 輯 輸 輻 輾 輿 轄 轅 轆 轉 轍 轎 轟 车 轧 轨 轩 转 轭 轮 软 轰 轲 轴 轶 轻 轼 载 轿 较 辄 辅 辆 辇 辈 辉 辊 辍 辐 辑 输 辕 辖 辗 辘 辙 辛 辜 辞 辟 辣 辦 辨 辩 辫 辭 辮 辯 辰 辱 農 边 辺 辻 込 辽 达 迁 迂 迄 迅 过 迈 迎 运 近 返 还 这 进 远 违 连 迟 迢 迤 迥 迦 迩 迪 迫 迭 述 迴 迷 迸 迹 迺 追 退 送 适 逃 逅 逆 选 逊 逍 透 逐 递 途 逕 逗 這 通 逛 逝 逞 速 造 逢 連 逮 週 進 逵 逶 逸 逻 逼 逾 遁 遂 遅 遇 遊 運 遍 過 遏 遐 遑 遒 道 達 違 遗 遙 遛 遜 遞 遠 遢 遣 遥 遨 適 遭 遮 遲 遴 遵 遶 遷 選 遺 遼 遽 避 邀 邁 邂 邃 還 邇 邈 邊 邋 邏 邑 邓 邕 邛 邝 邢 那 邦 邨 邪 邬 邮 邯 邰 邱 邳 邵 邸 邹 邺 邻 郁 郅 郊 郎 郑 郜 郝 郡 郢 郤 郦 郧 部 郫 郭 郴 郵 郷 郸 都 鄂 鄉 鄒 鄔 鄙 鄞 鄢 鄧 鄭 鄰 鄱 鄲 鄺 酉 酊 酋 酌 配 酐 酒 酗 酚 酝 酢 酣 酥 酩 酪 酬 酮 酯 酰 酱 酵 酶 酷 酸 酿 醃 醇 醉 醋 醍 醐 醒 醚 醛 醜 醞 醣 醪 醫 醬 醮 醯 醴 醺 釀 釁 采 釉 释 釋 里 重 野 量 釐 金 釗 釘 釜 針 釣 釦 釧 釵 鈀 鈉 鈍 鈎 鈔 鈕 鈞 鈣 鈦 鈪 鈴 鈺 鈾 鉀 鉄 鉅 鉉 鉑 鉗 鉚 鉛 鉤 鉴 鉻 銀 銃 銅 銑 銓 銖 銘 銜 銬 銭 銮 銳 銷 銹 鋁 鋅 鋒 鋤 鋪 鋰 鋸 鋼 錄 錐 錘 錚 錠 錢 錦 錨 錫 錮 錯 録 錳 錶 鍊 鍋 鍍 鍛 鍥 鍰 鍵 鍺 鍾 鎂 鎊 鎌 鎏 鎔 鎖 鎗 鎚 鎧 鎬 鎮 鎳 鏈 鏖 鏗 鏘 鏞 鏟 鏡 鏢 鏤 鏽 鐘 鐮 鐲 鐳 鐵 鐸 鐺 鑄 鑊 鑑 鑒 鑣 鑫 鑰 鑲 鑼 鑽 鑾 鑿 针 钉 钊 钎 钏 钒 钓 钗 钙 钛 钜 钝 钞 钟 钠 钡 钢 钣 钤 钥 钦 钧 钨 钩 钮 钯 钰 钱 钳 钴 钵 钺 钻 钼 钾 钿 铀 铁 铂 铃 铄 铅 铆 铉 铎 铐 铛 铜 铝 铠 铡 铢 铣 铤 铨 铩 铬 铭 铮 铰 铲 铵 银 铸 铺 链 铿 销 锁 锂 锄 锅 锆 锈 锉 锋 锌 锏 锐 锑 错 锚 锟 锡 锢 锣 锤 锥 锦 锭 键 锯 锰 锲 锵 锹 锺 锻 镀 镁 镂 镇 镉 镌 镍 镐 镑 镕 镖 镗 镛 镜 镣 镭 镯 镰 镳 镶 長 长 門 閃 閉 開 閎 閏 閑 閒 間 閔 閘 閡 関 閣 閥 閨 閩 閱 閲 閹 閻 閾 闆 闇 闊 闌 闍 闔 闕 闖 闘 關 闡 闢 门 闪 闫 闭 问 闯 闰 闲 间 闵 闷 闸 闹 闺 闻 闽 闾 阀 阁 阂 阅 阆 阇 阈 阉 阎 阐 阑 阔 阕 阖 阙 阚 阜 队 阡 阪 阮 
阱 防 阳 阴 阵 阶 阻 阿 陀 陂 附 际 陆 陇 陈 陋 陌 降 限 陕 陛 陝 陞 陟 陡 院 陣 除 陨 险 陪 陰 陲 陳 陵 陶 陷 陸 険 陽 隅 隆 隈 隊 隋 隍 階 随 隐 隔 隕 隘 隙 際 障 隠 隣 隧 隨 險 隱 隴 隶 隸 隻 隼 隽 难 雀 雁 雄 雅 集 雇 雉 雋 雌 雍 雎 雏 雑 雒 雕 雖 雙 雛 雜 雞 離 難 雨 雪 雯 雰 雲 雳 零 雷 雹 電 雾 需 霁 霄 霆 震 霈 霉 霊 霍 霎 霏 霑 霓 霖 霜 霞 霧 霭 霰 露 霸 霹 霽 霾 靂 靄 靈 青 靓 靖 静 靚 靛 靜 非 靠 靡 面 靥 靦 革 靳 靴 靶 靼 鞅 鞋 鞍 鞏 鞑 鞘 鞠 鞣 鞦 鞭 韆 韋 韌 韓 韜 韦 韧 韩 韬 韭 音 韵 韶 韻 響 頁 頂 頃 項 順 須 頌 預 頑 頒 頓 頗 領 頜 頡 頤 頫 頭 頰 頷 頸 頹 頻 頼 顆 題 額 顎 顏 顔 願 顛 類 顧 顫 顯 顱 顴 页 顶 顷 项 顺 须 顼 顽 顾 顿 颁 颂 预 颅 领 颇 颈 颉 颊 颌 颍 颐 频 颓 颔 颖 颗 题 颚 颛 颜 额 颞 颠 颡 颢 颤 颦 颧 風 颯 颱 颳 颶 颼 飄 飆 风 飒 飓 飕 飘 飙 飚 飛 飞 食 飢 飨 飩 飪 飯 飲 飼 飽 飾 餃 餅 餉 養 餌 餐 餒 餓 餘 餚 餛 餞 餡 館 餮 餵 餾 饅 饈 饋 饌 饍 饑 饒 饕 饗 饞 饥 饨 饪 饬 饭 饮 饯 饰 饱 饲 饴 饵 饶 饷 饺 饼 饽 饿 馀 馁 馄 馅 馆 馈 馋 馍 馏 馒 馔 首 馗 香 馥 馨 馬 馭 馮 馳 馴 駁 駄 駅 駆 駐 駒 駕 駛 駝 駭 駱 駿 騁 騎 騏 験 騙 騨 騰 騷 驀 驅 驊 驍 驒 驕 驗 驚 驛 驟 驢 驥 马 驭 驮 驯 驰 驱 驳 驴 驶 驷 驸 驹 驻 驼 驾 驿 骁 骂 骄 骅 骆 骇 骈 骊 骋 验 骏 骐 骑 骗 骚 骛 骜 骞 骠 骡 骤 骥 骧 骨 骯 骰 骶 骷 骸 骼 髂 髅 髋 髏 髒 髓 體 髖 高 髦 髪 髮 髯 髻 鬃 鬆 鬍 鬓 鬚 鬟 鬢 鬣 鬥 鬧 鬱 鬼 魁 魂 魄 魅 魇 魍 魏 魔 魘 魚 魯 魷 鮑 鮨 鮪 鮭 鮮 鯉 鯊 鯖 鯛 鯨 鯰 鯽 鰍 鰓 鰭 鰲 鰻 鰾 鱈 鱉 鱔 鱗 鱷 鱸 鱼 鱿 鲁 鲈 鲍 鲑 鲛 鲜 鲟 鲢 鲤 鲨 鲫 鲱 鲲 鲶 鲷 鲸 鳃 鳄 鳅 鳌 鳍 鳕 鳖 鳗 鳝 鳞 鳥 鳩 鳳 鳴 鳶 鴉 鴕 鴛 鴦 鴨 鴻 鴿 鵑 鵜 鵝 鵡 鵬 鵰 鵲 鶘 鶩 鶯 鶴 鷗 鷲 鷹 鷺 鸚 鸞 鸟 鸠 鸡 鸢 鸣 鸥 鸦 鸨 鸪 鸭 鸯 鸳 鸵 鸽 鸾 鸿 鹂 鹃 鹄 鹅 鹈 鹉 鹊 鹌 鹏 鹑 鹕 鹘 鹜 鹞 鹤 鹦 鹧 鹫 鹭 鹰 鹳 鹵 鹹 鹼 鹽 鹿 麂 麋 麒 麓 麗 麝 麟 麥 麦 麩 麴 麵 麸 麺 麻 麼 麽 麾 黃 黄 黍 黎 黏 黑 黒 黔 默 黛 黜 黝 點 黠 黨 黯 黴 鼋 鼎 鼐 鼓 鼠 鼬 鼹 鼻 鼾 齁 齊 齋 齐 齒 齡 齢 齣 齦 齿 龄 龅 龈 龊 龋 龌 龍 龐 龔 龕 龙 龚 龛 龜 龟 ︰ ︱ ︶ ︿ ﹁ ﹂ ﹍ ﹏ ﹐ ﹑ ﹒ ﹔ ﹕ ﹖ ﹗ ﹙ ﹚ ﹝ ﹞ ﹡ ﹣ ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ 。 「 」 、 ・ ッ ー イ ク シ ス ト ノ フ ラ ル ン ゙ ゚  ̄ ¥ 👍 🔥 😂 😎 ... 
yam 10 2017 12 11 2016 20 30 15 06 lofter ##s 2015 by 16 14 18 13 24 17 2014 21 ##0 22 19 25 23 com 100 00 05 2013 ##a 03 09 08 28 ##2 50 01 04 ##1 27 02 2012 ##3 26 ##e 07 ##8 ##5 ##6 ##4 ##9 ##7 29 2011 40 ##t 2010 ##o ##d ##i 2009 ##n app www the ##m 31 ##c ##l ##y ##r ##g 2008 60 http 200 qq ##p 80 ##f google pixnet 90 cookies tripadvisor 500 ##er ##k 35 ##h facebook 2007 2000 70 ##b of ##x ##u 45 300 iphone 32 1000 2006 48 ip 36 in 38 3d ##w ##ing 55 ctrip ##on ##v 33 ##の to 34 400 id 2005 it 37 windows llc top 99 42 39 000 led at ##an 41 51 52 46 49 43 53 44 ##z android 58 and 59 2004 56 vr ##か 5000 2003 47 blogthis twitter 54 ##le 150 ok 2018 57 75 cn no ios ##in ##mm ##00 800 on te 3000 65 2001 360 95 ig lv 120 ##ng ##を ##us ##に pc てす ── 600 ##te 85 2002 88 ##ed html ncc wifi email 64 blog is ##10 ##て mail online ##al dvd ##ic studio ##は ##℃ ##ia ##と line vip 72 ##q 98 ##ce ##en for ##is ##ra ##es ##j usb net cp 1999 asia 4g ##cm diy new 3c ##お ta 66 language vs apple tw 86 web ##ne ipad 62 you ##re 101 68 ##tion ps de bt pony atm ##2017 1998 67 ##ch ceo ##or go ##na av pro cafe 96 pinterest 97 63 pixstyleme3c ##ta more said ##2016 1997 mp3 700 ##ll nba jun ##20 92 tv 1995 pm 61 76 nbsp 250 ##ie linux ##ma cd 110 hd ##17 78 ##ion 77 6000 am ##th ##st 94 ##se ##et 69 180 gdp my 105 81 abc 89 flash 79 one 93 1990 1996 ##ck gps ##も ##ly web885 106 2020 91 ##ge 4000 1500 xd boss isbn 1994 org ##ry me love ##11 0fork 73 ##12 3g ##ter ##ar 71 82 ##la hotel 130 1970 pk 83 87 140 ie ##os ##30 ##el 74 ##50 seo cpu ##ml p2p 84 may ##る sun tue internet cc posted youtube ##at ##ン ##man ii ##ル ##15 abs nt pdf yahoo ago 1980 ##it news mac 104 ##てす ##me ##り java 1992 spa ##de ##nt hk all plus la 1993 ##mb ##16 ##ve west ##da 160 air ##い ##ps から ##to 1989 logo htc php https fi momo ##son sat ##ke ##80 ebd suv wi day apk ##88 ##um mv galaxy wiki or brake ##ス 1200 する this 1991 mon ##こ ❤2017 po ##ない javascript life home june ##ss system 900 ##ー ##0 pp 1988 world fb 4k br ##as 
ic ai leonardo safari ##60 live free xx wed win7 kiehl ##co lg o2o ##go us 235 1949 mm しい vfm kanye ##90 ##2015 ##id jr ##ey 123 rss ##sa ##ro ##am ##no thu fri 350 ##sh ##ki 103 comments name ##のて ##pe ##ine max 1987 8000 uber ##mi ##ton wordpress office 1986 1985 ##ment 107 bd win10 ##ld ##li gmail bb dior ##rs ##ri ##rd ##ます up cad ##® dr して read ##21 をお ##io ##99 url 1984 pvc paypal show policy ##40 ##ty ##18 with ##★ ##01 txt 102 ##ba dna from post mini ar taiwan john ##ga privacy agoda ##13 ##ny word ##24 ##22 ##by ##ur ##hz 1982 ##ang 265 cookie netscape 108 ##ka ##~ ##ad house share note ibm code hello nike sim survey ##016 1979 1950 wikia ##32 ##017 5g cbc ##tor ##kg 1983 ##rt ##14 campaign store 2500 os ##ct ##ts ##° 170 api ##ns 365 excel ##な ##ao ##ら ##し ~~ ##nd university 163 には 518 ##70 ##ya ##il ##25 pierre ipo 0020 897 ##23 hotels ##ian のお 125 years 6606 ##ers ##26 high ##day time ##ay bug ##line ##く ##す ##be xp talk2yam yamservice 10000 coco ##dy sony ##ies 1978 microsoft david people ##ha 1960 instagram intel その ##ot iso 1981 ##va 115 ##mo ##land xxx man co ltxsw ##ation baby 220 ##pa ##ol 1945 7000 tag 450 ##ue msn ##31 oppo ##ト ##ca control ##om st chrome ##ure ##ん be ##き lol ##19 した ##bo 240 lady ##100 ##way ##から 4600 ##ko ##do ##un 4s corporation 168 ##ni herme ##28 cp 978 ##up ##06 ui ##ds ppt admin three します bbc re 128 ##48 ca ##015 ##35 hp ##ee tpp ##た ##ive ×× root ##cc ##ました ##ble ##ity adobe park 114 et oled city ##ex ##ler ##ap china ##book 20000 view ##ice global ##km your hong ##mg out ##ms ng ebay ##29 menu ubuntu ##cy rom ##view open ktv do server ##lo if english ##ね ##5 ##oo 1600 ##02 step1 kong club 135 july inc 1976 mr hi ##net touch ##ls ##ii michael lcd ##05 ##33 phone james step2 1300 ios9 ##box dc ##2 ##ley samsung 111 280 pokemon css ##ent ##les いいえ ##1 s8 atom play bmw ##said sa etf ctrl ♥yoyo♥ ##55 2025 ##2014 ##66 adidas amazon 1958 ##ber ##ner visa ##77 ##der 1800 connectivity ##hi firefox 109 118 hr so style mark pop ol 
skip 1975 as ##27 ##ir ##61 190 mba ##う ##ai le ##ver 1900 cafe2017 lte super 113 129 ##ron amd like ##☆ are ##ster we ##sk paul data international ##ft longchamp ssd good ##ート ##ti reply ##my ↓↓↓ apr star ##ker source 136 js 112 get force photo ##one 126 ##2013 ##ow link bbs 1972 goods ##lin python 119 ##ip game ##ics ##ません blue ##● 520 ##45 page itunes ##03 1955 260 1968 gt gif 618 ##ff ##47 group くたさい about bar ganji ##nce music lee not 1977 1971 1973 ##per an faq comment ##って days ##ock 116 ##bs 1974 1969 v1 player 1956 xbox sql fm f1 139 ##ah 210 ##lv ##mp ##000 melody 1957 ##3 550 17life 199 1966 xml market ##au ##71 999 ##04 what gl ##95 ##age tips ##68 book ##ting mysql can 1959 230 ##ung wonderland watch 10℃ ##ction 9000 mar mobile 1946 1962 article ##db part ▲top party って 1967 1964 1948 ##07 ##ore ##op この dj ##78 ##38 010 main 225 1965 ##ong art 320 ad 134 020 ##73 117 pm2 japan 228 ##08 ts 1963 ##ica der sm ##36 2019 ##wa ct ##7 ##や ##64 1937 homemesh search ##85 ##れは ##tv ##di macbook ##9 ##くたさい service ##♥ type った 750 ##ier ##si ##75 ##います ##ok best ##ット goris lock ##った cf 3m big ##ut ftp carol ##vi 10 1961 happy sd ##ac 122 anti pe cnn iii 1920 138 ##ラ 1940 esp jan tags ##98 ##51 august vol ##86 154 ##™ ##fs ##れ ##sion design ac ##ム press jordan ppp that key check ##6 ##tt ##㎡ 1080p ##lt power ##42 1952 ##bc vivi ##ック he 133 121 jpg ##rry 201 175 3500 1947 nb ##ted ##rn しています 1954 usd ##t00 master ##ンク 001 model ##58 al ##09 1953 ##34 ram goo ても ##ui 127 1930 red ##ary rpg item ##pm ##41 270 ##za project ##2012 hot td blogabstract ##ger ##62 650 ##44 gr2 ##します ##m black electronic nfc year asus また html5 cindy ##hd m3 132 esc ##od booking ##53 fed tvb ##81 ##ina mit 165 ##いる chan 192 distribution next になる peter bios steam cm 1941 にも pk10 ##ix ##65 ##91 dec nasa ##ana icecat 00z b1 will ##46 li se ##ji ##み ##ard oct ##ain jp ##ze ##bi cio ##56 smart h5 ##39 ##port curve vpn ##nm ##dia utc ##あり 12345678910 ##52 rmvb chanel a4 miss ##and ##im media who 
##63 she girl 5s 124 vera ##して class vivo king ##フ ##ei national ab 1951 5cm 888 145 ipod ap 1100 5mm 211 ms 2756 ##69 mp4 msci ##po ##89 131 mg index 380 ##bit ##out ##zz ##97 ##67 158 apec ##8 photoshop opec ¥799 ては ##96 ##tes ##ast 2g ○○ ##ール ¥2899 ##ling ##よ ##ory 1938 ##ical kitty content ##43 step3 ##cn win8 155 vc 1400 iphone7 robert ##した tcl 137 beauty ##87 en dollars ##ys ##oc step pay yy a1 ##2011 ##lly ##ks ##♪ 1939 188 download 1944 sep exe ph います school gb center pr street ##board uv ##37 ##lan winrar ##que ##ua ##com 1942 1936 480 gpu ##4 ettoday fu tom ##54 ##ren ##via 149 ##72 b2b 144 ##79 ##tch rose arm mb ##49 ##ial ##nn nvidia step4 mvp 00㎡ york 156 ##イ how cpi 591 2765 gov kg joe ##xx mandy pa ##ser copyright fashion 1935 don ##け ecu ##ist ##art erp wap have ##lm talk ##ek ##ning ##if ch ##ite video 1943 cs san iot look ##84 ##2010 ##ku october ##ux trump ##hs ##ide box 141 first ##ins april ##ight ##83 185 angel protected aa 151 162 x1 m2 ##fe ##× ##ho size 143 min ofo fun gomaji ex hdmi food dns march chris kevin ##のか ##lla ##pp ##ec ag ems 6s 720p ##rm ##ham off ##92 asp team fandom ed 299 ▌♥ ##ell info されています ##82 sina 4066 161 ##able ##ctor 330 399 315 dll rights ltd idc jul 3kg 1927 142 ma surface ##76 ##ク ~~~ 304 mall eps 146 green ##59 map space donald v2 sodu ##light 1931 148 1700 まて 310 reserved htm ##han ##57 2d 178 mod ##ise ##tions 152 ti ##shi doc 1933 icp 055 wang ##ram shopping aug ##pi ##well now wam b2 からお ##hu 236 1928 ##gb 266 f2 ##93 153 mix ##ef ##uan bwl ##plus ##res core ##ess tea 5℃ hktvmall nhk ##ate list ##ese 301 feb 4m inn ての nov 159 12345 daniel ##ci pass ##bet ##nk coffee 202 ssl airbnb ##ute fbi woshipm skype ea cg sp ##fc ##www yes edge alt 007 ##94 fpga ##ght ##gs iso9001 さい ##ile ##wood ##uo image lin icon american ##em 1932 set says ##king ##tive blogger ##74 なと 256 147 ##ox ##zy ##red ##ium ##lf nokia claire ##リ ##ding november lohas ##500 ##tic ##マ ##cs ##ある ##che ##ire ##gy ##ult db january win ##カ 166 road 
ptt ##ま ##つ 198 ##fa ##mer anna pchome はい udn ef 420 ##time ##tte 2030 ##ア g20 white かかります 1929 308 garden eleven di ##おります chen 309b 777 172 young cosplay ちてない 4500 bat ##123 ##tra ##ては kindle npc steve etc ##ern ##| call xperia ces travel sk s7 ##ous 1934 ##int みいたたけます 183 edu file cho qr ##car ##our 186 ##ant ##d eric 1914 rends ##jo ##する mastercard ##2000 kb ##min 290 ##ino vista ##ris ##ud jack 2400 ##set 169 pos 1912 ##her ##ou taipei しく 205 beta ##ませんか 232 ##fi express 255 body ##ill aphojoy user december meiki ##ick tweet richard ##av ##ᆫ iphone6 ##dd ちてすか views ##mark 321 pd ##00 times ##▲ level ##ash 10g point 5l ##ome 208 koreanmall ##ak george q2 206 wma tcp ##200 スタッフ full mlb ##lle ##watch tm run 179 911 smith business ##und 1919 color ##tal 222 171 ##less moon 4399 ##rl update pcb shop 499 157 little なし end ##mhz van dsp easy 660 ##house ##key history ##o oh ##001 ##hy ##web oem let was ##2009 ##gg review ##wan 182 ##°c 203 uc title ##val united 233 2021 ##ons doi trivago overdope sbs ##ance ##ち grand special 573032185 imf 216 wx17house ##so ##ーム audi ##he london william ##rp ##ake science beach cfa amp ps4 880 ##800 ##link ##hp crm ferragamo bell make ##eng 195 under zh photos 2300 ##style ##ント via 176 da ##gi company i7 ##ray thomas 370 ufo i5 ##max plc ben back research 8g 173 mike ##pc ##ッフ september 189 ##ace vps february 167 pantos wp lisa 1921 ★★ jquery night long offer ##berg ##news 1911 ##いて ray fks wto せます over 164 340 ##all ##rus 1924 ##888 ##works blogtitle loftpermalink ##→ 187 martin test ling km ##め 15000 fda v3 ##ja ##ロ wedding かある outlet family ##ea をこ ##top story ##ness salvatore ##lu 204 swift 215 room している oracle ##ul 1925 sam b2c week pi rock ##のは ##a ##けと ##ean ##300 ##gle cctv after chinese ##back powered x2 ##tan 1918 ##nes ##イン canon only 181 ##zi ##las say ##oe 184 ##sd 221 ##bot ##world ##zo sky made top100 just 1926 pmi 802 234 gap ##vr 177 les 174 ▲topoct ball vogue vi ing ofweek cos ##list ##ort ▲topmay ##なら ##lon として 
last ##tc ##of ##bus ##gen real eva ##コ a3 nas ##lie ##ria ##coin ##bt ▲topapr his 212 cat nata vive health ⋯⋯ drive sir ▲topmar du cup ##カー ##ook ##よう ##sy alex msg tour しました 3ce ##word 193 ebooks r8 block 318 ##より 2200 nice pvp 207 months 1905 rewards ##ther 1917 0800 ##xi ##チ ##sc micro 850 gg blogfp op 1922 daily m1 264 true ##bb ml ##tar ##のお ##ky anthony 196 253 ##yo state 218 ##ara ##aa ##rc ##tz ##ston より gear ##eo ##ade ge see 1923 ##win ##ura ss heart ##den ##ita down ##sm el png 2100 610 rakuten whatsapp bay dream add ##use 680 311 pad gucci mpv ##ode ##fo island ▲topjun ##▼ 223 jason 214 chicago ##❤ しの ##hone io ##れる ##ことか sogo be2 ##ology 990 cloud vcd ##con 2~3 ##ford ##joy ##kb ##こさいます ##rade but ##ach docker ##ful rfid ul ##ase hit ford ##star 580 ##○ 11 a2 sdk reading edited ##are cmos ##mc 238 siri light ##ella ##ため bloomberg ##read pizza ##ison jimmy ##vm college node journal ba 18k ##play 245 ##cer 20 magic ##yu 191 jump 288 tt ##ings asr ##lia 3200 step5 network ##cd mc いします 1234 pixstyleme 273 ##600 2800 money ★★★★★ 1280 12 430 bl みの act ##tus tokyo ##rial ##life emba ##ae saas tcs ##rk ##wang summer ##sp ko ##ving 390 premium ##その netflix ##ヒ uk mt ##lton right frank two 209 える ##ple ##cal 021 ##んな ##sen ##ville hold nexus dd ##ius てお ##mah ##なく tila zero 820 ce ##tin resort ##ws charles old p10 5d report ##360 ##ru ##には bus vans lt ##est pv ##レ links rebecca ##ツ ##dm azure ##365 きな limited bit 4gb ##mon 1910 moto ##eam 213 1913 var eos なとの 226 blogspot された 699 e3 dos dm fc ##ments ##ik ##kw boy ##bin ##ata 960 er ##せ 219 ##vin ##tu ##ula 194 ##∥ station ##ろ ##ature 835 files zara hdr top10 nature 950 magazine s6 marriott ##シ avira case ##っと tab ##ran tony ##home oculus im ##ral jean saint cry 307 rosie ##force ##ini ice ##bert のある ##nder ##mber pet 2600 ##◆ plurk ▲topdec ##sis 00kg ▲topnov 720 ##ence tim ##ω ##nc ##ても ##name log ips great ikea malaysia unix ##イト 3600 ##ncy ##nie 12000 akb48 ##ye ##oid 404 ##chi ##いた oa xuehai ##1000 ##orm 
##rf 275 さん ##ware ##リー 980 ho ##pro text ##era 560 bob 227 ##ub ##2008 8891 scp avi ##zen 2022 mi wu museum qvod apache lake jcb ▲topaug ★★★ ni ##hr hill 302 ne weibo 490 ruby ##ーシ ##ヶ ##row 4d ▲topjul iv ##ish github 306 mate 312 ##スト ##lot ##ane andrew のハイト ##tina t1 rf ed2k ##vel ##900 way final りの ns 5a 705 197 ##メ sweet bytes ##ene ▲topjan 231 ##cker ##2007 ##px 100g topapp 229 helpapp rs low 14k g4g care 630 ldquo あり ##fork leave rm edition ##gan ##zon ##qq ▲topsep ##google ##ism gold 224 explorer ##zer toyota category select visual ##labels restaurant ##md posts s1 ##ico もっと angelababy 123456 217 sports s3 mbc 1915 してくたさい shell x86 candy ##new kbs face xl 470 ##here 4a swissinfo v8 ▲topfeb dram ##ual ##vice 3a ##wer sport q1 ios10 public int card ##c ep au rt ##れた 1080 bill ##mll kim 30 460 wan ##uk ##ミ x3 298 0t scott ##ming 239 e5 ##3d h7n9 worldcat brown ##あります ##vo ##led ##580 ##ax 249 410 ##ert paris ##~6 polo 925 ##lr 599 ##ナ capital ##hing bank cv 1g ##chat ##s ##たい adc ##ule 2m ##e digital hotmail 268 ##pad 870 bbq quot ##ring before wali ##まて mcu 2k 2b という costco 316 north 333 switch ##city ##p philips ##mann management panasonic ##cl ##vd ##ping ##rge alice ##lk ##ましょう css3 ##ney vision alpha ##ular ##400 ##tter lz にお ##ありません mode gre 1916 pci ##tm 237 1~2 ##yan ##そ について ##let ##キ work war coach ah mary ##ᅵ huang ##pt a8 pt follow ##berry 1895 ##ew a5 ghost ##ション ##wn ##og south ##code girls ##rid action villa git r11 table games ##cket error ##anonymoussaid ##ag here ##ame ##gc qa ##■ ##lis gmp ##gin vmalife ##cher yu wedding ##tis demo dragon 530 soho social bye ##rant river orz acer 325 ##↑ ##ース ##ats 261 del ##ven 440 ups ##ように ##ター 305 value macd yougou ##dn 661 ##ano ll ##urt ##rent continue script ##wen ##ect paper 263 319 shift ##chel ##フト ##cat 258 x5 fox 243 ##さん car aaa ##blog loading ##yn ##tp kuso 799 si sns イカせるテンマ ヒンクテンマ3 rmb vdc forest central prime help ultra ##rmb ##ような 241 square 688 ##しい のないフロクに ##field ##reen ##ors ##ju c1 
start 510 ##air ##map cdn ##wo cba stephen m8 100km ##get opera ##base ##ood vsa com™ ##aw ##ail 251 なのて count t2 ##ᅡ ##een 2700 hop ##gp vsc tree ##eg ##ose 816 285 ##ories ##shop alphago v4 1909 simon ##ᆼ fluke62max zip スホンサー ##sta louis cr bas ##~10 bc ##yer hadoop ##ube ##wi 1906 0755 hola ##low place centre 5v d3 ##fer 252 ##750 ##media 281 540 0l exchange 262 series ##ハー ##san eb ##bank ##k q3 ##nge ##mail take ##lp 259 1888 client east cache event vincent ##ールを きを ##nse sui 855 adchoice ##и ##stry ##なたの 246 ##zone ga apps sea ##ab 248 cisco ##タ ##rner kymco ##care dha ##pu ##yi minkoff royal p1 への annie 269 collection kpi playstation 257 になります 866 bh ##bar queen 505 radio 1904 andy armani ##xy manager iherb ##ery ##share spring raid johnson 1908 ##ob volvo hall ##ball v6 our taylor ##hk bi 242 ##cp kate bo water technology ##rie サイトは 277 ##ona ##sl hpv 303 gtx hip rdquo jayz stone ##lex ##rum namespace ##やり 620 ##ale ##atic des ##erson ##ql ##ves ##type enter ##この ##てきます d2 ##168 ##mix ##bian との a9 jj ky ##lc access movie ##hc リストに tower ##ration ##mit ます ##nch ua tel prefix ##o2 1907 ##point 1901 ott ~10 ##http ##ury baidu ##ink member ##logy bigbang nownews ##js ##shot ##tb ##こと 247 eba ##tics ##lus ける v5 spark ##ama there ##ions god ##lls ##down hiv ##ress burberry day2 ##kv ◆◆ jeff related film edit joseph 283 ##ark cx 32gb order g9 30000 ##ans ##tty s5 ##bee かあります thread xr buy sh 005 land spotify mx ##ari 276 ##verse ×email sf why ##ことて 244 7headlines nego sunny dom exo 401 666 positioning fit rgb ##tton 278 kiss alexa adam lp みリストを ##g mp ##ties ##llow amy ##du np 002 institute 271 ##rth ##lar 2345 590 ##des sidebar 15 imax site ##cky ##kit ##ime ##009 season 323 ##fun ##ンター ##ひ gogoro a7 pu lily fire twd600 ##ッセーシを いて ##vis 30ml ##cture ##をお information ##オ close friday ##くれる yi nick てすか ##tta ##tel 6500 ##lock cbd economy 254 かお 267 tinker double 375 8gb voice ##app oops channel today 985 ##right raw xyz ##+ jim edm ##cent 7500 supreme 814 ds ##its 
##asia dropbox ##てすか ##tti books 272 100ml ##tle ##ller ##ken ##more ##boy sex 309 ##dom t3 ##ider ##なります ##unch 1903 810 feel 5500 ##かった ##put により s2 mo ##gh men ka amoled div ##tr ##n1 port howard ##tags ken dnf ##nus adsense ##а ide ##へ buff thunder ##town ##ique has ##body auto pin ##erry tee てした 295 number ##the ##013 object psp cool udnbkk 16gb ##mic miui ##tro most r2 ##alk ##nity 1880 ±0 ##いました 428 s4 law version ##oa n1 sgs docomo ##tf ##ack henry fc2 ##ded ##sco ##014 ##rite 286 0mm linkedin ##ada ##now wii ##ndy ucbug ##◎ sputniknews legalminer ##ika ##xp 2gb ##bu q10 oo b6 come ##rman cheese ming maker ##gm nikon ##fig ppi kelly ##ります jchere てきます ted md 003 fgo tech ##tto dan soc ##gl ##len hair earth 640 521 img ##pper ##a1 ##てきる ##ロク acca ##ition ##ference suite ##ig outlook ##mond ##cation 398 ##pr 279 101vip 358 ##999 282 64gb 3800 345 airport ##over 284 ##おり jones ##ith lab ##su ##いるのて co2 town piece ##llo no1 vmware 24h ##qi focus reader ##admin ##ora tb false ##log 1898 know lan 838 ##ces f4 ##ume motel stop ##oper na flickr netcomponents ##af ##─ pose williams local ##ound ##cg ##site ##iko いお 274 5m gsm con ##ath 1902 friends ##hip cell 317 ##rey 780 cream ##cks 012 ##dp facebooktwitterpinterestgoogle sso 324 shtml song swiss ##mw ##キンク lumia xdd string tiffany 522 marc られた insee russell sc dell ##ations ok camera 289 ##vs ##flow ##late classic 287 ##nter stay g1 mtv 512 ##ever ##lab ##nger qe sata ryan d1 50ml cms ##cing su 292 3300 editor 296 ##nap security sunday association ##ens ##700 ##bra acg ##かり sofascore とは mkv ##ign jonathan gary build labels ##oto tesla moba qi gohappy general ajax 1024 ##かる サイト society ##test ##urs wps fedora ##ich mozilla 328 ##480 ##dr usa urn ##lina ##r grace ##die ##try ##ader 1250 ##なり elle 570 ##chen ##ᆯ price ##ten uhz ##ough eq ##hen states push session balance wow 506 ##cus ##py when ##ward ##ep 34e wong library prada ##サイト ##cle running ##ree 313 ck date q4 ##ctive ##ool ##> mk ##ira ##163 388 die secret 
rq dota buffet は1ヶ e6 ##ez pan 368 ha ##card ##cha 2a ##さ alan day3 eye f3 ##end france keep adi rna tvbs ##ala solo nova ##え ##tail ##ょう support ##ries ##なる ##ved base copy iis fps ##ways hero hgih profile fish mu ssh entertainment chang ##wd click cake ##ond pre ##tom kic pixel ##ov ##fl product 6a ##pd dear ##gate es yumi audio ##² ##sky echo bin where ##ture 329 ##ape find sap isis ##なと nand ##101 ##load ##ream band a6 525 never ##post festival 50cm ##we 555 guide 314 zenfone ##ike 335 gd forum jessica strong alexander ##ould software allen ##ious program 360° else lohasthree ##gar することかてきます please ##れます rc ##ggle ##ric bim 50000 ##own eclipse 355 brian 3ds ##side 061 361 ##other ##ける ##tech ##ator 485 engine ##ged ##t plaza ##fit cia ngo westbrook shi tbs 50mm ##みませんか sci 291 reuters ##ily contextlink ##hn af ##cil bridge very ##cel 1890 cambridge ##ize 15g ##aid ##data 790 frm ##head award butler ##sun meta ##mar america ps3 puma pmid ##すか lc 670 kitchen ##lic オーフン5 きなしソフトサーヒス そして day1 future ★★★★ ##text ##page ##rris pm1 ##ket fans ##っています 1001 christian bot kids trackback ##hai c3 display ##hl n2 1896 idea さんも ##sent airmail ##ug ##men pwm けます 028 ##lution 369 852 awards schemas 354 asics wikipedia font ##tional ##vy c2 293 ##れている ##dget ##ein っている contact pepper スキル 339 ##~5 294 ##uel ##ument 730 ##hang みてす q5 ##sue rain ##ndi wei swatch ##cept わせ 331 popular ##ste ##tag p2 501 trc 1899 ##west ##live justin honda ping messenger ##rap v9 543 ##とは unity appqq はすへて 025 leo ##tone ##テ ##ass uniqlo ##010 502 her jane memory moneydj ##tical human 12306 していると ##m2 coc miacare ##mn tmt ##core vim kk ##may fan target use too 338 435 2050 867 737 fast ##2c services ##ope omega energy ##わ pinkoi 1a ##なから ##rain jackson ##ement ##シャンルの 374 366 そんな p9 rd ##ᆨ 1111 ##tier ##vic zone ##│ 385 690 dl isofix cpa m4 322 kimi めて davis ##lay lulu ##uck 050 weeks qs ##hop 920 ##n ae ##ear ~5 eia 405 ##fly korea jpeg boost ##ship small ##リア 1860 eur 297 425 valley ##iel simple 
##ude rn k2 ##ena されます non patrick しているから ##ナー feed 5757 30g process well qqmei ##thing they aws lu pink ##ters ##kin または board ##vertisement wine ##ien unicode ##dge r1 359 ##tant いを ##twitter ##3c cool1 される ##れて ##l isp ##012 standard 45㎡2 402 ##150 matt ##fu 326 ##iner googlemsn pixnetfacebookyahoo ##ラン x7 886 ##uce メーカー sao ##ev ##きました ##file 9678 403 xddd shirt 6l ##rio ##hat 3mm givenchy ya bang ##lio monday crystal ロクイン ##abc 336 head 890 ubuntuforumwikilinuxpastechat ##vc ##~20 ##rity cnc 7866 ipv6 null 1897 ##ost yang imsean tiger ##fet ##ンス 352 ##= dji 327 ji maria ##come ##んて foundation 3100 ##beth ##なった 1m 601 active ##aft ##don 3p sr 349 emma ##khz living 415 353 1889 341 709 457 sas x6 ##face pptv x4 ##mate han sophie ##jing 337 fifa ##mand other sale inwedding ##gn てきちゃいます ##mmy ##pmlast bad nana nbc してみてくたさいね なとはお ##wu ##かあります ##あ note7 single ##340 せからこ してくたさい♪この しにはとんとんワークケートを するとあなたにもっとマッチした ならワークケートへ もみつかっちゃうかも ワークケートの ##bel window ##dio ##ht union age 382 14 ##ivity ##y コメント domain neo ##isa ##lter 5k f5 steven ##cts powerpoint tft self g2 ft ##テル zol ##act mwc 381 343 もう nbapop 408 てある eds ace ##room previous author tomtom il ##ets hu financial ☆☆☆ っています bp 5t chi 1gb ##hg fairmont cross 008 gay h2 function ##けて 356 also 1b 625 ##ータ ##raph 1894 3~5 ##ils i3 334 avenue ##host による ##bon ##tsu message navigation 50g fintech h6 ##ことを 8cm ##ject ##vas ##firm credit ##wf xxxx form ##nor ##space huawei plan json sbl ##dc machine 921 392 wish ##120 ##sol windows7 edward ##ために development washington ##nsis lo 818 ##sio ##ym ##bor planet ##~8 ##wt ieee gpa ##めて camp ann gm ##tw ##oka connect ##rss ##work ##atus wall chicken soul 2mm ##times fa ##ather ##cord 009 ##eep hitachi gui harry ##pan e1 disney ##press ##ーション wind 386 frigidaire ##tl liu hsu 332 basic von ev いた てきる スホンサーサイト learning ##ull expedia archives change ##wei santa cut ins 6gb turbo brand cf1 508 004 return 747 ##rip h1 ##nis ##をこ 128gb ##にお 3t application しており emc rx ##oon 384 quick 412 
15058 wilson wing chapter ##bug beyond ##cms ##dar ##oh zoom e2 trip sb ##nba rcep 342 aspx ci 080 gc gnu める ##count advanced dance dv ##url ##ging 367 8591 am09 shadow battle 346 ##i ##cia ##という emily ##のてす ##tation host ff techorz sars ##mini ##mporary ##ering nc 4200 798 ##next cma ##mbps ##gas ##ift ##dot ##ィ 455 ##~17 amana ##りの 426 ##ros ir 00㎡1 ##eet ##ible ##↓ 710 ˋ▽ˊ ##aka dcs iq ##v l1 ##lor maggie ##011 ##iu 588 ##~1 830 ##gt 1tb articles create ##burg ##iki database fantasy ##rex ##cam dlc dean ##you hard path gaming victoria maps cb ##lee ##itor overchicstoretvhome systems ##xt 416 p3 sarah 760 ##nan 407 486 x9 install second 626 ##ann ##ph ##rcle ##nic 860 ##nar ec ##とう 768 metro chocolate ##rian ~4 ##table ##しています skin ##sn 395 mountain ##0mm inparadise 6m 7x24 ib 4800 ##jia eeworld creative g5 g3 357 parker ecfa village からの 18000 sylvia サーヒス hbl ##ques ##onsored ##x2 ##きます ##v4 ##tein ie6 383 ##stack 389 ver ##ads ##baby sound bbe ##110 ##lone ##uid ads 022 gundam 351 thinkpad 006 scrum match ##ave mems ##470 ##oy ##なりました ##talk glass lamigo span ##eme job ##a5 jay wade kde 498 ##lace ocean tvg ##covery ##r3 ##ners ##rea junior think ##aine cover ##ision ##sia ↓↓ ##bow msi 413 458 406 ##love 711 801 soft z2 ##pl 456 1840 mobil mind ##uy 427 nginx ##oi めた ##rr 6221 ##mple ##sson ##ーシてす 371 ##nts 91tv comhd crv3000 ##uard 1868 397 deep lost field gallery ##bia rate spf redis traction 930 icloud 011 なら fe jose 372 ##tory into sohu fx 899 379 kicstart2 ##hia すく ##~3 ##sit ra 24 ##walk ##xure 500g ##pact pacific xa natural carlo ##250 ##walker 1850 ##can cto gigi 516 ##サー pen ##hoo ob matlab ##b ##yy 13913459 ##iti mango ##bbs sense c5 oxford ##ニア walker jennifer ##ola course ##bre 701 ##pus ##rder lucky 075 ##ぁ ivy なお ##nia sotheby side ##ugh joy ##orage ##ush ##bat ##dt 364 r9 ##2d ##gio 511 country wear ##lax ##~7 ##moon 393 seven study 411 348 lonzo 8k ##ェ evolution ##イフ ##kk gs kd ##レス arduino 344 b12 ##lux arpg ##rdon cook ##x5 dark five ##als 
##ida とても sign 362 ##ちの something 20mm ##nda 387 ##posted fresh tf 1870 422 cam ##mine ##skip ##form ##ssion education 394 ##tee dyson stage ##jie want ##night epson pack あります ##ppy テリヘル ##█ wd ##eh ##rence left ##lvin golden mhz discovery ##trix ##n2 loft ##uch ##dra ##sse speed ~1 1mdb sorry welcome ##urn wave gaga ##lmer teddy ##160 トラックハック せよ 611 ##f2016 378 rp ##sha rar ##あなたに ##きた 840 holiday ##ュー 373 074 ##vg ##nos ##rail gartner gi 6p ##dium kit 488 b3 eco ##ろう 20g sean ##stone autocad nu ##np f16 write 029 m5 ##ias images atp ##dk fsm 504 1350 ve 52kb ##xxx ##のに ##cake 414 unit lim ru 1v ##ification published angela 16g analytics ak ##q ##nel gmt ##icon again ##₂ ##bby ios11 445 かこさいます waze いてす ##ハ 9985 ##ust ##ティー framework ##007 iptv delete 52sykb cl wwdc 027 30cm ##fw ##ての 1389 ##xon brandt ##ses ##dragon tc vetements anne monte modern official ##へて ##ere ##nne ##oud もちろん 50 etnews ##a2 ##graphy 421 863 ##ちゃん 444 ##rtex ##てお l2 ##gma mount ccd たと archive morning tan ddos e7 ##ホ day4 ##ウ gis 453 its 495 factory bruce pg ##ito ってくたさい guest cdma ##lling 536 n3 しかし 3~4 mega eyes ro 13 women dac church ##jun singapore ##facebook 6991 starbucks ##tos ##stin ##shine zen ##mu tina 20℃ 1893 ##たけて 503 465 request ##gence qt ##っ 1886 347 363 q7 ##zzi diary ##tore 409 ##ead 468 cst ##osa canada agent va ##jiang ##ちは ##ーク ##lam sg ##nix ##sday ##よって g6 ##master bing ##zl charlie 16 8mm nb40 ##ーン thai ##ルフ ln284ct ##itz ##2f bonnie ##food ##lent originals ##stro ##lts 418 ∟∣ ##bscribe children ntd yesstyle ##かも hmv ##tment d5 2cm arts sms ##pn ##я ##いい topios9 539 lifestyle virtual ##ague xz ##deo muji 024 unt ##nnis ##ᅩ faq1 1884 396 ##ette fly 64㎡ はしめまして 441 curry ##pop のこ release ##← ##◆◆ ##cast 073 ありな 500ml ##ews 5c ##stle ios7 ##ima 787 dog lenovo ##r4 roger 013 cbs vornado 100m 417 ##desk ##クok ##ald 1867 9595 2900 ##van oil ##x some break common ##jy ##lines g7 twice 419 ella nano belle にこ ##mes ##self ##note jb ##ことかてきます benz ##との ##ova 451 save ##wing 
##ますのて kai りは ##hua ##rect rainer ##unge 448 ##0m adsl ##かな guestname ##uma ##kins ##zu tokichoi ##price county ##med ##mus rmk 391 address vm えて openload ##group ##hin ##iginal amg urban ##oz jobs emi ##public beautiful ##sch album ##dden ##bell jerry works hostel miller ##drive ##rmin ##10 376 boot 828 ##370 ##fx ##cm~ 1885 ##nome ##ctionary ##oman ##lish ##cr ##hm 433 ##how 432 francis xi c919 b5 evernote ##uc vga ##3000 coupe ##urg ##cca ##uality 019 6g れる multi ##また ##ett em hey ##ani ##tax ##rma inside than 740 leonnhurt ##jin ict れた bird notes 200mm くの ##dical ##lli result 442 iu ee 438 smap gopro ##last yin pure 998 32g けた 5kg ##dan ##rame mama ##oot bean marketing ##hur 2l bella sync xuite ##ground 515 discuz ##getrelax ##ince ##bay ##5s cj ##イス gmat apt ##pass jing ##rix c4 rich ##とても niusnews ##ello bag 770 ##eting ##mobile 18 culture 015 ##のてすか 377 1020 area ##ience 616 details gp universal silver dit はお private ddd u11 kanshu ##ified fung ##nny dx ##520 tai 475 023 ##fr ##lean 3s ##pin 429 ##rin 25000 ly rick ##bility usb3 banner ##baru ##gion metal dt vdf 1871 karl qualcomm bear 1010 oldid ian jo ##tors population ##ernel 1882 mmorpg ##mv ##bike 603 ##© ww friend ##ager exhibition ##del ##pods fpx structure ##free ##tings kl ##rley ##copyright ##mma california 3400 orange yoga 4l canmake honey ##anda ##コメント 595 nikkie ##ルハイト dhl publishing ##mall ##gnet 20cm 513 ##クセス ##┅ e88 970 ##dog fishbase ##! ##" ### ##$ ##% ##& ##' ##( ##) ##* ##+ ##, ##- ##. ##/ ##: ##; ##< ##= ##> ##? 
##@ ##[ ##\ ##] ##^ ##_ ##{ ##| ##} ##~ ##£ ##¤ ##¥ ##§ ##« ##± ##³ ##µ ##· ##¹ ##º ##» ##¼ ##ß ##æ ##÷ ##ø ##đ ##ŋ ##ɔ ##ə ##ɡ ##ʰ ##ˇ ##ˈ ##ˊ ##ˋ ##ˍ ##ː ##˙ ##˚ ##ˢ ##α ##β ##γ ##δ ##ε ##η ##θ ##ι ##κ ##λ ##μ ##ν ##ο ##π ##ρ ##ς ##σ ##τ ##υ ##φ ##χ ##ψ ##б ##в ##г ##д ##е ##ж ##з ##к ##л ##м ##н ##о ##п ##р ##с ##т ##у ##ф ##х ##ц ##ч ##ш ##ы ##ь ##і ##ا ##ب ##ة ##ت ##د ##ر ##س ##ع ##ل ##م ##ن ##ه ##و ##ي ##۩ ##ก ##ง ##น ##ม ##ย ##ร ##อ ##า ##เ ##๑ ##་ ##ღ ##ᄀ ##ᄁ ##ᄂ ##ᄃ ##ᄅ ##ᄆ ##ᄇ ##ᄈ ##ᄉ ##ᄋ ##ᄌ ##ᄎ ##ᄏ ##ᄐ ##ᄑ ##ᄒ ##ᅢ ##ᅣ ##ᅥ ##ᅦ ##ᅧ ##ᅨ ##ᅪ ##ᅬ ##ᅭ ##ᅮ ##ᅯ ##ᅲ ##ᅳ ##ᅴ ##ᆷ ##ᆸ ##ᆺ ##ᆻ ##ᗜ ##ᵃ ##ᵉ ##ᵍ ##ᵏ ##ᵐ ##ᵒ ##ᵘ ##‖ ##„ ##† ##• ##‥ ##‧ ##
 ##‰ ##′ ##″ ##‹ ##› ##※ ##‿ ##⁄ ##ⁱ ##⁺ ##ⁿ ##₁ ##₃ ##₄ ##€ ##№ ##ⅰ ##ⅱ ##ⅲ ##ⅳ ##ⅴ ##↔ ##↗ ##↘ ##⇒ ##∀ ##− ##∕ ##∙ ##√ ##∞ ##∟ ##∠ ##∣ ##∩ ##∮ ##∶ ##∼ ##∽ ##≈ ##≒ ##≡ ##≤ ##≥ ##≦ ##≧ ##≪ ##≫ ##⊙ ##⋅ ##⋈ ##⋯ ##⌒ ##① ##② ##③ ##④ ##⑤ ##⑥ ##⑦ ##⑧ ##⑨ ##⑩ ##⑴ ##⑵ ##⑶ ##⑷ ##⑸ ##⒈ ##⒉ ##⒊ ##⒋ ##ⓒ ##ⓔ ##ⓘ ##━ ##┃ ##┆ ##┊ ##┌ ##└ ##├ ##┣ ##═ ##║ ##╚ ##╞ ##╠ ##╭ ##╮ ##╯ ##╰ ##╱ ##╳ ##▂ ##▃ ##▅ ##▇ ##▉ ##▋ ##▌ ##▍ ##▎ ##□ ##▪ ##▫ ##▬ ##△ ##▶ ##► ##▽ ##◇ ##◕ ##◠ ##◢ ##◤ ##☀ ##☕ ##☞ ##☺ ##☼ ##♀ ##♂ ##♠ ##♡ ##♣ ##♦ ##♫ ##♬ ##✈ ##✔ ##✕ ##✖ ##✦ ##✨ ##✪ ##✰ ##✿ ##❀ ##➜ ##➤ ##⦿ ##、 ##。 ##〃 ##々 ##〇 ##〈 ##〉 ##《 ##》 ##「 ##」 ##『 ##』 ##【 ##】 ##〓 ##〔 ##〕 ##〖 ##〗 ##〜 ##〝 ##〞 ##ぃ ##ぇ ##ぬ ##ふ ##ほ ##む ##ゃ ##ゅ ##ゆ ##ょ ##゜ ##ゝ ##ァ ##ゥ ##エ ##ォ ##ケ ##サ ##セ ##ソ ##ッ ##ニ ##ヌ ##ネ ##ノ ##ヘ ##モ ##ャ ##ヤ ##ュ ##ユ ##ョ ##ヨ ##ワ ##ヲ ##・ ##ヽ ##ㄅ ##ㄆ ##ㄇ ##ㄉ ##ㄋ ##ㄌ ##ㄍ ##ㄎ ##ㄏ ##ㄒ ##ㄚ ##ㄛ ##ㄞ ##ㄟ ##ㄢ ##ㄤ ##ㄥ ##ㄧ ##ㄨ ##ㆍ ##㈦ ##㊣ ##㗎 ##一 ##丁 ##七 ##万 ##丈 ##三 ##上 ##下 ##不 ##与 ##丐 ##丑 ##专 ##且 ##丕 ##世 ##丘 ##丙 ##业 ##丛 ##东 ##丝 ##丞 ##丟 ##両 ##丢 ##两 ##严 ##並 ##丧 ##丨 ##个 ##丫 ##中 ##丰 ##串 ##临 ##丶 ##丸 ##丹 ##为 ##主 ##丼 ##丽 ##举 ##丿 ##乂 ##乃 ##久 ##么 ##义 ##之 ##乌 ##乍 ##乎 ##乏 ##乐 ##乒 ##乓 ##乔 ##乖 ##乗 ##乘 ##乙 ##乜 ##九 ##乞 ##也 ##习 ##乡 ##书 ##乩 ##买 ##乱 ##乳 ##乾 ##亀 ##亂 ##了 ##予 ##争 ##事 ##二 ##于 ##亏 ##云 ##互 ##五 ##井 ##亘 ##亙 ##亚 ##些 ##亜 ##亞 ##亟 ##亡 ##亢 ##交 ##亥 ##亦 ##产 ##亨 ##亩 ##享 ##京 ##亭 ##亮 ##亲 ##亳 ##亵 ##人 ##亿 ##什 ##仁 ##仃 ##仄 ##仅 ##仆 ##仇 ##今 ##介 ##仍 ##从 ##仏 ##仑 ##仓 ##仔 ##仕 ##他 ##仗 ##付 ##仙 ##仝 ##仞 ##仟 ##代 ##令 ##以 ##仨 ##仪 ##们 ##仮 ##仰 ##仲 ##件 ##价 ##任 ##份 ##仿 ##企 ##伉 ##伊 ##伍 ##伎 ##伏 ##伐 ##休 ##伕 ##众 ##优 ##伙 ##会 ##伝 ##伞 ##伟 ##传 ##伢 ##伤 ##伦 ##伪 ##伫 ##伯 ##估 ##伴 ##伶 ##伸 ##伺 ##似 ##伽 ##佃 ##但 ##佇 ##佈 ##位 ##低 ##住 ##佐 ##佑 ##体 ##佔 ##何 ##佗 ##佘 ##余 ##佚 ##佛 ##作 ##佝 ##佞 ##佟 ##你 ##佢 ##佣 ##佤 ##佥 ##佩 ##佬 ##佯 ##佰 ##佳 ##併 ##佶 ##佻 ##佼 ##使 ##侃 ##侄 ##來 ##侈 ##例 ##侍 ##侏 ##侑 ##侖 ##侗 ##供 ##依 ##侠 ##価 ##侣 ##侥 ##侦 ##侧 ##侨 ##侬 ##侮 ##侯 ##侵 ##侶 ##侷 ##便 ##係 ##促 ##俄 ##俊 ##俎 ##俏 ##俐 ##俑 ##俗 ##俘 ##俚 ##保 ##俞 ##俟 ##俠 ##信 ##俨 ##俩 ##俪 ##俬 ##俭 ##修 ##俯 ##俱 ##俳 ##俸 ##俺 ##俾 ##倆 ##倉 ##個 ##倌 
##倍 ##倏 ##們 ##倒 ##倔 ##倖 ##倘 ##候 ##倚 ##倜 ##借 ##倡 ##値 ##倦 ##倩 ##倪 ##倫 ##倬 ##倭 ##倶 ##债 ##值 ##倾 ##偃 ##假 ##偈 ##偉 ##偌 ##偎 ##偏 ##偕 ##做 ##停 ##健 ##側 ##偵 ##偶 ##偷 ##偻 ##偽 ##偿 ##傀 ##傅 ##傍 ##傑 ##傘 ##備 ##傚 ##傢 ##傣 ##傥 ##储 ##傩 ##催 ##傭 ##傲 ##傳 ##債 ##傷 ##傻 ##傾 ##僅 ##働 ##像 ##僑 ##僕 ##僖 ##僚 ##僥 ##僧 ##僭 ##僮 ##僱 ##僵 ##價 ##僻 ##儀 ##儂 ##億 ##儆 ##儉 ##儋 ##儒 ##儕 ##儘 ##償 ##儡 ##優 ##儲 ##儷 ##儼 ##儿 ##兀 ##允 ##元 ##兄 ##充 ##兆 ##兇 ##先 ##光 ##克 ##兌 ##免 ##児 ##兑 ##兒 ##兔 ##兖 ##党 ##兜 ##兢 ##入 ##內 ##全 ##兩 ##八 ##公 ##六 ##兮 ##兰 ##共 ##兲 ##关 ##兴 ##兵 ##其 ##具 ##典 ##兹 ##养 ##兼 ##兽 ##冀 ##内 ##円 ##冇 ##冈 ##冉 ##冊 ##册 ##再 ##冏 ##冒 ##冕 ##冗 ##写 ##军 ##农 ##冠 ##冢 ##冤 ##冥 ##冨 ##冪 ##冬 ##冯 ##冰 ##冲 ##决 ##况 ##冶 ##冷 ##冻 ##冼 ##冽 ##冾 ##净 ##凄 ##准 ##凇 ##凈 ##凉 ##凋 ##凌 ##凍 ##减 ##凑 ##凛 ##凜 ##凝 ##几 ##凡 ##凤 ##処 ##凪 ##凭 ##凯 ##凰 ##凱 ##凳 ##凶 ##凸 ##凹 ##出 ##击 ##函 ##凿 ##刀 ##刁 ##刃 ##分 ##切 ##刈 ##刊 ##刍 ##刎 ##刑 ##划 ##列 ##刘 ##则 ##刚 ##创 ##初 ##删 ##判 ##別 ##刨 ##利 ##刪 ##别 ##刮 ##到 ##制 ##刷 ##券 ##刹 ##刺 ##刻 ##刽 ##剁 ##剂 ##剃 ##則 ##剉 ##削 ##剋 ##剌 ##前 ##剎 ##剐 ##剑 ##剔 ##剖 ##剛 ##剜 ##剝 ##剣 ##剤 ##剥 ##剧 ##剩 ##剪 ##副 ##割 ##創 ##剷 ##剽 ##剿 ##劃 ##劇 ##劈 ##劉 ##劊 ##劍 ##劏 ##劑 ##力 ##劝 ##办 ##功 ##加 ##务 ##劣 ##动 ##助 ##努 ##劫 ##劭 ##励 ##劲 ##劳 ##労 ##劵 ##効 ##劾 ##势 ##勁 ##勃 ##勇 ##勉 ##勋 ##勐 ##勒 ##動 ##勖 ##勘 ##務 ##勛 ##勝 ##勞 ##募 ##勢 ##勤 ##勧 ##勳 ##勵 ##勸 ##勺 ##勻 ##勾 ##勿 ##匀 ##包 ##匆 ##匈 ##匍 ##匐 ##匕 ##化 ##北 ##匙 ##匝 ##匠 ##匡 ##匣 ##匪 ##匮 ##匯 ##匱 ##匹 ##区 ##医 ##匾 ##匿 ##區 ##十 ##千 ##卅 ##升 ##午 ##卉 ##半 ##卍 ##华 ##协 ##卑 ##卒 ##卓 ##協 ##单 ##卖 ##南 ##単 ##博 ##卜 ##卞 ##卟 ##占 ##卡 ##卢 ##卤 ##卦 ##卧 ##卫 ##卮 ##卯 ##印 ##危 ##即 ##却 ##卵 ##卷 ##卸 ##卻 ##卿 ##厂 ##厄 ##厅 ##历 ##厉 ##压 ##厌 ##厕 ##厘 ##厚 ##厝 ##原 ##厢 ##厥 ##厦 ##厨 ##厩 ##厭 ##厮 ##厲 ##厳 ##去 ##县 ##叁 ##参 ##參 ##又 ##叉 ##及 ##友 ##双 ##反 ##収 ##发 ##叔 ##取 ##受 ##变 ##叙 ##叛 ##叟 ##叠 ##叡 ##叢 ##口 ##古 ##句 ##另 ##叨 ##叩 ##只 ##叫 ##召 ##叭 ##叮 ##可 ##台 ##叱 ##史 ##右 ##叵 ##叶 ##号 ##司 ##叹 ##叻 ##叼 ##叽 ##吁 ##吃 ##各 ##吆 ##合 ##吉 ##吊 ##吋 ##同 ##名 ##后 ##吏 ##吐 ##向 ##吒 ##吓 ##吕 ##吖 ##吗 ##君 ##吝 ##吞 ##吟 ##吠 ##吡 ##否 ##吧 ##吨 ##吩 ##含 ##听 ##吭 ##吮 ##启 ##吱 ##吳 ##吴 ##吵 ##吶 ##吸 ##吹 ##吻 ##吼 ##吽 ##吾 ##呀 ##呂 ##呃 ##呆 ##呈 ##告 ##呋 ##呎 ##呐 ##呓 
##呕 ##呗 ##员 ##呛 ##呜 ##呢 ##呤 ##呦 ##周 ##呱 ##呲 ##味 ##呵 ##呷 ##呸 ##呻 ##呼 ##命 ##咀 ##咁 ##咂 ##咄 ##咆 ##咋 ##和 ##咎 ##咏 ##咐 ##咒 ##咔 ##咕 ##咖 ##咗 ##咘 ##咙 ##咚 ##咛 ##咣 ##咤 ##咦 ##咧 ##咨 ##咩 ##咪 ##咫 ##咬 ##咭 ##咯 ##咱 ##咲 ##咳 ##咸 ##咻 ##咽 ##咿 ##哀 ##品 ##哂 ##哄 ##哆 ##哇 ##哈 ##哉 ##哋 ##哌 ##响 ##哎 ##哏 ##哐 ##哑 ##哒 ##哔 ##哗 ##哟 ##員 ##哥 ##哦 ##哧 ##哨 ##哩 ##哪 ##哭 ##哮 ##哲 ##哺 ##哼 ##哽 ##唁 ##唄 ##唆 ##唇 ##唉 ##唏 ##唐 ##唑 ##唔 ##唠 ##唤 ##唧 ##唬 ##售 ##唯 ##唰 ##唱 ##唳 ##唷 ##唸 ##唾 ##啃 ##啄 ##商 ##啉 ##啊 ##問 ##啓 ##啕 ##啖 ##啜 ##啞 ##啟 ##啡 ##啤 ##啥 ##啦 ##啧 ##啪 ##啫 ##啬 ##啮 ##啰 ##啱 ##啲 ##啵 ##啶 ##啷 ##啸 ##啻 ##啼 ##啾 ##喀 ##喂 ##喃 ##善 ##喆 ##喇 ##喉 ##喊 ##喋 ##喎 ##喏 ##喔 ##喘 ##喙 ##喚 ##喜 ##喝 ##喟 ##喧 ##喪 ##喫 ##喬 ##單 ##喰 ##喱 ##喲 ##喳 ##喵 ##営 ##喷 ##喹 ##喺 ##喻 ##喽 ##嗅 ##嗆 ##嗇 ##嗎 ##嗑 ##嗒 ##嗓 ##嗔 ##嗖 ##嗚 ##嗜 ##嗝 ##嗟 ##嗡 ##嗣 ##嗤 ##嗦 ##嗨 ##嗪 ##嗬 ##嗯 ##嗰 ##嗲 ##嗳 ##嗶 ##嗷 ##嗽 ##嘀 ##嘅 ##嘆 ##嘈 ##嘉 ##嘌 ##嘍 ##嘎 ##嘔 ##嘖 ##嘗 ##嘘 ##嘚 ##嘛 ##嘜 ##嘞 ##嘟 ##嘢 ##嘣 ##嘤 ##嘧 ##嘩 ##嘭 ##嘮 ##嘯 ##嘰 ##嘱 ##嘲 ##嘴 ##嘶 ##嘸 ##嘹 ##嘻 ##嘿 ##噁 ##噌 ##噎 ##噓 ##噔 ##噗 ##噙 ##噜 ##噠 ##噢 ##噤 ##器 ##噩 ##噪 ##噬 ##噱 ##噴 ##噶 ##噸 ##噹 ##噻 ##噼 ##嚀 ##嚇 ##嚎 ##嚏 ##嚐 ##嚓 ##嚕 ##嚟 ##嚣 ##嚥 ##嚨 ##嚮 ##嚴 ##嚷 ##嚼 ##囂 ##囉 ##囊 ##囍 ##囑 ##囔 ##囗 ##囚 ##四 ##囝 ##回 ##囟 ##因 ##囡 ##团 ##団 ##囤 ##囧 ##囪 ##囫 ##园 ##困 ##囱 ##囲 ##図 ##围 ##囹 ##固 ##国 ##图 ##囿 ##圃 ##圄 ##圆 ##圈 ##國 ##圍 ##圏 ##園 ##圓 ##圖 ##團 ##圜 ##土 ##圣 ##圧 ##在 ##圩 ##圭 ##地 ##圳 ##场 ##圻 ##圾 ##址 ##坂 ##均 ##坊 ##坍 ##坎 ##坏 ##坐 ##坑 ##块 ##坚 ##坛 ##坝 ##坞 ##坟 ##坠 ##坡 ##坤 ##坦 ##坨 ##坪 ##坯 ##坳 ##坵 ##坷 ##垂 ##垃 ##垄 ##型 ##垒 ##垚 ##垛 ##垠 ##垢 ##垣 ##垦 ##垩 ##垫 ##垭 ##垮 ##垵 ##埂 ##埃 ##埋 ##城 ##埔 ##埕 ##埗 ##域 ##埠 ##埤 ##埵 ##執 ##埸 ##培 ##基 ##埼 ##堀 ##堂 ##堃 ##堅 ##堆 ##堇 ##堑 ##堕 ##堙 ##堡 ##堤 ##堪 ##堯 ##堰 ##報 ##場 ##堵 ##堺 ##堿 ##塊 ##塌 ##塑 ##塔 ##塗 ##塘 ##塚 ##塞 ##塢 ##塩 ##填 ##塬 ##塭 ##塵 ##塾 ##墀 ##境 ##墅 ##墉 ##墊 ##墒 ##墓 ##増 ##墘 ##墙 ##墜 ##增 ##墟 ##墨 ##墩 ##墮 ##墳 ##墻 ##墾 ##壁 ##壅 ##壆 ##壇 ##壊 ##壑 ##壓 ##壕 ##壘 ##壞 ##壟 ##壢 ##壤 ##壩 ##士 ##壬 ##壮 ##壯 ##声 ##売 ##壳 ##壶 ##壹 ##壺 ##壽 ##处 ##备 ##変 ##复 ##夏 ##夔 ##夕 ##外 ##夙 ##多 ##夜 ##够 ##夠 ##夢 ##夥 ##大 ##天 ##太 ##夫 ##夭 ##央 ##夯 ##失 ##头 ##夷 ##夸 ##夹 ##夺 ##夾 ##奂 ##奄 ##奇 ##奈 ##奉 ##奋 ##奎 ##奏 ##奐 ##契 ##奔 
##奕 ##奖 ##套 ##奘 ##奚 ##奠 ##奢 ##奥 ##奧 ##奪 ##奬 ##奮 ##女 ##奴 ##奶 ##奸 ##她 ##好 ##如 ##妃 ##妄 ##妆 ##妇 ##妈 ##妊 ##妍 ##妒 ##妓 ##妖 ##妘 ##妙 ##妝 ##妞 ##妣 ##妤 ##妥 ##妨 ##妩 ##妪 ##妮 ##妲 ##妳 ##妹 ##妻 ##妾 ##姆 ##姉 ##姊 ##始 ##姍 ##姐 ##姑 ##姒 ##姓 ##委 ##姗 ##姚 ##姜 ##姝 ##姣 ##姥 ##姦 ##姨 ##姪 ##姫 ##姬 ##姹 ##姻 ##姿 ##威 ##娃 ##娄 ##娅 ##娆 ##娇 ##娉 ##娑 ##娓 ##娘 ##娛 ##娜 ##娟 ##娠 ##娣 ##娥 ##娩 ##娱 ##娲 ##娴 ##娶 ##娼 ##婀 ##婁 ##婆 ##婉 ##婊 ##婕 ##婚 ##婢 ##婦 ##婧 ##婪 ##婭 ##婴 ##婵 ##婶 ##婷 ##婺 ##婿 ##媒 ##媚 ##媛 ##媞 ##媧 ##媲 ##媳 ##媽 ##媾 ##嫁 ##嫂 ##嫉 ##嫌 ##嫑 ##嫔 ##嫖 ##嫘 ##嫚 ##嫡 ##嫣 ##嫦 ##嫩 ##嫲 ##嫵 ##嫻 ##嬅 ##嬉 ##嬌 ##嬗 ##嬛 ##嬢 ##嬤 ##嬪 ##嬰 ##嬴 ##嬷 ##嬸 ##嬿 ##孀 ##孃 ##子 ##孑 ##孔 ##孕 ##孖 ##字 ##存 ##孙 ##孚 ##孛 ##孜 ##孝 ##孟 ##孢 ##季 ##孤 ##学 ##孩 ##孪 ##孫 ##孬 ##孰 ##孱 ##孳 ##孵 ##學 ##孺 ##孽 ##孿 ##宁 ##它 ##宅 ##宇 ##守 ##安 ##宋 ##完 ##宏 ##宓 ##宕 ##宗 ##官 ##宙 ##定 ##宛 ##宜 ##宝 ##实 ##実 ##宠 ##审 ##客 ##宣 ##室 ##宥 ##宦 ##宪 ##宫 ##宮 ##宰 ##害 ##宴 ##宵 ##家 ##宸 ##容 ##宽 ##宾 ##宿 ##寂 ##寄 ##寅 ##密 ##寇 ##富 ##寐 ##寒 ##寓 ##寛 ##寝 ##寞 ##察 ##寡 ##寢 ##寥 ##實 ##寧 ##寨 ##審 ##寫 ##寬 ##寮 ##寰 ##寵 ##寶 ##寸 ##对 ##寺 ##寻 ##导 ##対 ##寿 ##封 ##専 ##射 ##将 ##將 ##專 ##尉 ##尊 ##尋 ##對 ##導 ##小 ##少 ##尔 ##尕 ##尖 ##尘 ##尚 ##尝 ##尤 ##尧 ##尬 ##就 ##尴 ##尷 ##尸 ##尹 ##尺 ##尻 ##尼 ##尽 ##尾 ##尿 ##局 ##屁 ##层 ##屄 ##居 ##屆 ##屈 ##屉 ##届 ##屋 ##屌 ##屍 ##屎 ##屏 ##屐 ##屑 ##展 ##屜 ##属 ##屠 ##屡 ##屢 ##層 ##履 ##屬 ##屯 ##山 ##屹 ##屿 ##岀 ##岁 ##岂 ##岌 ##岐 ##岑 ##岔 ##岖 ##岗 ##岘 ##岙 ##岚 ##岛 ##岡 ##岩 ##岫 ##岬 ##岭 ##岱 ##岳 ##岷 ##岸 ##峇 ##峋 ##峒 ##峙 ##峡 ##峤 ##峥 ##峦 ##峨 ##峪 ##峭 ##峯 ##峰 ##峴 ##島 ##峻 ##峽 ##崁 ##崂 ##崆 ##崇 ##崎 ##崑 ##崔 ##崖 ##崗 ##崙 ##崛 ##崧 ##崩 ##崭 ##崴 ##崽 ##嵇 ##嵊 ##嵋 ##嵌 ##嵐 ##嵘 ##嵩 ##嵬 ##嵯 ##嶂 ##嶄 ##嶇 ##嶋 ##嶙 ##嶺 ##嶼 ##嶽 ##巅 ##巍 ##巒 ##巔 ##巖 ##川 ##州 ##巡 ##巢 ##工 ##左 ##巧 ##巨 ##巩 ##巫 ##差 ##己 ##已 ##巳 ##巴 ##巷 ##巻 ##巽 ##巾 ##巿 ##币 ##市 ##布 ##帅 ##帆 ##师 ##希 ##帐 ##帑 ##帕 ##帖 ##帘 ##帚 ##帛 ##帜 ##帝 ##帥 ##带 ##帧 ##師 ##席 ##帮 ##帯 ##帰 ##帳 ##帶 ##帷 ##常 ##帼 ##帽 ##幀 ##幂 ##幄 ##幅 ##幌 ##幔 ##幕 ##幟 ##幡 ##幢 ##幣 ##幫 ##干 ##平 ##年 ##并 ##幸 ##幹 ##幺 ##幻 ##幼 ##幽 ##幾 ##广 ##庁 ##広 ##庄 ##庆 ##庇 ##床 ##序 ##庐 ##库 ##应 ##底 ##庖 ##店 ##庙 ##庚 ##府 ##庞 ##废 ##庠 ##度 ##座 ##庫 ##庭 ##庵 ##庶 ##康 ##庸 ##庹 ##庾 ##廁 ##廂 ##廃 ##廈 ##廉 ##廊 ##廓 
##廖 ##廚 ##廝 ##廟 ##廠 ##廢 ##廣 ##廬 ##廳 ##延 ##廷 ##建 ##廿 ##开 ##弁 ##异 ##弃 ##弄 ##弈 ##弊 ##弋 ##式 ##弑 ##弒 ##弓 ##弔 ##引 ##弗 ##弘 ##弛 ##弟 ##张 ##弥 ##弦 ##弧 ##弩 ##弭 ##弯 ##弱 ##張 ##強 ##弹 ##强 ##弼 ##弾 ##彅 ##彆 ##彈 ##彌 ##彎 ##归 ##当 ##录 ##彗 ##彙 ##彝 ##形 ##彤 ##彥 ##彦 ##彧 ##彩 ##彪 ##彫 ##彬 ##彭 ##彰 ##影 ##彷 ##役 ##彻 ##彼 ##彿 ##往 ##征 ##径 ##待 ##徇 ##很 ##徉 ##徊 ##律 ##後 ##徐 ##徑 ##徒 ##従 ##徕 ##得 ##徘 ##徙 ##徜 ##從 ##徠 ##御 ##徨 ##復 ##循 ##徬 ##微 ##徳 ##徴 ##徵 ##德 ##徹 ##徼 ##徽 ##心 ##必 ##忆 ##忌 ##忍 ##忏 ##忐 ##忑 ##忒 ##忖 ##志 ##忘 ##忙 ##応 ##忠 ##忡 ##忤 ##忧 ##忪 ##快 ##忱 ##念 ##忻 ##忽 ##忿 ##怀 ##态 ##怂 ##怅 ##怆 ##怎 ##怏 ##怒 ##怔 ##怕 ##怖 ##怙 ##怜 ##思 ##怠 ##怡 ##急 ##怦 ##性 ##怨 ##怪 ##怯 ##怵 ##总 ##怼 ##恁 ##恃 ##恆 ##恋 ##恍 ##恐 ##恒 ##恕 ##恙 ##恚 ##恢 ##恣 ##恤 ##恥 ##恨 ##恩 ##恪 ##恫 ##恬 ##恭 ##息 ##恰 ##恳 ##恵 ##恶 ##恸 ##恺 ##恻 ##恼 ##恿 ##悄 ##悅 ##悉 ##悌 ##悍 ##悔 ##悖 ##悚 ##悟 ##悠 ##患 ##悦 ##您 ##悩 ##悪 ##悬 ##悯 ##悱 ##悲 ##悴 ##悵 ##悶 ##悸 ##悻 ##悼 ##悽 ##情 ##惆 ##惇 ##惊 ##惋 ##惑 ##惕 ##惘 ##惚 ##惜 ##惟 ##惠 ##惡 ##惦 ##惧 ##惨 ##惩 ##惫 ##惬 ##惭 ##惮 ##惯 ##惰 ##惱 ##想 ##惴 ##惶 ##惹 ##惺 ##愁 ##愆 ##愈 ##愉 ##愍 ##意 ##愕 ##愚 ##愛 ##愜 ##感 ##愣 ##愤 ##愧 ##愫 ##愷 ##愿 ##慄 ##慈 ##態 ##慌 ##慎 ##慑 ##慕 ##慘 ##慚 ##慟 ##慢 ##慣 ##慧 ##慨 ##慫 ##慮 ##慰 ##慳 ##慵 ##慶 ##慷 ##慾 ##憂 ##憊 ##憋 ##憎 ##憐 ##憑 ##憔 ##憚 ##憤 ##憧 ##憨 ##憩 ##憫 ##憬 ##憲 ##憶 ##憾 ##懂 ##懇 ##懈 ##應 ##懊 ##懋 ##懑 ##懒 ##懦 ##懲 ##懵 ##懶 ##懷 ##懸 ##懺 ##懼 ##懾 ##懿 ##戀 ##戈 ##戊 ##戌 ##戍 ##戎 ##戏 ##成 ##我 ##戒 ##戕 ##或 ##战 ##戚 ##戛 ##戟 ##戡 ##戦 ##截 ##戬 ##戮 ##戰 ##戲 ##戳 ##戴 ##戶 ##户 ##戸 ##戻 ##戾 ##房 ##所 ##扁 ##扇 ##扈 ##扉 ##手 ##才 ##扎 ##扑 ##扒 ##打 ##扔 ##払 ##托 ##扛 ##扣 ##扦 ##执 ##扩 ##扪 ##扫 ##扬 ##扭 ##扮 ##扯 ##扰 ##扱 ##扳 ##扶 ##批 ##扼 ##找 ##承 ##技 ##抄 ##抉 ##把 ##抑 ##抒 ##抓 ##投 ##抖 ##抗 ##折 ##抚 ##抛 ##抜 ##択 ##抟 ##抠 ##抡 ##抢 ##护 ##报 ##抨 ##披 ##抬 ##抱 ##抵 ##抹 ##押 ##抽 ##抿 ##拂 ##拄 ##担 ##拆 ##拇 ##拈 ##拉 ##拋 ##拌 ##拍 ##拎 ##拐 ##拒 ##拓 ##拔 ##拖 ##拗 ##拘 ##拙 ##拚 ##招 ##拜 ##拟 ##拡 ##拢 ##拣 ##拥 ##拦 ##拧 ##拨 ##择 ##括 ##拭 ##拮 ##拯 ##拱 ##拳 ##拴 ##拷 ##拼 ##拽 ##拾 ##拿 ##持 ##挂 ##指 ##挈 ##按 ##挎 ##挑 ##挖 ##挙 ##挚 ##挛 ##挝 ##挞 ##挟 ##挠 ##挡 ##挣 ##挤 ##挥 ##挨 ##挪 ##挫 ##振 ##挲 ##挹 ##挺 ##挽 ##挾 ##捂 ##捅 ##捆 ##捉 ##捋 ##捌 ##捍 ##捎 ##捏 ##捐 ##捕 ##捞 ##损 ##捡 ##换 ##捣 ##捧 ##捨 ##捩 
##据 ##捱 ##捲 ##捶 ##捷 ##捺 ##捻 ##掀 ##掂 ##掃 ##掇 ##授 ##掉 ##掌 ##掏 ##掐 ##排 ##掖 ##掘 ##掙 ##掛 ##掠 ##採 ##探 ##掣 ##接 ##控 ##推 ##掩 ##措 ##掬 ##掰 ##掲 ##掳 ##掴 ##掷 ##掸 ##掺 ##揀 ##揃 ##揄 ##揆 ##揉 ##揍 ##描 ##提 ##插 ##揖 ##揚 ##換 ##握 ##揣 ##揩 ##揪 ##揭 ##揮 ##援 ##揶 ##揸 ##揹 ##揽 ##搀 ##搁 ##搂 ##搅 ##損 ##搏 ##搐 ##搓 ##搔 ##搖 ##搗 ##搜 ##搞 ##搡 ##搪 ##搬 ##搭 ##搵 ##搶 ##携 ##搽 ##摀 ##摁 ##摄 ##摆 ##摇 ##摈 ##摊 ##摒 ##摔 ##摘 ##摞 ##摟 ##摧 ##摩 ##摯 ##摳 ##摸 ##摹 ##摺 ##摻 ##撂 ##撃 ##撅 ##撇 ##撈 ##撐 ##撑 ##撒 ##撓 ##撕 ##撚 ##撞 ##撤 ##撥 ##撩 ##撫 ##撬 ##播 ##撮 ##撰 ##撲 ##撵 ##撷 ##撸 ##撻 ##撼 ##撿 ##擀 ##擁 ##擂 ##擄 ##擅 ##擇 ##擊 ##擋 ##操 ##擎 ##擒 ##擔 ##擘 ##據 ##擞 ##擠 ##擡 ##擢 ##擦 ##擬 ##擰 ##擱 ##擲 ##擴 ##擷 ##擺 ##擼 ##擾 ##攀 ##攏 ##攒 ##攔 ##攘 ##攙 ##攜 ##攝 ##攞 ##攢 ##攣 ##攤 ##攥 ##攪 ##攫 ##攬 ##支 ##收 ##攸 ##改 ##攻 ##放 ##政 ##故 ##效 ##敌 ##敍 ##敎 ##敏 ##救 ##敕 ##敖 ##敗 ##敘 ##教 ##敛 ##敝 ##敞 ##敢 ##散 ##敦 ##敬 ##数 ##敲 ##整 ##敵 ##敷 ##數 ##斂 ##斃 ##文 ##斋 ##斌 ##斎 ##斐 ##斑 ##斓 ##斗 ##料 ##斛 ##斜 ##斟 ##斡 ##斤 ##斥 ##斧 ##斩 ##斫 ##斬 ##断 ##斯 ##新 ##斷 ##方 ##於 ##施 ##旁 ##旃 ##旅 ##旋 ##旌 ##旎 ##族 ##旖 ##旗 ##无 ##既 ##日 ##旦 ##旧 ##旨 ##早 ##旬 ##旭 ##旮 ##旱 ##时 ##旷 ##旺 ##旻 ##昀 ##昂 ##昆 ##昇 ##昉 ##昊 ##昌 ##明 ##昏 ##易 ##昔 ##昕 ##昙 ##星 ##映 ##春 ##昧 ##昨 ##昭 ##是 ##昱 ##昴 ##昵 ##昶 ##昼 ##显 ##晁 ##時 ##晃 ##晉 ##晋 ##晌 ##晏 ##晒 ##晓 ##晔 ##晕 ##晖 ##晗 ##晚 ##晝 ##晞 ##晟 ##晤 ##晦 ##晨 ##晩 ##普 ##景 ##晰 ##晴 ##晶 ##晷 ##智 ##晾 ##暂 ##暄 ##暇 ##暈 ##暉 ##暌 ##暐 ##暑 ##暖 ##暗 ##暝 ##暢 ##暧 ##暨 ##暫 ##暮 ##暱 ##暴 ##暸 ##暹 ##曄 ##曆 ##曇 ##曉 ##曖 ##曙 ##曜 ##曝 ##曠 ##曦 ##曬 ##曰 ##曲 ##曳 ##更 ##書 ##曹 ##曼 ##曾 ##替 ##最 ##會 ##月 ##有 ##朋 ##服 ##朐 ##朔 ##朕 ##朗 ##望 ##朝 ##期 ##朦 ##朧 ##木 ##未 ##末 ##本 ##札 ##朮 ##术 ##朱 ##朴 ##朵 ##机 ##朽 ##杀 ##杂 ##权 ##杆 ##杈 ##杉 ##李 ##杏 ##材 ##村 ##杓 ##杖 ##杜 ##杞 ##束 ##杠 ##条 ##来 ##杨 ##杭 ##杯 ##杰 ##東 ##杳 ##杵 ##杷 ##杼 ##松 ##板 ##极 ##构 ##枇 ##枉 ##枋 ##析 ##枕 ##林 ##枚 ##果 ##枝 ##枢 ##枣 ##枪 ##枫 ##枭 ##枯 ##枰 ##枱 ##枳 ##架 ##枷 ##枸 ##柄 ##柏 ##某 ##柑 ##柒 ##染 ##柔 ##柘 ##柚 ##柜 ##柞 ##柠 ##柢 ##查 ##柩 ##柬 ##柯 ##柱 ##柳 ##柴 ##柵 ##査 ##柿 ##栀 ##栃 ##栄 ##栅 ##标 ##栈 ##栉 ##栋 ##栎 ##栏 ##树 ##栓 ##栖 ##栗 ##校 ##栩 ##株 ##样 ##核 ##根 ##格 ##栽 ##栾 ##桀 ##桁 ##桂 ##桃 ##桅 ##框 ##案 ##桉 ##桌 ##桎 ##桐 ##桑 ##桓 ##桔 ##桜 ##桠 ##桡 ##桢 ##档 ##桥 ##桦 ##桧 ##桨 
##桩 ##桶 ##桿 ##梁 ##梅 ##梆 ##梏 ##梓 ##梗 ##條 ##梟 ##梢 ##梦 ##梧 ##梨 ##梭 ##梯 ##械 ##梳 ##梵 ##梶 ##检 ##棂 ##棄 ##棉 ##棋 ##棍 ##棒 ##棕 ##棗 ##棘 ##棚 ##棟 ##棠 ##棣 ##棧 ##森 ##棱 ##棲 ##棵 ##棹 ##棺 ##椁 ##椅 ##椋 ##植 ##椎 ##椒 ##検 ##椪 ##椭 ##椰 ##椹 ##椽 ##椿 ##楂 ##楊 ##楓 ##楔 ##楚 ##楝 ##楞 ##楠 ##楣 ##楨 ##楫 ##業 ##楮 ##極 ##楷 ##楸 ##楹 ##楼 ##楽 ##概 ##榄 ##榆 ##榈 ##榉 ##榔 ##榕 ##榖 ##榛 ##榜 ##榨 ##榫 ##榭 ##榮 ##榱 ##榴 ##榷 ##榻 ##槁 ##槃 ##構 ##槌 ##槍 ##槎 ##槐 ##槓 ##様 ##槛 ##槟 ##槤 ##槭 ##槲 ##槳 ##槻 ##槽 ##槿 ##樁 ##樂 ##樊 ##樑 ##樓 ##標 ##樞 ##樟 ##模 ##樣 ##権 ##横 ##樫 ##樯 ##樱 ##樵 ##樸 ##樹 ##樺 ##樽 ##樾 ##橄 ##橇 ##橋 ##橐 ##橘 ##橙 ##機 ##橡 ##橢 ##橫 ##橱 ##橹 ##橼 ##檀 ##檄 ##檎 ##檐 ##檔 ##檗 ##檜 ##檢 ##檬 ##檯 ##檳 ##檸 ##檻 ##櫃 ##櫚 ##櫛 ##櫥 ##櫸 ##櫻 ##欄 ##權 ##欒 ##欖 ##欠 ##次 ##欢 ##欣 ##欧 ##欲 ##欸 ##欺 ##欽 ##款 ##歆 ##歇 ##歉 ##歌 ##歎 ##歐 ##歓 ##歙 ##歛 ##歡 ##止 ##正 ##此 ##步 ##武 ##歧 ##歩 ##歪 ##歯 ##歲 ##歳 ##歴 ##歷 ##歸 ##歹 ##死 ##歼 ##殁 ##殃 ##殆 ##殇 ##殉 ##殊 ##残 ##殒 ##殓 ##殖 ##殘 ##殞 ##殡 ##殤 ##殭 ##殯 ##殲 ##殴 ##段 ##殷 ##殺 ##殼 ##殿 ##毀 ##毁 ##毂 ##毅 ##毆 ##毋 ##母 ##毎 ##每 ##毒 ##毓 ##比 ##毕 ##毗 ##毘 ##毙 ##毛 ##毡 ##毫 ##毯 ##毽 ##氈 ##氏 ##氐 ##民 ##氓 ##气 ##氖 ##気 ##氙 ##氛 ##氟 ##氡 ##氢 ##氣 ##氤 ##氦 ##氧 ##氨 ##氪 ##氫 ##氮 ##氯 ##氰 ##氲 ##水 ##氷 ##永 ##氹 ##氾 ##汀 ##汁 ##求 ##汆 ##汇 ##汉 ##汎 ##汐 ##汕 ##汗 ##汙 ##汛 ##汝 ##汞 ##江 ##池 ##污 ##汤 ##汨 ##汩 ##汪 ##汰 ##汲 ##汴 ##汶 ##汹 ##決 ##汽 ##汾 ##沁 ##沂 ##沃 ##沅 ##沈 ##沉 ##沌 ##沏 ##沐 ##沒 ##沓 ##沖 ##沙 ##沛 ##沟 ##没 ##沢 ##沣 ##沥 ##沦 ##沧 ##沪 ##沫 ##沭 ##沮 ##沱 ##河 ##沸 ##油 ##治 ##沼 ##沽 ##沾 ##沿 ##況 ##泄 ##泉 ##泊 ##泌 ##泓 ##法 ##泗 ##泛 ##泞 ##泠 ##泡 ##波 ##泣 ##泥 ##注 ##泪 ##泫 ##泮 ##泯 ##泰 ##泱 ##泳 ##泵 ##泷 ##泸 ##泻 ##泼 ##泽 ##泾 ##洁 ##洄 ##洋 ##洒 ##洗 ##洙 ##洛 ##洞 ##津 ##洩 ##洪 ##洮 ##洱 ##洲 ##洵 ##洶 ##洸 ##洹 ##活 ##洼 ##洽 ##派 ##流 ##浃 ##浄 ##浅 ##浆 ##浇 ##浊 ##测 ##济 ##浏 ##浑 ##浒 ##浓 ##浔 ##浙 ##浚 ##浜 ##浣 ##浦 ##浩 ##浪 ##浬 ##浮 ##浯 ##浴 ##海 ##浸 ##涂 ##涅 ##涇 ##消 ##涉 ##涌 ##涎 ##涓 ##涔 ##涕 ##涙 ##涛 ##涝 ##涞 ##涟 ##涠 ##涡 ##涣 ##涤 ##润 ##涧 ##涨 ##涩 ##涪 ##涮 ##涯 ##液 ##涵 ##涸 ##涼 ##涿 ##淀 ##淄 ##淅 ##淆 ##淇 ##淋 ##淌 ##淑 ##淒 ##淖 ##淘 ##淙 ##淚 ##淞 ##淡 ##淤 ##淦 ##淨 ##淩 ##淪 ##淫 ##淬 ##淮 ##深 ##淳 ##淵 ##混 ##淹 ##淺 ##添 ##淼 ##清 ##済 ##渉 ##渊 ##渋 ##渍 ##渎 ##渐 ##渔 ##渗 ##渙 ##渚 ##減 ##渝 ##渠 ##渡 ##渣 ##渤 ##渥 
##渦 ##温 ##測 ##渭 ##港 ##渲 ##渴 ##游 ##渺 ##渾 ##湃 ##湄 ##湊 ##湍 ##湖 ##湘 ##湛 ##湟 ##湧 ##湫 ##湮 ##湯 ##湳 ##湾 ##湿 ##満 ##溃 ##溅 ##溉 ##溏 ##源 ##準 ##溜 ##溝 ##溟 ##溢 ##溥 ##溧 ##溪 ##溫 ##溯 ##溱 ##溴 ##溶 ##溺 ##溼 ##滁 ##滂 ##滄 ##滅 ##滇 ##滋 ##滌 ##滑 ##滓 ##滔 ##滕 ##滙 ##滚 ##滝 ##滞 ##滟 ##满 ##滢 ##滤 ##滥 ##滦 ##滨 ##滩 ##滬 ##滯 ##滲 ##滴 ##滷 ##滸 ##滾 ##滿 ##漁 ##漂 ##漆 ##漉 ##漏 ##漓 ##演 ##漕 ##漠 ##漢 ##漣 ##漩 ##漪 ##漫 ##漬 ##漯 ##漱 ##漲 ##漳 ##漸 ##漾 ##漿 ##潆 ##潇 ##潋 ##潍 ##潑 ##潔 ##潘 ##潛 ##潜 ##潞 ##潟 ##潢 ##潤 ##潦 ##潧 ##潭 ##潮 ##潰 ##潴 ##潸 ##潺 ##潼 ##澀 ##澄 ##澆 ##澈 ##澍 ##澎 ##澗 ##澜 ##澡 ##澤 ##澧 ##澱 ##澳 ##澹 ##激 ##濁 ##濂 ##濃 ##濑 ##濒 ##濕 ##濘 ##濛 ##濟 ##濠 ##濡 ##濤 ##濫 ##濬 ##濮 ##濯 ##濱 ##濺 ##濾 ##瀅 ##瀆 ##瀉 ##瀋 ##瀏 ##瀑 ##瀕 ##瀘 ##瀚 ##瀛 ##瀝 ##瀞 ##瀟 ##瀧 ##瀨 ##瀬 ##瀰 ##瀾 ##灌 ##灏 ##灑 ##灘 ##灝 ##灞 ##灣 ##火 ##灬 ##灭 ##灯 ##灰 ##灵 ##灶 ##灸 ##灼 ##災 ##灾 ##灿 ##炀 ##炁 ##炅 ##炉 ##炊 ##炎 ##炒 ##炔 ##炕 ##炖 ##炙 ##炜 ##炫 ##炬 ##炭 ##炮 ##炯 ##炳 ##炷 ##炸 ##点 ##為 ##炼 ##炽 ##烁 ##烂 ##烃 ##烈 ##烊 ##烏 ##烘 ##烙 ##烛 ##烟 ##烤 ##烦 ##烧 ##烨 ##烩 ##烫 ##烬 ##热 ##烯 ##烷 ##烹 ##烽 ##焉 ##焊 ##焕 ##焖 ##焗 ##焘 ##焙 ##焚 ##焜 ##無 ##焦 ##焯 ##焰 ##焱 ##然 ##焼 ##煅 ##煉 ##煊 ##煌 ##煎 ##煒 ##煖 ##煙 ##煜 ##煞 ##煤 ##煥 ##煦 ##照 ##煨 ##煩 ##煮 ##煲 ##煸 ##煽 ##熄 ##熊 ##熏 ##熒 ##熔 ##熙 ##熟 ##熠 ##熨 ##熬 ##熱 ##熵 ##熹 ##熾 ##燁 ##燃 ##燄 ##燈 ##燉 ##燊 ##燎 ##燒 ##燔 ##燕 ##燙 ##燜 ##營 ##燥 ##燦 ##燧 ##燭 ##燮 ##燴 ##燻 ##燼 ##燿 ##爆 ##爍 ##爐 ##爛 ##爪 ##爬 ##爭 ##爰 ##爱 ##爲 ##爵 ##父 ##爷 ##爸 ##爹 ##爺 ##爻 ##爽 ##爾 ##牆 ##片 ##版 ##牌 ##牍 ##牒 ##牙 ##牛 ##牝 ##牟 ##牠 ##牡 ##牢 ##牦 ##牧 ##物 ##牯 ##牲 ##牴 ##牵 ##特 ##牺 ##牽 ##犀 ##犁 ##犄 ##犊 ##犍 ##犒 ##犢 ##犧 ##犬 ##犯 ##状 ##犷 ##犸 ##犹 ##狀 ##狂 ##狄 ##狈 ##狎 ##狐 ##狒 ##狗 ##狙 ##狞 ##狠 ##狡 ##狩 ##独 ##狭 ##狮 ##狰 ##狱 ##狸 ##狹 ##狼 ##狽 ##猎 ##猕 ##猖 ##猗 ##猙 ##猛 ##猜 ##猝 ##猥 ##猩 ##猪 ##猫 ##猬 ##献 ##猴 ##猶 ##猷 ##猾 ##猿 ##獄 ##獅 ##獎 ##獐 ##獒 ##獗 ##獠 ##獣 ##獨 ##獭 ##獰 ##獲 ##獵 ##獷 ##獸 ##獺 ##獻 ##獼 ##獾 ##玄 ##率 ##玉 ##王 ##玑 ##玖 ##玛 ##玟 ##玠 ##玥 ##玩 ##玫 ##玮 ##环 ##现 ##玲 ##玳 ##玷 ##玺 ##玻 ##珀 ##珂 ##珅 ##珈 ##珉 ##珊 ##珍 ##珏 ##珐 ##珑 ##珙 ##珞 ##珠 ##珣 ##珥 ##珩 ##珪 ##班 ##珮 ##珲 ##珺 ##現 ##球 ##琅 ##理 ##琇 ##琉 ##琊 ##琍 ##琏 ##琐 ##琛 ##琢 ##琥 ##琦 ##琨 ##琪 ##琬 ##琮 ##琰 ##琲 ##琳 ##琴 ##琵 ##琶 ##琺 ##琼 ##瑀 ##瑁 ##瑄 ##瑋 ##瑕 ##瑗 ##瑙 
##瑚 ##瑛 ##瑜 ##瑞 ##瑟 ##瑠 ##瑣 ##瑤 ##瑩 ##瑪 ##瑯 ##瑰 ##瑶 ##瑾 ##璀 ##璁 ##璃 ##璇 ##璉 ##璋 ##璎 ##璐 ##璜 ##璞 ##璟 ##璧 ##璨 ##環 ##璽 ##璿 ##瓊 ##瓏 ##瓒 ##瓜 ##瓢 ##瓣 ##瓤 ##瓦 ##瓮 ##瓯 ##瓴 ##瓶 ##瓷 ##甄 ##甌 ##甕 ##甘 ##甙 ##甚 ##甜 ##生 ##產 ##産 ##甥 ##甦 ##用 ##甩 ##甫 ##甬 ##甭 ##甯 ##田 ##由 ##甲 ##申 ##电 ##男 ##甸 ##町 ##画 ##甾 ##畀 ##畅 ##界 ##畏 ##畑 ##畔 ##留 ##畜 ##畝 ##畢 ##略 ##畦 ##番 ##畫 ##異 ##畲 ##畳 ##畴 ##當 ##畸 ##畹 ##畿 ##疆 ##疇 ##疊 ##疏 ##疑 ##疔 ##疖 ##疗 ##疙 ##疚 ##疝 ##疟 ##疡 ##疣 ##疤 ##疥 ##疫 ##疮 ##疯 ##疱 ##疲 ##疳 ##疵 ##疸 ##疹 ##疼 ##疽 ##疾 ##痂 ##病 ##症 ##痈 ##痉 ##痊 ##痍 ##痒 ##痔 ##痕 ##痘 ##痙 ##痛 ##痞 ##痠 ##痢 ##痣 ##痤 ##痧 ##痨 ##痪 ##痫 ##痰 ##痱 ##痴 ##痹 ##痺 ##痼 ##痿 ##瘀 ##瘁 ##瘋 ##瘍 ##瘓 ##瘘 ##瘙 ##瘟 ##瘠 ##瘡 ##瘢 ##瘤 ##瘦 ##瘧 ##瘩 ##瘪 ##瘫 ##瘴 ##瘸 ##瘾 ##療 ##癇 ##癌 ##癒 ##癖 ##癜 ##癞 ##癡 ##癢 ##癣 ##癥 ##癫 ##癬 ##癮 ##癱 ##癲 ##癸 ##発 ##登 ##發 ##白 ##百 ##皂 ##的 ##皆 ##皇 ##皈 ##皋 ##皎 ##皑 ##皓 ##皖 ##皙 ##皚 ##皮 ##皰 ##皱 ##皴 ##皺 ##皿 ##盂 ##盃 ##盅 ##盆 ##盈 ##益 ##盎 ##盏 ##盐 ##监 ##盒 ##盔 ##盖 ##盗 ##盘 ##盛 ##盜 ##盞 ##盟 ##盡 ##監 ##盤 ##盥 ##盧 ##盪 ##目 ##盯 ##盱 ##盲 ##直 ##相 ##盹 ##盼 ##盾 ##省 ##眈 ##眉 ##看 ##県 ##眙 ##眞 ##真 ##眠 ##眦 ##眨 ##眩 ##眯 ##眶 ##眷 ##眸 ##眺 ##眼 ##眾 ##着 ##睁 ##睇 ##睏 ##睐 ##睑 ##睛 ##睜 ##睞 ##睡 ##睢 ##督 ##睥 ##睦 ##睨 ##睪 ##睫 ##睬 ##睹 ##睽 ##睾 ##睿 ##瞄 ##瞅 ##瞇 ##瞋 ##瞌 ##瞎 ##瞑 ##瞒 ##瞓 ##瞞 ##瞟 ##瞠 ##瞥 ##瞧 ##瞩 ##瞪 ##瞬 ##瞭 ##瞰 ##瞳 ##瞻 ##瞼 ##瞿 ##矇 ##矍 ##矗 ##矚 ##矛 ##矜 ##矢 ##矣 ##知 ##矩 ##矫 ##短 ##矮 ##矯 ##石 ##矶 ##矽 ##矾 ##矿 ##码 ##砂 ##砌 ##砍 ##砒 ##研 ##砖 ##砗 ##砚 ##砝 ##砣 ##砥 ##砧 ##砭 ##砰 ##砲 ##破 ##砷 ##砸 ##砺 ##砼 ##砾 ##础 ##硅 ##硐 ##硒 ##硕 ##硝 ##硫 ##硬 ##确 ##硯 ##硼 ##碁 ##碇 ##碉 ##碌 ##碍 ##碎 ##碑 ##碓 ##碗 ##碘 ##碚 ##碛 ##碟 ##碣 ##碧 ##碩 ##碰 ##碱 ##碳 ##碴 ##確 ##碼 ##碾 ##磁 ##磅 ##磊 ##磋 ##磐 ##磕 ##磚 ##磡 ##磨 ##磬 ##磯 ##磲 ##磷 ##磺 ##礁 ##礎 ##礙 ##礡 ##礦 ##礪 ##礫 ##礴 ##示 ##礼 ##社 ##祀 ##祁 ##祂 ##祇 ##祈 ##祉 ##祎 ##祐 ##祕 ##祖 ##祗 ##祚 ##祛 ##祜 ##祝 ##神 ##祟 ##祠 ##祢 ##祥 ##票 ##祭 ##祯 ##祷 ##祸 ##祺 ##祿 ##禀 ##禁 ##禄 ##禅 ##禍 ##禎 ##福 ##禛 ##禦 ##禧 ##禪 ##禮 ##禱 ##禹 ##禺 ##离 ##禽 ##禾 ##禿 ##秀 ##私 ##秃 ##秆 ##秉 ##秋 ##种 ##科 ##秒 ##秘 ##租 ##秣 ##秤 ##秦 ##秧 ##秩 ##秭 ##积 ##称 ##秸 ##移 ##秽 ##稀 ##稅 ##程 ##稍 ##税 ##稔 ##稗 ##稚 ##稜 ##稞 ##稟 ##稠 ##稣 ##種 ##稱 ##稲 ##稳 ##稷 ##稹 ##稻 ##稼 ##稽 ##稿 ##穀 
##穂 ##穆 ##穌 ##積 ##穎 ##穗 ##穢 ##穩 ##穫 ##穴 ##究 ##穷 ##穹 ##空 ##穿 ##突 ##窃 ##窄 ##窈 ##窍 ##窑 ##窒 ##窓 ##窕 ##窖 ##窗 ##窘 ##窜 ##窝 ##窟 ##窠 ##窥 ##窦 ##窨 ##窩 ##窪 ##窮 ##窯 ##窺 ##窿 ##竄 ##竅 ##竇 ##竊 ##立 ##竖 ##站 ##竜 ##竞 ##竟 ##章 ##竣 ##童 ##竭 ##端 ##競 ##竹 ##竺 ##竽 ##竿 ##笃 ##笆 ##笈 ##笋 ##笏 ##笑 ##笔 ##笙 ##笛 ##笞 ##笠 ##符 ##笨 ##第 ##笹 ##笺 ##笼 ##筆 ##等 ##筊 ##筋 ##筍 ##筏 ##筐 ##筑 ##筒 ##答 ##策 ##筛 ##筝 ##筠 ##筱 ##筲 ##筵 ##筷 ##筹 ##签 ##简 ##箇 ##箋 ##箍 ##箏 ##箐 ##箔 ##箕 ##算 ##箝 ##管 ##箩 ##箫 ##箭 ##箱 ##箴 ##箸 ##節 ##篁 ##範 ##篆 ##篇 ##築 ##篑 ##篓 ##篙 ##篝 ##篠 ##篡 ##篤 ##篩 ##篪 ##篮 ##篱 ##篷 ##簇 ##簌 ##簍 ##簡 ##簦 ##簧 ##簪 ##簫 ##簷 ##簸 ##簽 ##簾 ##簿 ##籁 ##籃 ##籌 ##籍 ##籐 ##籟 ##籠 ##籤 ##籬 ##籮 ##籲 ##米 ##类 ##籼 ##籽 ##粄 ##粉 ##粑 ##粒 ##粕 ##粗 ##粘 ##粟 ##粤 ##粥 ##粧 ##粪 ##粮 ##粱 ##粲 ##粳 ##粵 ##粹 ##粼 ##粽 ##精 ##粿 ##糅 ##糊 ##糍 ##糕 ##糖 ##糗 ##糙 ##糜 ##糞 ##糟 ##糠 ##糧 ##糬 ##糯 ##糰 ##糸 ##系 ##糾 ##紀 ##紂 ##約 ##紅 ##紉 ##紊 ##紋 ##納 ##紐 ##紓 ##純 ##紗 ##紘 ##紙 ##級 ##紛 ##紜 ##素 ##紡 ##索 ##紧 ##紫 ##紮 ##累 ##細 ##紳 ##紹 ##紺 ##終 ##絃 ##組 ##絆 ##経 ##結 ##絕 ##絞 ##絡 ##絢 ##給 ##絨 ##絮 ##統 ##絲 ##絳 ##絵 ##絶 ##絹 ##綁 ##綏 ##綑 ##經 ##継 ##続 ##綜 ##綠 ##綢 ##綦 ##綫 ##綬 ##維 ##綱 ##網 ##綴 ##綵 ##綸 ##綺 ##綻 ##綽 ##綾 ##綿 ##緊 ##緋 ##総 ##緑 ##緒 ##緘 ##線 ##緝 ##緞 ##締 ##緣 ##編 ##緩 ##緬 ##緯 ##練 ##緹 ##緻 ##縁 ##縄 ##縈 ##縛 ##縝 ##縣 ##縫 ##縮 ##縱 ##縴 ##縷 ##總 ##績 ##繁 ##繃 ##繆 ##繇 ##繋 ##織 ##繕 ##繚 ##繞 ##繡 ##繩 ##繪 ##繫 ##繭 ##繳 ##繹 ##繼 ##繽 ##纂 ##續 ##纍 ##纏 ##纓 ##纔 ##纖 ##纜 ##纠 ##红 ##纣 ##纤 ##约 ##级 ##纨 ##纪 ##纫 ##纬 ##纭 ##纯 ##纰 ##纱 ##纲 ##纳 ##纵 ##纶 ##纷 ##纸 ##纹 ##纺 ##纽 ##纾 ##线 ##绀 ##练 ##组 ##绅 ##细 ##织 ##终 ##绊 ##绍 ##绎 ##经 ##绑 ##绒 ##结 ##绔 ##绕 ##绘 ##给 ##绚 ##绛 ##络 ##绝 ##绞 ##统 ##绡 ##绢 ##绣 ##绥 ##绦 ##继 ##绩 ##绪 ##绫 ##续 ##绮 ##绯 ##绰 ##绳 ##维 ##绵 ##绶 ##绷 ##绸 ##绻 ##综 ##绽 ##绾 ##绿 ##缀 ##缄 ##缅 ##缆 ##缇 ##缈 ##缉 ##缎 ##缓 ##缔 ##缕 ##编 ##缘 ##缙 ##缚 ##缜 ##缝 ##缠 ##缢 ##缤 ##缥 ##缨 ##缩 ##缪 ##缭 ##缮 ##缰 ##缱 ##缴 ##缸 ##缺 ##缽 ##罂 ##罄 ##罌 ##罐 ##网 ##罔 ##罕 ##罗 ##罚 ##罡 ##罢 ##罩 ##罪 ##置 ##罰 ##署 ##罵 ##罷 ##罹 ##羁 ##羅 ##羈 ##羊 ##羌 ##美 ##羔 ##羚 ##羞 ##羟 ##羡 ##羣 ##群 ##羥 ##羧 ##羨 ##義 ##羯 ##羲 ##羸 ##羹 ##羽 ##羿 ##翁 ##翅 ##翊 ##翌 ##翎 ##習 ##翔 ##翘 ##翟 ##翠 ##翡 ##翦 ##翩 ##翰 ##翱 ##翳 ##翹 ##翻 ##翼 ##耀 ##老 ##考 ##耄 ##者 ##耆 ##耋 
##而 ##耍 ##耐 ##耒 ##耕 ##耗 ##耘 ##耙 ##耦 ##耨 ##耳 ##耶 ##耷 ##耸 ##耻 ##耽 ##耿 ##聂 ##聆 ##聊 ##聋 ##职 ##聒 ##联 ##聖 ##聘 ##聚 ##聞 ##聪 ##聯 ##聰 ##聲 ##聳 ##聴 ##聶 ##職 ##聽 ##聾 ##聿 ##肃 ##肄 ##肅 ##肆 ##肇 ##肉 ##肋 ##肌 ##肏 ##肓 ##肖 ##肘 ##肚 ##肛 ##肝 ##肠 ##股 ##肢 ##肤 ##肥 ##肩 ##肪 ##肮 ##肯 ##肱 ##育 ##肴 ##肺 ##肽 ##肾 ##肿 ##胀 ##胁 ##胃 ##胄 ##胆 ##背 ##胍 ##胎 ##胖 ##胚 ##胛 ##胜 ##胝 ##胞 ##胡 ##胤 ##胥 ##胧 ##胫 ##胭 ##胯 ##胰 ##胱 ##胳 ##胴 ##胶 ##胸 ##胺 ##能 ##脂 ##脅 ##脆 ##脇 ##脈 ##脉 ##脊 ##脍 ##脏 ##脐 ##脑 ##脓 ##脖 ##脘 ##脚 ##脛 ##脣 ##脩 ##脫 ##脯 ##脱 ##脲 ##脳 ##脸 ##脹 ##脾 ##腆 ##腈 ##腊 ##腋 ##腌 ##腎 ##腐 ##腑 ##腓 ##腔 ##腕 ##腥 ##腦 ##腩 ##腫 ##腭 ##腮 ##腰 ##腱 ##腳 ##腴 ##腸 ##腹 ##腺 ##腻 ##腼 ##腾 ##腿 ##膀 ##膈 ##膊 ##膏 ##膑 ##膘 ##膚 ##膛 ##膜 ##膝 ##膠 ##膦 ##膨 ##膩 ##膳 ##膺 ##膻 ##膽 ##膾 ##膿 ##臀 ##臂 ##臃 ##臆 ##臉 ##臊 ##臍 ##臓 ##臘 ##臟 ##臣 ##臥 ##臧 ##臨 ##自 ##臬 ##臭 ##至 ##致 ##臺 ##臻 ##臼 ##臾 ##舀 ##舂 ##舅 ##舆 ##與 ##興 ##舉 ##舊 ##舌 ##舍 ##舎 ##舐 ##舒 ##舔 ##舖 ##舗 ##舛 ##舜 ##舞 ##舟 ##航 ##舫 ##般 ##舰 ##舱 ##舵 ##舶 ##舷 ##舸 ##船 ##舺 ##舾 ##艇 ##艋 ##艘 ##艙 ##艦 ##艮 ##良 ##艰 ##艱 ##色 ##艳 ##艷 ##艹 ##艺 ##艾 ##节 ##芃 ##芈 ##芊 ##芋 ##芍 ##芎 ##芒 ##芙 ##芜 ##芝 ##芡 ##芥 ##芦 ##芩 ##芪 ##芫 ##芬 ##芭 ##芮 ##芯 ##花 ##芳 ##芷 ##芸 ##芹 ##芻 ##芽 ##芾 ##苁 ##苄 ##苇 ##苋 ##苍 ##苏 ##苑 ##苒 ##苓 ##苔 ##苕 ##苗 ##苛 ##苜 ##苞 ##苟 ##苡 ##苣 ##若 ##苦 ##苫 ##苯 ##英 ##苷 ##苹 ##苻 ##茁 ##茂 ##范 ##茄 ##茅 ##茉 ##茎 ##茏 ##茗 ##茜 ##茧 ##茨 ##茫 ##茬 ##茭 ##茯 ##茱 ##茲 ##茴 ##茵 ##茶 ##茸 ##茹 ##茼 ##荀 ##荃 ##荆 ##草 ##荊 ##荏 ##荐 ##荒 ##荔 ##荖 ##荘 ##荚 ##荞 ##荟 ##荠 ##荡 ##荣 ##荤 ##荥 ##荧 ##荨 ##荪 ##荫 ##药 ##荳 ##荷 ##荸 ##荻 ##荼 ##荽 ##莅 ##莆 ##莉 ##莊 ##莎 ##莒 ##莓 ##莖 ##莘 ##莞 ##莠 ##莢 ##莧 ##莪 ##莫 ##莱 ##莲 ##莴 ##获 ##莹 ##莺 ##莽 ##莿 ##菀 ##菁 ##菅 ##菇 ##菈 ##菊 ##菌 ##菏 ##菓 ##菖 ##菘 ##菜 ##菟 ##菠 ##菡 ##菩 ##華 ##菱 ##菲 ##菸 ##菽 ##萁 ##萃 ##萄 ##萊 ##萋 ##萌 ##萍 ##萎 ##萘 ##萝 ##萤 ##营 ##萦 ##萧 ##萨 ##萩 ##萬 ##萱 ##萵 ##萸 ##萼 ##落 ##葆 ##葉 ##著 ##葚 ##葛 ##葡 ##董 ##葦 ##葩 ##葫 ##葬 ##葭 ##葯 ##葱 ##葳 ##葵 ##葷 ##葺 ##蒂 ##蒋 ##蒐 ##蒔 ##蒙 ##蒜 ##蒞 ##蒟 ##蒡 ##蒨 ##蒲 ##蒸 ##蒹 ##蒻 ##蒼 ##蒿 ##蓁 ##蓄 ##蓆 ##蓉 ##蓋 ##蓑 ##蓓 ##蓖 ##蓝 ##蓟 ##蓦 ##蓬 ##蓮 ##蓼 ##蓿 ##蔑 ##蔓 ##蔔 ##蔗 ##蔘 ##蔚 ##蔡 ##蔣 ##蔥 ##蔫 ##蔬 ##蔭 ##蔵 ##蔷 ##蔺 ##蔻 ##蔼 ##蔽 ##蕁 ##蕃 ##蕈 ##蕉 ##蕊 ##蕎 ##蕙 ##蕤 ##蕨 ##蕩 ##蕪 ##蕭 ##蕲 ##蕴 ##蕻 
##蕾 ##薄 ##薅 ##薇 ##薈 ##薊 ##薏 ##薑 ##薔 ##薙 ##薛 ##薦 ##薨 ##薩 ##薪 ##薬 ##薯 ##薰 ##薹 ##藉 ##藍 ##藏 ##藐 ##藓 ##藕 ##藜 ##藝 ##藤 ##藥 ##藩 ##藹 ##藻 ##藿 ##蘆 ##蘇 ##蘊 ##蘋 ##蘑 ##蘚 ##蘭 ##蘸 ##蘼 ##蘿 ##虎 ##虏 ##虐 ##虑 ##虔 ##處 ##虚 ##虛 ##虜 ##虞 ##號 ##虢 ##虧 ##虫 ##虬 ##虱 ##虹 ##虻 ##虽 ##虾 ##蚀 ##蚁 ##蚂 ##蚊 ##蚌 ##蚓 ##蚕 ##蚜 ##蚝 ##蚣 ##蚤 ##蚩 ##蚪 ##蚯 ##蚱 ##蚵 ##蛀 ##蛆 ##蛇 ##蛊 ##蛋 ##蛎 ##蛐 ##蛔 ##蛙 ##蛛 ##蛟 ##蛤 ##蛭 ##蛮 ##蛰 ##蛳 ##蛹 ##蛻 ##蛾 ##蜀 ##蜂 ##蜃 ##蜆 ##蜇 ##蜈 ##蜊 ##蜍 ##蜒 ##蜓 ##蜕 ##蜗 ##蜘 ##蜚 ##蜜 ##蜡 ##蜢 ##蜥 ##蜱 ##蜴 ##蜷 ##蜻 ##蜿 ##蝇 ##蝈 ##蝉 ##蝌 ##蝎 ##蝕 ##蝗 ##蝙 ##蝟 ##蝠 ##蝦 ##蝨 ##蝴 ##蝶 ##蝸 ##蝼 ##螂 ##螃 ##融 ##螞 ##螢 ##螨 ##螯 ##螳 ##螺 ##蟀 ##蟄 ##蟆 ##蟋 ##蟎 ##蟑 ##蟒 ##蟠 ##蟬 ##蟲 ##蟹 ##蟻 ##蟾 ##蠅 ##蠍 ##蠔 ##蠕 ##蠛 ##蠟 ##蠡 ##蠢 ##蠣 ##蠱 ##蠶 ##蠹 ##蠻 ##血 ##衄 ##衅 ##衆 ##行 ##衍 ##術 ##衔 ##街 ##衙 ##衛 ##衝 ##衞 ##衡 ##衢 ##衣 ##补 ##表 ##衩 ##衫 ##衬 ##衮 ##衰 ##衲 ##衷 ##衹 ##衾 ##衿 ##袁 ##袂 ##袄 ##袅 ##袈 ##袋 ##袍 ##袒 ##袖 ##袜 ##袞 ##袤 ##袪 ##被 ##袭 ##袱 ##裁 ##裂 ##装 ##裆 ##裊 ##裏 ##裔 ##裕 ##裘 ##裙 ##補 ##裝 ##裟 ##裡 ##裤 ##裨 ##裱 ##裳 ##裴 ##裸 ##裹 ##製 ##裾 ##褂 ##複 ##褐 ##褒 ##褓 ##褔 ##褚 ##褥 ##褪 ##褫 ##褲 ##褶 ##褻 ##襁 ##襄 ##襟 ##襠 ##襪 ##襬 ##襯 ##襲 ##西 ##要 ##覃 ##覆 ##覇 ##見 ##規 ##覓 ##視 ##覚 ##覦 ##覧 ##親 ##覬 ##観 ##覷 ##覺 ##覽 ##觀 ##见 ##观 ##规 ##觅 ##视 ##览 ##觉 ##觊 ##觎 ##觐 ##觑 ##角 ##觞 ##解 ##觥 ##触 ##觸 ##言 ##訂 ##計 ##訊 ##討 ##訓 ##訕 ##訖 ##託 ##記 ##訛 ##訝 ##訟 ##訣 ##訥 ##訪 ##設 ##許 ##訳 ##訴 ##訶 ##診 ##註 ##証 ##詆 ##詐 ##詔 ##評 ##詛 ##詞 ##詠 ##詡 ##詢 ##詣 ##試 ##詩 ##詫 ##詬 ##詭 ##詮 ##詰 ##話 ##該 ##詳 ##詹 ##詼 ##誅 ##誇 ##誉 ##誌 ##認 ##誓 ##誕 ##誘 ##語 ##誠 ##誡 ##誣 ##誤 ##誥 ##誦 ##誨 ##說 ##説 ##読 ##誰 ##課 ##誹 ##誼 ##調 ##諄 ##談 ##請 ##諏 ##諒 ##論 ##諗 ##諜 ##諡 ##諦 ##諧 ##諫 ##諭 ##諮 ##諱 ##諳 ##諷 ##諸 ##諺 ##諾 ##謀 ##謁 ##謂 ##謄 ##謊 ##謎 ##謐 ##謔 ##謗 ##謙 ##講 ##謝 ##謠 ##謨 ##謬 ##謹 ##謾 ##譁 ##證 ##譎 ##譏 ##識 ##譙 ##譚 ##譜 ##警 ##譬 ##譯 ##議 ##譲 ##譴 ##護 ##譽 ##讀 ##變 ##讓 ##讚 ##讞 ##计 ##订 ##认 ##讥 ##讧 ##讨 ##让 ##讪 ##讫 ##训 ##议 ##讯 ##记 ##讲 ##讳 ##讴 ##讶 ##讷 ##许 ##讹 ##论 ##讼 ##讽 ##设 ##访 ##诀 ##证 ##诃 ##评 ##诅 ##识 ##诈 ##诉 ##诊 ##诋 ##词 ##诏 ##译 ##试 ##诗 ##诘 ##诙 ##诚 ##诛 ##话 ##诞 ##诟 ##诠 ##诡 ##询 ##诣 ##诤 ##该 ##详 ##诧 ##诩 ##诫 ##诬 ##语 ##误 ##诰 ##诱 ##诲 ##说 ##诵 ##诶 ##请 ##诸 ##诺 ##读 ##诽 ##课 ##诿 ##谀 ##谁 ##调 
##谄 ##谅 ##谆 ##谈 ##谊 ##谋 ##谌 ##谍 ##谎 ##谏 ##谐 ##谑 ##谒 ##谓 ##谔 ##谕 ##谗 ##谘 ##谙 ##谚 ##谛 ##谜 ##谟 ##谢 ##谣 ##谤 ##谥 ##谦 ##谧 ##谨 ##谩 ##谪 ##谬 ##谭 ##谯 ##谱 ##谲 ##谴 ##谶 ##谷 ##豁 ##豆 ##豇 ##豈 ##豉 ##豊 ##豌 ##豎 ##豐 ##豔 ##豚 ##象 ##豢 ##豪 ##豫 ##豬 ##豹 ##豺 ##貂 ##貅 ##貌 ##貓 ##貔 ##貘 ##貝 ##貞 ##負 ##財 ##貢 ##貧 ##貨 ##販 ##貪 ##貫 ##責 ##貯 ##貰 ##貳 ##貴 ##貶 ##買 ##貸 ##費 ##貼 ##貽 ##貿 ##賀 ##賁 ##賂 ##賃 ##賄 ##資 ##賈 ##賊 ##賑 ##賓 ##賜 ##賞 ##賠 ##賡 ##賢 ##賣 ##賤 ##賦 ##質 ##賬 ##賭 ##賴 ##賺 ##購 ##賽 ##贅 ##贈 ##贊 ##贍 ##贏 ##贓 ##贖 ##贛 ##贝 ##贞 ##负 ##贡 ##财 ##责 ##贤 ##败 ##账 ##货 ##质 ##贩 ##贪 ##贫 ##贬 ##购 ##贮 ##贯 ##贰 ##贱 ##贲 ##贴 ##贵 ##贷 ##贸 ##费 ##贺 ##贻 ##贼 ##贾 ##贿 ##赁 ##赂 ##赃 ##资 ##赅 ##赈 ##赊 ##赋 ##赌 ##赎 ##赏 ##赐 ##赓 ##赔 ##赖 ##赘 ##赚 ##赛 ##赝 ##赞 ##赠 ##赡 ##赢 ##赣 ##赤 ##赦 ##赧 ##赫 ##赭 ##走 ##赳 ##赴 ##赵 ##赶 ##起 ##趁 ##超 ##越 ##趋 ##趕 ##趙 ##趟 ##趣 ##趨 ##足 ##趴 ##趵 ##趸 ##趺 ##趾 ##跃 ##跄 ##跆 ##跋 ##跌 ##跎 ##跑 ##跖 ##跚 ##跛 ##距 ##跟 ##跡 ##跤 ##跨 ##跩 ##跪 ##路 ##跳 ##践 ##跷 ##跹 ##跺 ##跻 ##踉 ##踊 ##踌 ##踏 ##踐 ##踝 ##踞 ##踟 ##踢 ##踩 ##踪 ##踮 ##踱 ##踴 ##踵 ##踹 ##蹂 ##蹄 ##蹇 ##蹈 ##蹉 ##蹊 ##蹋 ##蹑 ##蹒 ##蹙 ##蹟 ##蹣 ##蹤 ##蹦 ##蹩 ##蹬 ##蹭 ##蹲 ##蹴 ##蹶 ##蹺 ##蹼 ##蹿 ##躁 ##躇 ##躉 ##躊 ##躋 ##躍 ##躏 ##躪 ##身 ##躬 ##躯 ##躲 ##躺 ##軀 ##車 ##軋 ##軌 ##軍 ##軒 ##軟 ##転 ##軸 ##軼 ##軽 ##軾 ##較 ##載 ##輒 ##輓 ##輔 ##輕 ##輛 ##輝 ##輟 ##輩 ##輪 ##輯 ##輸 ##輻 ##輾 ##輿 ##轄 ##轅 ##轆 ##轉 ##轍 ##轎 ##轟 ##车 ##轧 ##轨 ##轩 ##转 ##轭 ##轮 ##软 ##轰 ##轲 ##轴 ##轶 ##轻 ##轼 ##载 ##轿 ##较 ##辄 ##辅 ##辆 ##辇 ##辈 ##辉 ##辊 ##辍 ##辐 ##辑 ##输 ##辕 ##辖 ##辗 ##辘 ##辙 ##辛 ##辜 ##辞 ##辟 ##辣 ##辦 ##辨 ##辩 ##辫 ##辭 ##辮 ##辯 ##辰 ##辱 ##農 ##边 ##辺 ##辻 ##込 ##辽 ##达 ##迁 ##迂 ##迄 ##迅 ##过 ##迈 ##迎 ##运 ##近 ##返 ##还 ##这 ##进 ##远 ##违 ##连 ##迟 ##迢 ##迤 ##迥 ##迦 ##迩 ##迪 ##迫 ##迭 ##述 ##迴 ##迷 ##迸 ##迹 ##迺 ##追 ##退 ##送 ##适 ##逃 ##逅 ##逆 ##选 ##逊 ##逍 ##透 ##逐 ##递 ##途 ##逕 ##逗 ##這 ##通 ##逛 ##逝 ##逞 ##速 ##造 ##逢 ##連 ##逮 ##週 ##進 ##逵 ##逶 ##逸 ##逻 ##逼 ##逾 ##遁 ##遂 ##遅 ##遇 ##遊 ##運 ##遍 ##過 ##遏 ##遐 ##遑 ##遒 ##道 ##達 ##違 ##遗 ##遙 ##遛 ##遜 ##遞 ##遠 ##遢 ##遣 ##遥 ##遨 ##適 ##遭 ##遮 ##遲 ##遴 ##遵 ##遶 ##遷 ##選 ##遺 ##遼 ##遽 ##避 ##邀 ##邁 ##邂 ##邃 ##還 ##邇 ##邈 ##邊 ##邋 ##邏 ##邑 ##邓 ##邕 ##邛 ##邝 ##邢 ##那 ##邦 ##邨 ##邪 ##邬 ##邮 ##邯 ##邰 ##邱 ##邳 ##邵 ##邸 ##邹 ##邺 ##邻 ##郁 
##郅 ##郊 ##郎 ##郑 ##郜 ##郝 ##郡 ##郢 ##郤 ##郦 ##郧 ##部 ##郫 ##郭 ##郴 ##郵 ##郷 ##郸 ##都 ##鄂 ##鄉 ##鄒 ##鄔 ##鄙 ##鄞 ##鄢 ##鄧 ##鄭 ##鄰 ##鄱 ##鄲 ##鄺 ##酉 ##酊 ##酋 ##酌 ##配 ##酐 ##酒 ##酗 ##酚 ##酝 ##酢 ##酣 ##酥 ##酩 ##酪 ##酬 ##酮 ##酯 ##酰 ##酱 ##酵 ##酶 ##酷 ##酸 ##酿 ##醃 ##醇 ##醉 ##醋 ##醍 ##醐 ##醒 ##醚 ##醛 ##醜 ##醞 ##醣 ##醪 ##醫 ##醬 ##醮 ##醯 ##醴 ##醺 ##釀 ##釁 ##采 ##釉 ##释 ##釋 ##里 ##重 ##野 ##量 ##釐 ##金 ##釗 ##釘 ##釜 ##針 ##釣 ##釦 ##釧 ##釵 ##鈀 ##鈉 ##鈍 ##鈎 ##鈔 ##鈕 ##鈞 ##鈣 ##鈦 ##鈪 ##鈴 ##鈺 ##鈾 ##鉀 ##鉄 ##鉅 ##鉉 ##鉑 ##鉗 ##鉚 ##鉛 ##鉤 ##鉴 ##鉻 ##銀 ##銃 ##銅 ##銑 ##銓 ##銖 ##銘 ##銜 ##銬 ##銭 ##銮 ##銳 ##銷 ##銹 ##鋁 ##鋅 ##鋒 ##鋤 ##鋪 ##鋰 ##鋸 ##鋼 ##錄 ##錐 ##錘 ##錚 ##錠 ##錢 ##錦 ##錨 ##錫 ##錮 ##錯 ##録 ##錳 ##錶 ##鍊 ##鍋 ##鍍 ##鍛 ##鍥 ##鍰 ##鍵 ##鍺 ##鍾 ##鎂 ##鎊 ##鎌 ##鎏 ##鎔 ##鎖 ##鎗 ##鎚 ##鎧 ##鎬 ##鎮 ##鎳 ##鏈 ##鏖 ##鏗 ##鏘 ##鏞 ##鏟 ##鏡 ##鏢 ##鏤 ##鏽 ##鐘 ##鐮 ##鐲 ##鐳 ##鐵 ##鐸 ##鐺 ##鑄 ##鑊 ##鑑 ##鑒 ##鑣 ##鑫 ##鑰 ##鑲 ##鑼 ##鑽 ##鑾 ##鑿 ##针 ##钉 ##钊 ##钎 ##钏 ##钒 ##钓 ##钗 ##钙 ##钛 ##钜 ##钝 ##钞 ##钟 ##钠 ##钡 ##钢 ##钣 ##钤 ##钥 ##钦 ##钧 ##钨 ##钩 ##钮 ##钯 ##钰 ##钱 ##钳 ##钴 ##钵 ##钺 ##钻 ##钼 ##钾 ##钿 ##铀 ##铁 ##铂 ##铃 ##铄 ##铅 ##铆 ##铉 ##铎 ##铐 ##铛 ##铜 ##铝 ##铠 ##铡 ##铢 ##铣 ##铤 ##铨 ##铩 ##铬 ##铭 ##铮 ##铰 ##铲 ##铵 ##银 ##铸 ##铺 ##链 ##铿 ##销 ##锁 ##锂 ##锄 ##锅 ##锆 ##锈 ##锉 ##锋 ##锌 ##锏 ##锐 ##锑 ##错 ##锚 ##锟 ##锡 ##锢 ##锣 ##锤 ##锥 ##锦 ##锭 ##键 ##锯 ##锰 ##锲 ##锵 ##锹 ##锺 ##锻 ##镀 ##镁 ##镂 ##镇 ##镉 ##镌 ##镍 ##镐 ##镑 ##镕 ##镖 ##镗 ##镛 ##镜 ##镣 ##镭 ##镯 ##镰 ##镳 ##镶 ##長 ##长 ##門 ##閃 ##閉 ##開 ##閎 ##閏 ##閑 ##閒 ##間 ##閔 ##閘 ##閡 ##関 ##閣 ##閥 ##閨 ##閩 ##閱 ##閲 ##閹 ##閻 ##閾 ##闆 ##闇 ##闊 ##闌 ##闍 ##闔 ##闕 ##闖 ##闘 ##關 ##闡 ##闢 ##门 ##闪 ##闫 ##闭 ##问 ##闯 ##闰 ##闲 ##间 ##闵 ##闷 ##闸 ##闹 ##闺 ##闻 ##闽 ##闾 ##阀 ##阁 ##阂 ##阅 ##阆 ##阇 ##阈 ##阉 ##阎 ##阐 ##阑 ##阔 ##阕 ##阖 ##阙 ##阚 ##阜 ##队 ##阡 ##阪 ##阮 ##阱 ##防 ##阳 ##阴 ##阵 ##阶 ##阻 ##阿 ##陀 ##陂 ##附 ##际 ##陆 ##陇 ##陈 ##陋 ##陌 ##降 ##限 ##陕 ##陛 ##陝 ##陞 ##陟 ##陡 ##院 ##陣 ##除 ##陨 ##险 ##陪 ##陰 ##陲 ##陳 ##陵 ##陶 ##陷 ##陸 ##険 ##陽 ##隅 ##隆 ##隈 ##隊 ##隋 ##隍 ##階 ##随 ##隐 ##隔 ##隕 ##隘 ##隙 ##際 ##障 ##隠 ##隣 ##隧 ##隨 ##險 ##隱 ##隴 ##隶 ##隸 ##隻 ##隼 ##隽 ##难 ##雀 ##雁 ##雄 ##雅 ##集 ##雇 ##雉 ##雋 ##雌 ##雍 ##雎 ##雏 ##雑 ##雒 ##雕 ##雖 ##雙 ##雛 ##雜 ##雞 ##離 ##難 ##雨 ##雪 ##雯 ##雰 ##雲 ##雳 ##零 ##雷 ##雹 ##電 ##雾 ##需 
##霁 ##霄 ##霆 ##震 ##霈 ##霉 ##霊 ##霍 ##霎 ##霏 ##霑 ##霓 ##霖 ##霜 ##霞 ##霧 ##霭 ##霰 ##露 ##霸 ##霹 ##霽 ##霾 ##靂 ##靄 ##靈 ##青 ##靓 ##靖 ##静 ##靚 ##靛 ##靜 ##非 ##靠 ##靡 ##面 ##靥 ##靦 ##革 ##靳 ##靴 ##靶 ##靼 ##鞅 ##鞋 ##鞍 ##鞏 ##鞑 ##鞘 ##鞠 ##鞣 ##鞦 ##鞭 ##韆 ##韋 ##韌 ##韓 ##韜 ##韦 ##韧 ##韩 ##韬 ##韭 ##音 ##韵 ##韶 ##韻 ##響 ##頁 ##頂 ##頃 ##項 ##順 ##須 ##頌 ##預 ##頑 ##頒 ##頓 ##頗 ##領 ##頜 ##頡 ##頤 ##頫 ##頭 ##頰 ##頷 ##頸 ##頹 ##頻 ##頼 ##顆 ##題 ##額 ##顎 ##顏 ##顔 ##願 ##顛 ##類 ##顧 ##顫 ##顯 ##顱 ##顴 ##页 ##顶 ##顷 ##项 ##顺 ##须 ##顼 ##顽 ##顾 ##顿 ##颁 ##颂 ##预 ##颅 ##领 ##颇 ##颈 ##颉 ##颊 ##颌 ##颍 ##颐 ##频 ##颓 ##颔 ##颖 ##颗 ##题 ##颚 ##颛 ##颜 ##额 ##颞 ##颠 ##颡 ##颢 ##颤 ##颦 ##颧 ##風 ##颯 ##颱 ##颳 ##颶 ##颼 ##飄 ##飆 ##风 ##飒 ##飓 ##飕 ##飘 ##飙 ##飚 ##飛 ##飞 ##食 ##飢 ##飨 ##飩 ##飪 ##飯 ##飲 ##飼 ##飽 ##飾 ##餃 ##餅 ##餉 ##養 ##餌 ##餐 ##餒 ##餓 ##餘 ##餚 ##餛 ##餞 ##餡 ##館 ##餮 ##餵 ##餾 ##饅 ##饈 ##饋 ##饌 ##饍 ##饑 ##饒 ##饕 ##饗 ##饞 ##饥 ##饨 ##饪 ##饬 ##饭 ##饮 ##饯 ##饰 ##饱 ##饲 ##饴 ##饵 ##饶 ##饷 ##饺 ##饼 ##饽 ##饿 ##馀 ##馁 ##馄 ##馅 ##馆 ##馈 ##馋 ##馍 ##馏 ##馒 ##馔 ##首 ##馗 ##香 ##馥 ##馨 ##馬 ##馭 ##馮 ##馳 ##馴 ##駁 ##駄 ##駅 ##駆 ##駐 ##駒 ##駕 ##駛 ##駝 ##駭 ##駱 ##駿 ##騁 ##騎 ##騏 ##験 ##騙 ##騨 ##騰 ##騷 ##驀 ##驅 ##驊 ##驍 ##驒 ##驕 ##驗 ##驚 ##驛 ##驟 ##驢 ##驥 ##马 ##驭 ##驮 ##驯 ##驰 ##驱 ##驳 ##驴 ##驶 ##驷 ##驸 ##驹 ##驻 ##驼 ##驾 ##驿 ##骁 ##骂 ##骄 ##骅 ##骆 ##骇 ##骈 ##骊 ##骋 ##验 ##骏 ##骐 ##骑 ##骗 ##骚 ##骛 ##骜 ##骞 ##骠 ##骡 ##骤 ##骥 ##骧 ##骨 ##骯 ##骰 ##骶 ##骷 ##骸 ##骼 ##髂 ##髅 ##髋 ##髏 ##髒 ##髓 ##體 ##髖 ##高 ##髦 ##髪 ##髮 ##髯 ##髻 ##鬃 ##鬆 ##鬍 ##鬓 ##鬚 ##鬟 ##鬢 ##鬣 ##鬥 ##鬧 ##鬱 ##鬼 ##魁 ##魂 ##魄 ##魅 ##魇 ##魍 ##魏 ##魔 ##魘 ##魚 ##魯 ##魷 ##鮑 ##鮨 ##鮪 ##鮭 ##鮮 ##鯉 ##鯊 ##鯖 ##鯛 ##鯨 ##鯰 ##鯽 ##鰍 ##鰓 ##鰭 ##鰲 ##鰻 ##鰾 ##鱈 ##鱉 ##鱔 ##鱗 ##鱷 ##鱸 ##鱼 ##鱿 ##鲁 ##鲈 ##鲍 ##鲑 ##鲛 ##鲜 ##鲟 ##鲢 ##鲤 ##鲨 ##鲫 ##鲱 ##鲲 ##鲶 ##鲷 ##鲸 ##鳃 ##鳄 ##鳅 ##鳌 ##鳍 ##鳕 ##鳖 ##鳗 ##鳝 ##鳞 ##鳥 ##鳩 ##鳳 ##鳴 ##鳶 ##鴉 ##鴕 ##鴛 ##鴦 ##鴨 ##鴻 ##鴿 ##鵑 ##鵜 ##鵝 ##鵡 ##鵬 ##鵰 ##鵲 ##鶘 ##鶩 ##鶯 ##鶴 ##鷗 ##鷲 ##鷹 ##鷺 ##鸚 ##鸞 ##鸟 ##鸠 ##鸡 ##鸢 ##鸣 ##鸥 ##鸦 ##鸨 ##鸪 ##鸭 ##鸯 ##鸳 ##鸵 ##鸽 ##鸾 ##鸿 ##鹂 ##鹃 ##鹄 ##鹅 ##鹈 ##鹉 ##鹊 ##鹌 ##鹏 ##鹑 ##鹕 ##鹘 ##鹜 ##鹞 ##鹤 ##鹦 ##鹧 ##鹫 ##鹭 ##鹰 ##鹳 ##鹵 ##鹹 ##鹼 ##鹽 ##鹿 ##麂 ##麋 ##麒 ##麓 ##麗 ##麝 ##麟 ##麥 ##麦 ##麩 ##麴 ##麵 ##麸 ##麺 ##麻 ##麼 ##麽 ##麾 ##黃 ##黄 ##黍 ##黎 
##黏 ##黑 ##黒 ##黔 ##默 ##黛 ##黜 ##黝 ##點 ##黠 ##黨 ##黯 ##黴 ##鼋 ##鼎 ##鼐 ##鼓 ##鼠 ##鼬 ##鼹 ##鼻 ##鼾 ##齁 ##齊 ##齋 ##齐 ##齒 ##齡 ##齢 ##齣 ##齦 ##齿 ##龄 ##龅 ##龈 ##龊 ##龋 ##龌 ##龍 ##龐 ##龔 ##龕 ##龙 ##龚 ##龛 ##龜 ##龟 ##︰ ##︱ ##︶ ##︿ ##﹁ ##﹂ ##﹍ ##﹏ ##﹐ ##﹑ ##﹒ ##﹔ ##﹕ ##﹖ ##﹗ ##﹙ ##﹚ ##﹝ ##﹞ ##﹡ ##﹣ ##! ##" ### ##$ ##% ##& ##' ##( ##) ##* ##, ##- ##. ##/ ##: ##; ##< ##? ##@ ##[ ##\ ##] ##^ ##_ ##` ##f ##h ##j ##u ##w ##z ##{ ##} ##。 ##「 ##」 ##、 ##・ ##ッ ##ー ##イ ##ク ##シ ##ス ##ト ##ノ ##フ ##ラ ##ル ##ン ##゙ ##゚ ## ̄ ##¥ ##👍 ##🔥 ##😂 ##😎

================================================
FILE: args.py
================================================
import os
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

file_path = os.path.dirname(__file__)

# Model directory
model_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
# Config file
config_name = os.path.join(file_path, 'albert_config/albert_config_tiny.json')
# Checkpoint file name
ckpt_name = os.path.join(model_dir, 'model.ckpt')
# Output directory
output_dir = os.path.join(file_path, 'albert_lcqmc_checkpoints/')
# Vocab file
vocab_file = os.path.join(file_path, 'albert_config/vocab.txt')
# Data directory
data_dir = os.path.join(file_path, 'data/')

num_train_epochs = 10
batch_size = 128
learning_rate = 0.00005

# Fraction of GPU memory to use
gpu_memory_fraction = 0.8

# By default, use the output of the second-to-last layer as the sentence vector
layer_indexes = [-2]

# Maximum sequence length; a smaller value is recommended for single-text input
max_seq_len = 128

# Graph file name
graph_file = os.path.join(file_path, 'albert_lcqmc_checkpoints/graph')


================================================
FILE: bert_utils.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import copy
import json
import math
import re
import six
import tensorflow as tf


def get_shape_list(tensor, expected_rank=None, name=None):
  """Returns a list of the shape of tensor, preferring static dimensions.

  Args:
    tensor: A tf.Tensor object to find the shape of.
    expected_rank: (optional) int. The expected rank of `tensor`.
      If this is specified and the `tensor` has a different rank, an exception
      will be thrown.
    name: Optional name of the tensor for the error message.

  Returns:
    A list of dimensions of the shape of tensor. All static dimensions will
    be returned as python integers, and dynamic dimensions will be returned
    as tf.Tensor scalars.
  """
  if name is None:
    name = tensor.name

  if expected_rank is not None:
    assert_rank(tensor, expected_rank, name)

  shape = tensor.shape.as_list()

  non_static_indexes = []
  for (index, dim) in enumerate(shape):
    if dim is None:
      non_static_indexes.append(index)

  if not non_static_indexes:
    return shape

  dyn_shape = tf.shape(tensor)
  for index in non_static_indexes:
    shape[index] = dyn_shape[index]
  return shape


def reshape_to_matrix(input_tensor):
  """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
  ndims = input_tensor.shape.ndims
  if ndims < 2:
    raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
                     (input_tensor.shape))
  if ndims == 2:
    return input_tensor

  width = input_tensor.shape[-1]
  output_tensor = tf.reshape(input_tensor, [-1, width])
  return output_tensor


def reshape_from_matrix(output_tensor, orig_shape_list):
  """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
  if len(orig_shape_list) == 2:
    return output_tensor

  output_shape = get_shape_list(output_tensor)

  orig_dims = orig_shape_list[0:-1]
  width = output_shape[-1]

  return tf.reshape(output_tensor, orig_dims + [width])


def assert_rank(tensor, expected_rank, name=None):
  """Raises an exception if the tensor rank is not of the expected rank.

  Args:
    tensor: A tf.Tensor to check the rank of.
    expected_rank: Python integer or list of integers, expected rank.
    name: Optional name of the tensor for the error message.

  Raises:
    ValueError: If the expected shape doesn't match the actual shape.
  """
  if name is None:
    name = tensor.name

  expected_rank_dict = {}
  if isinstance(expected_rank, six.integer_types):
    expected_rank_dict[expected_rank] = True
  else:
    for x in expected_rank:
      expected_rank_dict[x] = True

  actual_rank = tensor.shape.ndims
  if actual_rank not in expected_rank_dict:
    scope_name = tf.get_variable_scope().name
    raise ValueError(
        "For the tensor `%s` in scope `%s`, the actual rank "
        "`%d` (shape = %s) is not equal to the expected rank `%s`" %
        (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))


def gather_indexes(sequence_tensor, positions):
  """Gathers the vectors at the specific positions over a minibatch."""
  sequence_shape = get_shape_list(sequence_tensor, expected_rank=3)
  batch_size = sequence_shape[0]
  seq_length = sequence_shape[1]
  width = sequence_shape[2]

  flat_offsets = tf.reshape(
      tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
  flat_positions = tf.reshape(positions + flat_offsets, [-1])
  flat_sequence_tensor = tf.reshape(sequence_tensor,
                                    [batch_size * seq_length, width])
  output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
  return output_tensor


# add sequence mask for:
# 1. random shuffle lm modeling---xlnet with random shuffled input
# 2. left2right and right2left language modeling
# 3.
#    conditional generation
def generate_seq2seq_mask(attention_mask, mask_sequence, seq_type, **kargs):
  if seq_type == 'seq2seq':
    if mask_sequence is not None:
      seq_shape = get_shape_list(mask_sequence, expected_rank=2)
      seq_len = seq_shape[1]
      ones = tf.ones((1, seq_len, seq_len))
      # lower-triangular mask: row i keeps columns j <= i
      a_mask = tf.matrix_band_part(ones, -1, 0)
      s_ex12 = tf.expand_dims(tf.expand_dims(mask_sequence, 1), 2)
      s_ex13 = tf.expand_dims(tf.expand_dims(mask_sequence, 1), 3)
      # for a 0/1 mask_sequence: rows with value 0 keep every column with
      # value 0; rows with value 1 keep the lower-triangular columns
      a_mask = (1 - s_ex13) * (1 - s_ex12) + s_ex13 * a_mask
      # generate mask of batch x seq_len x seq_len
      a_mask = tf.reshape(a_mask, (-1, seq_len, seq_len))
      out_mask = attention_mask * a_mask
    else:
      ones = tf.ones_like(attention_mask[:1])
      mask = (tf.matrix_band_part(ones, -1, 0))
      out_mask = attention_mask * mask
  else:
    out_mask = attention_mask
  return out_mask


================================================
FILE: classifier_utils.py
================================================
# -*- coding: utf-8 -*-
# @Author: bo.shi
# @Date:   2019-12-01 22:28:41
# @Last Modified by:   bo.shi
# @Last Modified time: 2019-12-02 18:36:50
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utility functions for GLUE classification tasks."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import csv
import os
import six

import tensorflow as tf


def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")


class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs an InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label


class PaddingInputExample(object):
  """Fake example so the num input examples is a multiple of the batch size.

  When running eval/predict on the TPU, we need to pad the number of examples
  to be a multiple of the batch size, because the TPU requires a fixed batch
  size. The alternative is to drop the last batch, which is bad because it means
  the entire output data won't be generated.

  We use this class instead of `None` because treating `None` as padding
  batches could cause silent errors.
  """


class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    raise NotImplementedError()

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    raise NotImplementedError()

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    raise NotImplementedError()

  def get_labels(self):
    """Gets the list of labels for this data set."""
    raise NotImplementedError()

  @classmethod
  def _read_tsv(cls, input_file, delimiter="\t", quotechar=None):
    """Reads a tab separated value file."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
      lines = []
      for line in reader:
        lines.append(line)
      return lines

  @classmethod
  def _read_txt(cls, input_file):
    """Reads a text file whose fields are separated by `_!_`."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = f.readlines()
      lines = []
      for line in reader:
        lines.append(line.strip().split("_!_"))
      return lines

  @classmethod
  def _read_json(cls, input_file):
    """Reads a file that holds one JSON object per line."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = f.readlines()
      lines = []
      for line in reader:
        lines.append(json.loads(line.strip()))
      return lines


class XnliProcessor(DataProcessor):
  """Processor for the XNLI data set."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "train.json")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "dev.json")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "test.json")), "test")

  def _create_examples(self, lines, set_type):
    """Creates examples for the train, dev and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid
= "%s-%s" % (set_type, i) text_a = convert_to_unicode(line['premise']) text_b = convert_to_unicode(line['hypo']) label = convert_to_unicode(line['label']) if set_type != 'test' else 'contradiction' examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples def get_labels(self): """See base class.""" return ["contradiction", "entailment", "neutral"] # class TnewsProcessor(DataProcessor): # """Processor for the MRPC data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "toutiao_category_train.txt")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "toutiao_category_dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "toutiao_category_test.txt")), "test") # # def get_labels(self): # """See base class.""" # labels = [] # for i in range(17): # if i == 5 or i == 11: # continue # labels.append(str(100 + i)) # return labels # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # text_a = convert_to_unicode(line[3]) # text_b = None # label = convert_to_unicode(line[1]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples class TnewsProcessor(DataProcessor): """Processor for the MRPC data set (GLUE version).""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "train.json")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "dev.json")), 
"dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "test.json")), "test") def get_labels(self): """See base class.""" labels = [] for i in range(17): if i == 5 or i == 11: continue labels.append(str(100 + i)) return labels def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): guid = "%s-%s" % (set_type, i) text_a = convert_to_unicode(line['sentence']) text_b = None label = convert_to_unicode(line['label']) if set_type != 'test' else "100" examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples # class iFLYTEKDataProcessor(DataProcessor): # """Processor for the iFLYTEKData data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "train.txt")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "test.txt")), "test") # # def get_labels(self): # """See base class.""" # labels = [] # for i in range(119): # labels.append(str(i)) # return labels # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # text_a = convert_to_unicode(line[1]) # text_b = None # label = convert_to_unicode(line[0]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples class iFLYTEKDataProcessor(DataProcessor): """Processor for the iFLYTEKData data set (GLUE version).""" def get_train_examples(self, 
data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "train.json")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "dev.json")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "test.json")), "test") def get_labels(self): """See base class.""" labels = [] for i in range(119): labels.append(str(i)) return labels def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): guid = "%s-%s" % (set_type, i) text_a = convert_to_unicode(line['sentence']) text_b = None label = convert_to_unicode(line['label']) if set_type != 'test' else "0" examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples class AFQMCProcessor(DataProcessor): """Processor for the internal data set. 
sentence pair classification""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "train.json")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "dev.json")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "test.json")), "test") def get_labels(self): """See base class.""" return ["0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): guid = "%s-%s" % (set_type, i) text_a = convert_to_unicode(line['sentence1']) text_b = convert_to_unicode(line['sentence2']) label = convert_to_unicode(line['label']) if set_type != 'test' else '0' examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples class CMNLIProcessor(DataProcessor): """Processor for the CMNLI data set.""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples_json(os.path.join(data_dir, "train.json"), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples_json(os.path.join(data_dir, "dev.json"), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples_json(os.path.join(data_dir, "test.json"), "test") def get_labels(self): """See base class.""" return ["contradiction", "entailment", "neutral"] def _create_examples_json(self, file_name, set_type): """Creates examples for the training and dev sets.""" examples = [] lines = tf.gfile.Open(file_name, "r") index = 0 for line in lines: line_obj = json.loads(line) index = index + 1 guid = "%s-%s" % (set_type, index) text_a = convert_to_unicode(line_obj["sentence1"]) text_b = convert_to_unicode(line_obj["sentence2"]) label = 
convert_to_unicode(line_obj["label"]) if set_type != 'test' else 'neutral' if label != "-": examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples class CslProcessor(DataProcessor): """Processor for the CSL data set.""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "train.json")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "dev.json")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_json(os.path.join(data_dir, "test.json")), "test") def get_labels(self): """See base class.""" return ["0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): guid = "%s-%s" % (set_type, i) text_a = convert_to_unicode(" ".join(line['keyword'])) text_b = convert_to_unicode(line['abst']) label = convert_to_unicode(line['label']) if set_type != 'test' else '0' examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples # class InewsProcessor(DataProcessor): # """Processor for the MRPC data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "train.txt")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "test.txt")), "test") # # def get_labels(self): # """See base class.""" # labels = ["0", "1", "2"] # return labels # # def _create_examples(self, lines, set_type): # """Creates examples for the training and 
dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # text_a = convert_to_unicode(line[2]) # text_b = convert_to_unicode(line[3]) # label = convert_to_unicode(line[0]) if set_type != "test" else '0' # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples # # # class THUCNewsProcessor(DataProcessor): # """Processor for the THUCNews data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "train.txt")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_txt(os.path.join(data_dir, "test.txt")), "test") # # def get_labels(self): # """See base class.""" # labels = [] # for i in range(14): # labels.append(str(i)) # return labels # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0 or len(line) < 3: # continue # guid = "%s-%s" % (set_type, i) # text_a = convert_to_unicode(line[3]) # text_b = None # label = convert_to_unicode(line[0]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples # # class LCQMCProcessor(DataProcessor): # """Processor for the internal data set. 
sentence pair classification""" # # def __init__(self): # self.language = "zh" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "train.txt")), "train") # # dev_0827.tsv # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "test.txt")), "test") # # def get_labels(self): # """See base class.""" # return ["0", "1"] # # return ["-1","0", "1"] # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # print("length of lines:", len(lines)) # for (i, line) in enumerate(lines): # # print('#i:',i,line) # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # try: # label = convert_to_unicode(line[2]) # text_a = convert_to_unicode(line[0]) # text_b = convert_to_unicode(line[1]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # except Exception: # print('###error.i:', i, line) # return examples # # # class JDCOMMENTProcessor(DataProcessor): # """Processor for the internal data set. 
sentence pair classification""" # # def __init__(self): # self.language = "zh" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "jd_train.csv"), ",", "\""), "train") # # dev_0827.tsv # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "jd_dev.csv"), ",", "\""), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "jd_test.csv"), ",", "\""), "test") # # def get_labels(self): # """See base class.""" # return ["1", "2", "3", "4", "5"] # # return ["-1","0", "1"] # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # print("length of lines:", len(lines)) # for (i, line) in enumerate(lines): # # print('#i:',i,line) # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # try: # label = convert_to_unicode(line[0]) # text_a = convert_to_unicode(line[1]) # text_b = convert_to_unicode(line[2]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # except Exception: # print('###error.i:', i, line) # return examples # # # class BQProcessor(DataProcessor): # """Processor for the internal data set. 
sentence pair classification""" # # def __init__(self): # self.language = "zh" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "train.txt")), "train") # # dev_0827.tsv # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "test.txt")), "test") # # def get_labels(self): # """See base class.""" # return ["0", "1"] # # return ["-1","0", "1"] # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # print("length of lines:", len(lines)) # for (i, line) in enumerate(lines): # # print('#i:',i,line) # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # try: # label = convert_to_unicode(line[2]) # text_a = convert_to_unicode(line[0]) # text_b = convert_to_unicode(line[1]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # except Exception: # print('###error.i:', i, line) # return examples # # # class MnliProcessor(DataProcessor): # """Processor for the MultiNLI data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), # "dev_matched") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") # # def get_labels(self): # """See base class.""" # return ["contradiction", "entailment", "neutral"] # # def _create_examples(self, lines, set_type): 
# """Creates examples for the training and dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0: # continue # guid = "%s-%s" % (set_type, convert_to_unicode(line[0])) # text_a = convert_to_unicode(line[8]) # text_b = convert_to_unicode(line[9]) # if set_type == "test": # label = "contradiction" # else: # label = convert_to_unicode(line[-1]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples # # # class MrpcProcessor(DataProcessor): # """Processor for the MRPC data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") # # def get_test_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") # # def get_labels(self): # """See base class.""" # return ["0", "1"] # # def _create_examples(self, lines, set_type): # """Creates examples for the training and dev sets.""" # examples = [] # for (i, line) in enumerate(lines): # if i == 0: # continue # guid = "%s-%s" % (set_type, i) # text_a = convert_to_unicode(line[3]) # text_b = convert_to_unicode(line[4]) # if set_type == "test": # label = "0" # else: # label = convert_to_unicode(line[0]) # examples.append( # InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) # return examples # # # class ColaProcessor(DataProcessor): # """Processor for the CoLA data set (GLUE version).""" # # def get_train_examples(self, data_dir): # """See base class.""" # return self._create_examples( # self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") # # def get_dev_examples(self, data_dir): # """See base class.""" # return self._create_examples( # 
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
#
#   def get_test_examples(self, data_dir):
#     """See base class."""
#     return self._create_examples(
#         self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
#
#   def get_labels(self):
#     """See base class."""
#     return ["0", "1"]
#
#   def _create_examples(self, lines, set_type):
#     """Creates examples for the training and dev sets."""
#     examples = []
#     for (i, line) in enumerate(lines):
#       # Only the test set has a header
#       if set_type == "test" and i == 0:
#         continue
#       guid = "%s-%s" % (set_type, i)
#       if set_type == "test":
#         text_a = convert_to_unicode(line[1])
#         label = "0"
#       else:
#         text_a = convert_to_unicode(line[3])
#         label = convert_to_unicode(line[1])
#       examples.append(
#           InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
#     return examples


class WSCProcessor(DataProcessor):
  """Processor for the internal data set. sentence pair classification"""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "train.json")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "dev.json")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "test.json")), "test")

  def get_labels(self):
    """See base class."""
    return ["true", "false"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = convert_to_unicode(line['text'])
      text_a_list = list(text_a)
      target = line['target']
      query = target['span1_text']
      query_idx = target['span1_index']
      pronoun = target['span2_text']
      pronoun_idx = target['span2_index']

      assert text_a[pronoun_idx: (pronoun_idx + len(pronoun))
                    ] == pronoun, "pronoun: {}".format(pronoun)
      assert text_a[query_idx: (query_idx + len(query))] == query, "query: {}".format(query)

      if pronoun_idx > query_idx:
        text_a_list.insert(query_idx, "_")
        text_a_list.insert(query_idx + len(query) + 1, "_")
        text_a_list.insert(pronoun_idx + 2, "[")
        text_a_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
      else:
        text_a_list.insert(pronoun_idx, "[")
        text_a_list.insert(pronoun_idx + len(pronoun) + 1, "]")
        text_a_list.insert(query_idx + 2, "_")
        text_a_list.insert(query_idx + len(query) + 2 + 1, "_")

      text_a = "".join(text_a_list)

      if set_type == "test":
        label = "true"
      else:
        label = line['label']

      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples


class COPAProcessor(DataProcessor):
  """Processor for the internal data set. sentence pair classification"""

  def __init__(self):
    self.language = "zh"

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "train.json")), "train")
    # dev_0827.tsv

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "dev.json")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_json(os.path.join(data_dir, "test.json")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  @classmethod
  def _create_examples_one(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      guid1 = "%s-%s" % (set_type, i)
      # try:
      if line['question'] == 'cause':
        text_a = convert_to_unicode(line['premise'] + '原因是什么呢?' + line['choice0'])
        text_b = convert_to_unicode(line['premise'] + '原因是什么呢?' + line['choice1'])
      else:
        text_a = convert_to_unicode(line['premise'] + '造成了什么影响呢?' + line['choice0'])
        text_b = convert_to_unicode(line['premise'] + '造成了什么影响呢?' + line['choice1'])
      label = convert_to_unicode(str(1 if line['label'] == 0 else 0)) if set_type != 'test' else '0'
      examples.append(
          InputExample(guid=guid1, text_a=text_a, text_b=text_b, label=label))
      # except Exception as e:
      #   print('###error.i:',e, i, line)
    return examples

  @classmethod
  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      i = 2 * i
      guid1 = "%s-%s" % (set_type, i)
      guid2 = "%s-%s" % (set_type, i + 1)
      # try:
      premise = convert_to_unicode(line['premise'])
      choice0 = convert_to_unicode(line['choice0'])
      label = convert_to_unicode(str(1 if line['label'] == 0 else 0)) if set_type != 'test' else '0'
      #text_a2 = convert_to_unicode(line['premise'])
      choice1 = convert_to_unicode(line['choice1'])
      label2 = convert_to_unicode(
          str(0 if line['label'] == 0 else 1)) if set_type != 'test' else '0'
      if line['question'] == 'effect':
        text_a = premise
        text_b = choice0
        text_a2 = premise
        text_b2 = choice1
      elif line['question'] == 'cause':
        text_a = choice0
        text_b = premise
        text_a2 = choice1
        text_b2 = premise
      else:
        print('wrong format!!')
        return None
      examples.append(
          InputExample(guid=guid1, text_a=text_a, text_b=text_b, label=label))
      examples.append(
          InputExample(guid=guid2, text_a=text_a2, text_b=text_b2, label=label2))
      # except Exception as e:
      #   print('###error.i:',e, i, line)
    return examples


================================================
FILE: create_pretrain_data.sh
================================================
#!/usr/bin/env bash
BERT_BASE_DIR=./albert_config
python3 create_pretraining_data.py --do_whole_word_mask=True --input_file=data/news_zh_1.txt \
--output_file=data/tf_news_2016_zh_raw_news2016zh_1.tfrecord --vocab_file=$BERT_BASE_DIR/vocab.txt --do_lower_case=True \
--max_seq_length=512 --max_predictions_per_seq=51 --masked_lm_prob=0.10


================================================
FILE: create_pretraining_data.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language
Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Create masked LM/next sentence masked_lm TF examples for BERT."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import random
import tokenization
import tensorflow as tf
import jieba
import re

flags = tf.flags

FLAGS = flags.FLAGS

flags.DEFINE_string("input_file", None,
                    "Input raw text file (or comma-separated list of files).")

flags.DEFINE_string(
    "output_file", None,
    "Output TF example file (or comma-separated list of files).")

flags.DEFINE_string("vocab_file", None,
                    "The vocabulary file that the BERT model was trained on.")

flags.DEFINE_bool(
    "do_lower_case", True,
    "Whether to lower case the input text. Should be True for uncased "
    "models and False for cased models.")

flags.DEFINE_bool(
    "do_whole_word_mask", False,
    "Whether to use whole word masking rather than per-WordPiece masking.")

flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")

flags.DEFINE_integer("max_predictions_per_seq", 20,
                     "Maximum number of masked LM predictions per sequence.")

flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")

flags.DEFINE_integer(
    "dupe_factor", 10,
    "Number of times to duplicate the input data (with different masks).")

flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")

flags.DEFINE_float(
    "short_seq_prob", 0.1,
    "Probability of creating sequences which are shorter than the "
    "maximum length.")

flags.DEFINE_bool("non_chinese", False,
                  "Set this to True if you are not running a Chinese pre-training task.")


class TrainingInstance(object):
  """A single training instance (sentence pair)."""

  def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
               is_random_next):
    self.tokens = tokens
    self.segment_ids = segment_ids
    self.is_random_next = is_random_next
    self.masked_lm_positions = masked_lm_positions
    self.masked_lm_labels = masked_lm_labels

  def __str__(self):
    s = ""
    s += "tokens: %s\n" % (" ".join(
        [tokenization.printable_text(x) for x in self.tokens]))
    s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
    s += "is_random_next: %s\n" % self.is_random_next
    s += "masked_lm_positions: %s\n" % (" ".join(
        [str(x) for x in self.masked_lm_positions]))
    s += "masked_lm_labels: %s\n" % (" ".join(
        [tokenization.printable_text(x) for x in self.masked_lm_labels]))
    s += "\n"
    return s

  def __repr__(self):
    return self.__str__()


def write_instance_to_example_files(instances, tokenizer, max_seq_length,
                                    max_predictions_per_seq, output_files):
  """Create TF example files from `TrainingInstance`s."""
  writers = []
  for output_file in output_files:
    writers.append(tf.python_io.TFRecordWriter(output_file))
  writer_index = 0

  total_written = 0
  for (inst_index, instance) in enumerate(instances):
    input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = list(instance.segment_ids)
    assert len(input_ids) <= max_seq_length

    while len(input_ids) < max_seq_length:
      input_ids.append(0)
      input_mask.append(0)
      segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    masked_lm_positions = list(instance.masked_lm_positions)
    masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
    masked_lm_weights = [1.0] * len(masked_lm_ids)

    while len(masked_lm_positions) < max_predictions_per_seq:
      masked_lm_positions.append(0)
      masked_lm_ids.append(0)
      masked_lm_weights.append(0.0)

    next_sentence_label = 1 if instance.is_random_next else 0

    features = collections.OrderedDict()
    features["input_ids"] = create_int_feature(input_ids)
    features["input_mask"] = create_int_feature(input_mask)
    features["segment_ids"] = create_int_feature(segment_ids)
    features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
    features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
    features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
    features["next_sentence_labels"] = create_int_feature([next_sentence_label])

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))

    writers[writer_index].write(tf_example.SerializeToString())
    writer_index = (writer_index + 1) % len(writers)

    total_written += 1

    if inst_index < 20:
      tf.logging.info("*** Example ***")
      tf.logging.info("tokens: %s" % " ".join(
          [tokenization.printable_text(x) for x in instance.tokens]))

      for feature_name in features.keys():
        feature = features[feature_name]
        values = []
        if feature.int64_list.value:
          values = feature.int64_list.value
        elif feature.float_list.value:
          values = feature.float_list.value
        tf.logging.info(
            "%s: %s" % (feature_name, " ".join([str(x) for x in
            values])))

  for writer in writers:
    writer.close()

  tf.logging.info("Wrote %d total instances", total_written)


def create_int_feature(values):
  feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return feature


def create_float_feature(values):
  feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
  return feature


def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]

  # Input file format:
  # (1) One sentence per line. These should ideally be actual sentences, not
  # entire paragraphs or arbitrary spans of text. (Because we use the
  # sentence boundaries for the "next sentence prediction" task).
  # (2) Blank lines between documents. Document boundaries are needed so
  # that the "next sentence prediction" task doesn't span between documents.
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        strings = reader.readline()
        strings = strings.replace("   ", " ").replace("  ", " ")  # collapse runs of two or three spaces into one space
        line = tokenization.convert_to_unicode(strings)
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document_albert(  # changed to ALBERT style for sentence order prediction (SOP), 2019-08-28, brightmart
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances


def get_new_segment(segment):  # newly added method
  """
  Takes a sentence and returns a processed sentence.
  To support Chinese whole word masking, every non-initial character of a segmented word gets a special
  marker ("##"), so that later processing steps know which characters belong to the same word.
  :param segment: a sentence. e.g. ['悬', '灸', '技', '术', '培', '训', '专', '家', '教', '你', '艾', '灸', '降', '血', '糖', ',', '为', '爸', '妈', '收', '好', '了', '!']
  :return: the processed sentence. e.g. ['悬', '##灸', '技', '术', '培', '训', '专', '##家', '教', '你', '艾', '##灸', '降', '##血', '##糖', ',', '为', '爸', '##妈', '收', '##好', '了', '!']
  """
  seq_cws = jieba.lcut("".join(segment))  # word segmentation
  seq_cws_dict = {x: 1 for x in seq_cws}  # put the segmented words into a dict for fast lookup
  new_segment = []
  i = 0
  while i < len(segment):  # process from the first character until the whole sentence is done
    if len(re.findall('[\u4E00-\u9FA5]', segment[i])) == 0:  # no Chinese character found: append the token as-is, no special handling
      new_segment.append(segment[i])
      i += 1
      continue

    has_add = False
    for length in range(3, 0, -1):
      if i + length > len(segment):
        continue
      if ''.join(segment[i:i + length]) in seq_cws_dict:
        new_segment.append(segment[i])
        for l in range(1, length):
          new_segment.append('##' + segment[i + l])
        i += length
        has_add = True
        break
    if not has_add:
      new_segment.append(segment[i])
      i += 1
  # print("get_new_segment.wwm.get_new_segment:",new_segment)
  return new_segment


def create_instances_from_document_albert(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document.

  This method was changed to create sentence-order prediction (SOP) data,
  following the idea from the ALBERT paper, 2019-08-28, brightmart
  """
  document = all_documents[document_index]  # get one document

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens if rng.random() < short_seq_prob: # 有一定的比例,如10%的概率,我们使用比较短的序列长度,以缓解预训练的长序列和调优阶段(可能的)短序列的不一致情况 target_seq_length = rng.randint(2, max_num_tokens) # We DON'T just concatenate all of the tokens from a document into a long # sequence and choose an arbitrary split point because this would make the # next sentence prediction task too easy. Instead, we split the input into # segments "A" and "B" based on the actual "sentences" provided by the user # input. # 设法使用实际的句子,而不是任意的截断句子,从而更好的构造句子连贯性预测的任务 instances = [] current_chunk = [] # 当前处理的文本段,包含多个句子 current_length = 0 i = 0 # print("###document:",document) # 一个document可以是一整篇文章、新闻、词条等. document:[['是', '爷', '们', ',', '就', '得', '给', '媳', '妇', '幸', '福'], ['关', '注', '【', '晨', '曦', '教', '育', '】', ',', '获', '取', '育', '儿', '的', '智', '慧', ',', '与', '孩', '子', '一', '同', '成', '长', '!'], ['方', '法', ':', '打', '开', '微', '信', '→', '添', '加', '朋', '友', '→', '搜', '号', '→', '##he', '##bc', '##x', '##jy', '##→', '关', '注', '!', '我', '是', '一', '个', '爷', '们', ',', '孝', '顺', '是', '做', '人', '的', '第', '一', '准', '则', '。'], ['甭', '管', '小', '时', '候', '怎', '么', '跟', '家', '长', '犯', '混', '蛋', ',', '长', '大', '了', ',', '就', '底', '报', '答', '父', '母', ',', '以', '后', '我', '媳', '妇', '也', '必', '须', '孝', '顺', '。'], ['我', '是', '一', '个', '爷', '们', ',', '可', '以', '花', '心', ',', '可', '以', '好', '玩', '。'], ['但', '我', '一', '定', '会', '找', '一', '个', '管', '的', '住', '我', '的', '女', '人', ',', '和', '我', '一', '起', '生', '活', '。'], ['28', '岁', '以', '前', '在', '怎', '么', '玩', '都', '行', ',', '但', '我', '最', '后', '一', '定', '会', '找', '一', '个', '勤', '俭', '持', '家', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '我', '不', '会', '让', '自', '己', '的', '女', '人', '受', '一', '点', '委', '屈', ',', '每', '次', '把', '她', '抱', '在', '怀', '里', ',', '看', '她', '洋', '溢', '着', '幸', '福', '的', '脸', ',', '我', '都', '会', '引', '以', '为', '傲', ',', '这', '特', '么', '就', '是', '我', '的', '女', '人', '。'], ['我', '是', '一', '爷', '们', ',', '干', '什', '么', '也', '不', '能', '忘', '了', '自', '己', '媳', 
'妇', ',', '就', '算', '和', '哥', '们', '一', '起', '喝', '酒', ',', '喝', '到', '很', '晚', ',', '也', '要', '提', '前', '打', '电', '话', '告', '诉', '她', ',', '让', '她', '早', '点', '休', '息', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '绝', '对', '不', '能', '抽', '烟', ',', '喝', '酒', '还', '勉', '强', '过', '得', '去', ',', '不', '过', '该', '喝', '的', '时', '候', '喝', ',', '不', '该', '喝', '的', '时', '候', ',', '少', '扯', '纳', '极', '薄', '蛋', '。'], ['我', '是', '一', '爷', '们', ',', '我', '媳', '妇', '必', '须', '听', '我', '话', ',', '在', '人', '前', '一', '定', '要', '给', '我', '面', '子', ',', '回', '家', '了', '咱', '什', '么', '都', '好', '说', '。'], ['我', '是', '一', '爷', '们', ',', '就', '算', '难', '的', '吃', '不', '上', '饭', '了', ',', '都', '不', '张', '口', '跟', '媳', '妇', '要', '一', '分', '钱', '。'], ['我', '是', '一', '爷', '们', ',', '不', '管', '上', '学', '还', '是', '上', '班', ',', '我', '都', '会', '送', '媳', '妇', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '交', '往', '不', '到', '1', '年', ',', '绝', '对', '不', '会', '和', '媳', '妇', '提', '过', '分', '的', '要', '求', ',', '我', '会', '尊', '重', '她', '。'], ['我', '是', '一', '爷', '们', ',', '游', '戏', '永', '远', '比', '不', '上', '我', '媳', '妇', '重', '要', ',', '只', '要', '媳', '妇', '发', '话', ',', '我', '绝', '对', '唯', '命', '是', '从', '。'], ['我', '是', '一', '爷', '们', ',', '上', 'q', '绝', '对', '是', '为', '了', '等', '媳', '妇', ',', '所', '有', '暧', '昧', '的', '心', '情', '只', '为', '她', '一', '个', '女', '人', '而', '写', ',', '我', '不', '一', '定', '会', '经', '常', '写', '日', '志', ',', '可', '是', '我', '会', '告', '诉', '全', '世', '界', ',', '我', '很', '爱', '她', '。'], ['我', '是', '一', '爷', '们', ',', '不', '一', '定', '要', '经', '常', '制', '造', '浪', '漫', '、', '偶', '尔', '过', '个', '节', '日', '也', '要', '送', '束', '玫', '瑰', '花', '给', '媳', '妇', '抱', '回', '家', '。'], ['我', '是', '一', '爷', '们', ',', '手', '机', '会', '24', '小', '时', '为', '她', '开', '机', ',', '让', '她', '半', '夜', '痛', '经', '的', '时', '候', ',', '做', '恶', '梦', '的', '时', '候', ',', '随', '时', '可', '以', '联', '系', '到', '我', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '经', '常', '带', '媳', '妇', '出', '去', '玩', ',', 
'她', '不', '一', '定', '要', '和', '我', '所', '有', '的', '哥', '们', '都', '认', '识', ',', '但', '见', '面', '能', '说', '的', '上', '话', '就', '行', '。'], ['我', '是', '一', '爷', '们', ',', '我', '会', '和', '媳', '妇', '的', '姐', '妹', '哥', '们', '搞', '好', '关', '系', ',', '让', '她', '们', '相', '信', '我', '一', '定', '可', '以', '给', '我', '媳', '妇', '幸', '福', '。'], ['我', '是', '一', '爷', '们', ',', '吵', '架', '后', '、', '也', '要', '主', '动', '打', '电', '话', '关', '心', '她', ',', '咱', '是', '一', '爷', '们', ',', '给', '媳', '妇', '服', '个', '软', ',', '道', '个', '歉', '怎', '么', '了', '?'], ['我', '是', '一', '爷', '们', ',', '绝', '对', '不', '会', '嫌', '弃', '自', '己', '媳', '妇', ',', '拿', '她', '和', '别', '人', '比', ',', '说', '她', '这', '不', '如', '人', '家', ',', '纳', '不', '如', '人', '家', '的', '。'], ['我', '是', '一', '爷', '们', ',', '陪', '媳', '妇', '逛', '街', '时', ',', '碰', '见', '熟', '人', ',', '无', '论', '我', '媳', '妇', '长', '的', '好', '看', '与', '否', ',', '我', '都', '会', '大', '方', '的', '介', '绍', '。'], ['谁', '让', '咱', '爷', '们', '就', '好', '这', '口', '呢', '。'], ['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。'], ['【', '我', '们', '重', '在', '分', '享', '。'], ['所', '有', '文', '字', '和', '美', '图', ',', '来', '自', '网', '络', ',', '晨', '欣', '教', '育', '整', '理', '。'], ['对', '原', '文', '作', '者', ',', '表', '示', '敬', '意', '。'], ['】', '关', '注', '晨', '曦', '教', '育', '[UNK]', '[UNK]', '晨', '曦', '教', '育', '(', '微', '信', '号', ':', 'he', '##bc', '##x', '##jy', ')', '。'], ['打', '开', '微', '信', ',', '扫', '描', '二', '维', '码', ',', '关', '注', '[UNK]', '晨', '曦', '教', '育', '[UNK]', ',', '获', '取', '更', '多', '育', '儿', '资', '源', '。'], ['点', '击', '下', '面', '订', '阅', '按', '钮', '订', '阅', ',', '会', '有', '更', '多', '惊', '喜', '哦', '!']] while i < len(document): # 从文档的第一个位置开始,按个往下看 segment = document[i] # segment是列表,代表的是按字分开的一个完整句子,如 segment=['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。'] if FLAGS.non_chinese==False: # if non chinese is False, that means it is chinese, then do something to make 
chinese whole word mask works.
      segment = get_new_segment(segment)  # whole word mask for Chinese: use word segmentation to add "##" where needed
    current_chunk.append(segment)  # add this sentence to the current chunk
    current_length += len(segment)  # total length of the sentences accumulated so far
    if i == len(document) - 1 or current_length >= target_seq_length:  # the accumulated length reached the target, or we are at the end of the document: build A and B of "A[SEP]B"
      if current_chunk:  # the current chunk is not empty
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        a_end = 1
        if len(current_chunk) >= 2:  # the chunk has at least two sentences: take the first part of it as A in "A[SEP]B"
          a_end = rng.randint(1, len(current_chunk) - 1)
        # Assign the selected first part of the current chunk to A, i.e. tokens_a
        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        # Build B of "A[SEP]B" (here B is always the latter part of the current chunk; in the original BERT implementation it was sometimes drawn at random from another document)
        tokens_b = []
        for j in range(a_end, len(current_chunk)):
          tokens_b.extend(current_chunk[j])

        # With 50% probability, swap the positions of tokens_a and tokens_b
        # print("tokens_a length1:",len(tokens_a))
        # print("tokens_b length1:",len(tokens_b)) # len(tokens_b) = 0
        if len(tokens_a) == 0 or len(tokens_b) == 0:
          i += 1
          continue
        if rng.random() < 0.5:  # swap tokens_a and tokens_b
          is_random_next = True
          tokens_a, tokens_b = tokens_b, tokens_a
        else:
          is_random_next = False

        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        # Combine tokens_a and tokens_b BERT-style, as [CLS]tokens_a[SEP]tokens_b[SEP], to form the final tokens; also build segment_ids, which are 0 for the first part and 1 for the second part.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

        # Creates the predictions for the masked LM objective
        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(  # create the training instance object
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []  # clear the current chunk
      current_length = 0  # reset the length of the current chunk
    i += 1  # move on to the next sentence in the document

  return instances


def create_instances_from_document_original(  # THIS IS ORIGINAL BERT STYLE FOR CREATE DATA OF MLM AND NEXT SENTENCE PREDICTION TASK
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]  # get one document

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:  # with some probability (e.g. 10%), use a shorter sequence length to reduce the mismatch between long pre-training sequences and (possibly) short fine-tuning sequences
    target_seq_length = rng.randint(2, max_num_tokens)

  # We DON'T just concatenate all of the tokens from a document into a long
  # sequence and choose an arbitrary split point because this would make the
  # next sentence prediction task too easy. Instead, we split the input into
  # segments "A" and "B" based on the actual "sentences" provided by the user
  # input.
  # Try to use actual sentences rather than arbitrarily truncated text, so that the sentence coherence prediction task is constructed better.
  instances = []
  current_chunk = []  # the text chunk currently being processed; it holds several sentences
  current_length = 0
  i = 0
  # print("###document:",document) # A document can be a whole article, a news story, an encyclopedia entry, etc. It is a list of sentences, each a list of tokens; see the sample document in the comment at the same place in create_instances_from_document_albert above.
  while i < len(document):  # walk through the document from the first sentence, one sentence at a time
    segment = document[i]  # segment is a list holding one complete sentence split into characters, e.g. segment=['我', '是', '一', '爷', '们', ',', '我', '想', '我', '会', '给', '我', '媳', '妇', '最', '好', '的', '幸', '福', '。']
    # print("###i:",i,";segment:",segment)
    current_chunk.append(segment)  # add this sentence to the current chunk
    current_length +=
len(segment)  # total length of the sentences accumulated so far
    if i == len(document) - 1 or current_length >= target_seq_length:  # the accumulated length reached the target, or we are at the end of the document: build A and B of "A[SEP]B"
      if current_chunk:  # the current chunk is not empty
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        a_end = 1
        if len(current_chunk) >= 2:  # the chunk has at least two sentences: take the first part of it as A in "A[SEP]B"
          a_end = rng.randint(1, len(current_chunk) - 1)
        # Assign the selected first part of the current chunk to A, i.e. tokens_a
        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        # Build B of "A[SEP]B" (B is sometimes drawn at random from another document, and is otherwise the latter part of the current chunk)
        tokens_b = []
        # Random next
        is_random_next = False
        if len(current_chunk) == 1 or rng.random() < 0.5:  # 50% of the time, pick another document at random and use its tail as B, i.e. tokens_b
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          # This should rarely go for more than one iteration for large
          # corpora. However, just to be careful, we try to make sure that
          # the random document is not the same as the document
          # we're processing.
          random_document_index = 0
          for _ in range(10):  # randomly pick the index of a document other than the current one
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break

          random_document = all_documents[random_document_index]  # fetch that document
          random_start = rng.randint(0, len(random_document) - 1)  # pick a start sentence in that document
          for j in range(random_start, len(random_document)):  # from the start position to the end of that document, fill B of "A[SEP]B", i.e. tokens_b
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
          # We didn't actually use these segments so we "put them back" so
          # they don't go to waste. (A small trick to avoid wasting text.)
          num_unused_segments = len(current_chunk) - a_end  # e.g. 550-200=350
          i -= num_unused_segments  # i=i-num_unused_segments, e.g. i=400, num_unused_segments=350, then i=i-num_unused_segments=400-350=50
        # Actual next
        else:  # the other 50% of the time, fill tokens_b (the B in "A[SEP]B") from the latter part of the current chunk
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        # Combine tokens_a and tokens_b BERT-style, as [CLS]tokens_a[SEP]tokens_b[SEP], to form the final tokens; also build segment_ids, which are 0 for the first part and 1 for the second part.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

        # Creates the predictions for the masked LM objective
        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(  # create the training instance object
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []  # clear the current chunk
      current_length = 0  # reset the length of the current chunk
    i += 1  # move on to the next sentence in the document

  return instances


MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])


def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""

  cand_indexes = []
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    # Whole Word Masking means that we mask all of the wordpieces
    # corresponding to an original word. When a word has been split into
    # WordPieces, the first token does not have any marker and any subsequent
    # tokens are prefixed with ##.
So whenever we see the ## token, we
    # append it to the previous set of word indexes.
    #
    # Note that Whole Word Masking does *not* change the training code
    # at all -- we still predict each WordPiece independently, softmaxed
    # over the entire vocabulary.
    if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
        token.startswith("##")):
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])

  rng.shuffle(cand_indexes)

  if not FLAGS.non_chinese:  # non_chinese is False, so the text is Chinese: strip the "##" markers that were added earlier
    output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t)) > 0 else t for t in tokens]  # remove "##"
  else:  # English and other non-Chinese languages
    output_tokens = list(tokens)

  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index_set in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    # If adding a whole-word mask would exceed the maximum number of
    # predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict: continue is_any_index_covered = False for index in index_set: if index in covered_indexes: is_any_index_covered = True break if is_any_index_covered: continue for index in index_set: covered_indexes.add(index) masked_token = None # 80% of the time, replace with [MASK] if rng.random() < 0.8: masked_token = "[MASK]" else: # 10% of the time, keep original if rng.random() < 0.5: if FLAGS.non_chinese == False: # non_chinese is False, so the text is Chinese: strip the "##" prefix that was added earlier masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index] # strip "##" else: masked_token = tokens[index] # 10% of the time, replace with random word else: masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] output_tokens[index] = masked_token masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) assert len(masked_lms) <= num_to_predict masked_lms = sorted(masked_lms, key=lambda x: x.index) masked_lm_positions = [] masked_lm_labels = [] for p in masked_lms: masked_lm_positions.append(p.index) masked_lm_labels.append(p.label) # tf.logging.info('%s' % (tokens)) # tf.logging.info('%s' % (output_tokens)) return (output_tokens, masked_lm_positions, masked_lm_labels) def create_masked_lm_predictions_original(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): """Creates the predictions for the masked LM objective.""" cand_indexes = [] for (i, token) in enumerate(tokens): if token == "[CLS]" or token == "[SEP]": continue # Whole Word Masking means that we mask all of the wordpieces # corresponding to an original word. When a word has been split into # WordPieces, the first token does not have any marker and any subsequent # tokens are prefixed with ##. So whenever we see the ## token, we # append it to the previous set of word indexes.
# # Note that Whole Word Masking does *not* change the training code # at all -- we still predict each WordPiece independently, softmaxed # over the entire vocabulary. if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and token.startswith("##")): cand_indexes[-1].append(i) else: cand_indexes.append([i]) rng.shuffle(cand_indexes) output_tokens = list(tokens) num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) masked_lms = [] covered_indexes = set() for index_set in cand_indexes: if len(masked_lms) >= num_to_predict: break # If adding a whole-word mask would exceed the maximum number of # predictions, then just skip this candidate. if len(masked_lms) + len(index_set) > num_to_predict: continue is_any_index_covered = False for index in index_set: if index in covered_indexes: is_any_index_covered = True break if is_any_index_covered: continue for index in index_set: covered_indexes.add(index) masked_token = None # 80% of the time, replace with [MASK] if rng.random() < 0.8: masked_token = "[MASK]" else: # 10% of the time, keep original if rng.random() < 0.5: masked_token = tokens[index] # 10% of the time, replace with random word else: masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] output_tokens[index] = masked_token masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) assert len(masked_lms) <= num_to_predict masked_lms = sorted(masked_lms, key=lambda x: x.index) masked_lm_positions = [] masked_lm_labels = [] for p in masked_lms: masked_lm_positions.append(p.index) masked_lm_labels.append(p.label) return (output_tokens, masked_lm_positions, masked_lm_labels) def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): """Truncates a pair of sequences to a maximum sequence length.""" while True: total_length = len(tokens_a) + len(tokens_b) if total_length <= max_num_tokens: break trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b assert len(trunc_tokens) >= 1 # We 
want to sometimes truncate from the front and sometimes from the # back to add more randomness and avoid biases. if rng.random() < 0.5: del trunc_tokens[0] else: trunc_tokens.pop() def main(_): tf.logging.set_verbosity(tf.logging.INFO) tokenizer = tokenization.FullTokenizer( vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) input_files = [] for input_pattern in FLAGS.input_file.split(","): input_files.extend(tf.gfile.Glob(input_pattern)) tf.logging.info("*** Reading from input files ***") for input_file in input_files: tf.logging.info(" %s", input_file) rng = random.Random(FLAGS.random_seed) instances = create_training_instances( input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, rng) output_files = FLAGS.output_file.split(",") tf.logging.info("*** Writing to output files ***") for output_file in output_files: tf.logging.info(" %s", output_file) write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length, FLAGS.max_predictions_per_seq, output_files) if __name__ == "__main__": flags.mark_flag_as_required("input_file") flags.mark_flag_as_required("output_file") flags.mark_flag_as_required("vocab_file") tf.app.run() ================================================ FILE: create_pretraining_data_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
# Lint as: python2, python3 # coding=utf-8 """Create masked LM/next sentence masked_lm TF examples for ALBERT.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import random import numpy as np import six from six.moves import range from six.moves import zip import tensorflow as tf from albert import tokenization flags = tf.flags FLAGS = flags.FLAGS flags.DEFINE_string("input_file", None, "Input raw text file (or comma-separated list of files).") flags.DEFINE_string( "output_file", None, "Output TF example file (or comma-separated list of files).") flags.DEFINE_string( "vocab_file", None, "The vocabulary file that the ALBERT model was trained on.") flags.DEFINE_string("spm_model_file", None, "The model file for sentence piece tokenization.") flags.DEFINE_bool( "do_lower_case", True, "Whether to lower case the input text. Should be True for uncased " "models and False for cased models.") flags.DEFINE_bool( "do_whole_word_mask", True, "Whether to use whole word masking rather than per-WordPiece masking.") flags.DEFINE_bool( "do_permutation", False, "Whether to do the permutation training.") flags.DEFINE_bool( "favor_shorter_ngram", False, "Whether to set higher probabilities for sampling shorter ngrams.") flags.DEFINE_bool( "random_next_sentence", False, "Whether to use the sentence that's right before the current sentence " "as the negative sample for next sentence prediction, rather than using " "sentences from other random documents.") flags.DEFINE_integer("max_seq_length", 512, "Maximum sequence length.") flags.DEFINE_integer("ngram", 3, "Maximum number of ngrams to mask.") flags.DEFINE_integer("max_predictions_per_seq", 20, "Maximum number of masked LM predictions per sequence.") flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.") flags.DEFINE_integer( "dupe_factor", 10, "Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.") flags.DEFINE_float( "short_seq_prob", 0.1, "Probability of creating sequences which are shorter than the " "maximum length.") class TrainingInstance(object): """A single training instance (sentence pair).""" def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next, token_boundary): self.tokens = tokens self.segment_ids = segment_ids self.is_random_next = is_random_next self.token_boundary = token_boundary self.masked_lm_positions = masked_lm_positions self.masked_lm_labels = masked_lm_labels def __str__(self): s = "" s += "tokens: %s\n" % (" ".join( [tokenization.printable_text(x) for x in self.tokens])) s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids])) s += "token_boundary: %s\n" % (" ".join( [str(x) for x in self.token_boundary])) s += "is_random_next: %s\n" % self.is_random_next s += "masked_lm_positions: %s\n" % (" ".join( [str(x) for x in self.masked_lm_positions])) s += "masked_lm_labels: %s\n" % (" ".join( [tokenization.printable_text(x) for x in self.masked_lm_labels])) s += "\n" return s def __repr__(self): return self.__str__() def write_instance_to_example_files(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_files): """Create TF example files from `TrainingInstance`s.""" writers = [] for output_file in output_files: writers.append(tf.python_io.TFRecordWriter(output_file)) writer_index = 0 total_written = 0 for (inst_index, instance) in enumerate(instances): input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) input_mask = [1] * len(input_ids) segment_ids = list(instance.segment_ids) token_boundary = list(instance.token_boundary) assert len(input_ids) <= max_seq_length while len(input_ids) < max_seq_length: input_ids.append(0) input_mask.append(0) segment_ids.append(0) token_boundary.append(0) assert len(input_ids) == max_seq_length assert len(input_mask) == max_seq_length assert len(segment_ids) 
== max_seq_length masked_lm_positions = list(instance.masked_lm_positions) masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) masked_lm_weights = [1.0] * len(masked_lm_ids) multiplier = 1 + int(FLAGS.do_permutation) while len(masked_lm_positions) < max_predictions_per_seq * multiplier: masked_lm_positions.append(0) masked_lm_ids.append(0) masked_lm_weights.append(0.0) sentence_order_label = 1 if instance.is_random_next else 0 features = collections.OrderedDict() features["input_ids"] = create_int_feature(input_ids) features["input_mask"] = create_int_feature(input_mask) features["segment_ids"] = create_int_feature(segment_ids) features["token_boundary"] = create_int_feature(token_boundary) features["masked_lm_positions"] = create_int_feature(masked_lm_positions) features["masked_lm_ids"] = create_int_feature(masked_lm_ids) features["masked_lm_weights"] = create_float_feature(masked_lm_weights) # Note: We keep this feature name `next_sentence_labels` to be compatible # with the original data created by lanzhzh@. However, in the ALBERT case # it does contain sentence_order_label. 
features["next_sentence_labels"] = create_int_feature( [sentence_order_label]) tf_example = tf.train.Example(features=tf.train.Features(feature=features)) writers[writer_index].write(tf_example.SerializeToString()) writer_index = (writer_index + 1) % len(writers) total_written += 1 if inst_index < 6: tf.logging.info("*** Example ***") tf.logging.info("tokens: %s" % " ".join( [tokenization.printable_text(x) for x in instance.tokens])) for feature_name in features.keys(): feature = features[feature_name] values = [] if feature.int64_list.value: values = feature.int64_list.value elif feature.float_list.value: values = feature.float_list.value tf.logging.info( "%s: %s" % (feature_name, " ".join([str(x) for x in values]))) for writer in writers: writer.close() tf.logging.info("Wrote %d total instances", total_written) def create_int_feature(values): feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) return feature def create_float_feature(values): feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) return feature def create_training_instances(input_files, tokenizer, max_seq_length, dupe_factor, short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng): """Create `TrainingInstance`s from raw text.""" all_documents = [[]] # Input file format: # (1) One sentence per line. These should ideally be actual sentences, not # entire paragraphs or arbitrary spans of text. (Because we use the # sentence boundaries for the "next sentence prediction" task). # (2) Blank lines between documents. Document boundaries are needed so # that the "next sentence prediction" task doesn't span between documents. 
for input_file in input_files: with tf.gfile.GFile(input_file, "r") as reader: while True: line = reader.readline() if not FLAGS.spm_model_file: line = tokenization.convert_to_unicode(line) if not line: break if FLAGS.spm_model_file: line = tokenization.preprocess_text(line, lower=FLAGS.do_lower_case) else: line = line.strip() # Empty lines are used as document delimiters if not line: all_documents.append([]) tokens = tokenizer.tokenize(line) if tokens: all_documents[-1].append(tokens) # Remove empty documents all_documents = [x for x in all_documents if x] rng.shuffle(all_documents) vocab_words = list(tokenizer.vocab.keys()) instances = [] for _ in range(dupe_factor): for document_index in range(len(all_documents)): instances.extend( create_instances_from_document( all_documents, document_index, max_seq_length, short_seq_prob, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)) rng.shuffle(instances) return instances def create_instances_from_document( all_documents, document_index, max_seq_length, short_seq_prob, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): """Creates `TrainingInstance`s for a single document.""" document = all_documents[document_index] # Account for [CLS], [SEP], [SEP] max_num_tokens = max_seq_length - 3 # We *usually* want to fill up the entire sequence since we are padding # to `max_seq_length` anyways, so short sequences are generally wasted # computation. However, we *sometimes* # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter # sequences to minimize the mismatch between pre-training and fine-tuning. # The `target_seq_length` is just a rough target however, whereas # `max_seq_length` is a hard limit. 
target_seq_length = max_num_tokens if rng.random() < short_seq_prob: target_seq_length = rng.randint(2, max_num_tokens) # We DON'T just concatenate all of the tokens from a document into a long # sequence and choose an arbitrary split point because this would make the # next sentence prediction task too easy. Instead, we split the input into # segments "A" and "B" based on the actual "sentences" provided by the user # input. instances = [] current_chunk = [] current_length = 0 i = 0 while i < len(document): segment = document[i] current_chunk.append(segment) current_length += len(segment) if i == len(document) - 1 or current_length >= target_seq_length: if current_chunk: # `a_end` is how many segments from `current_chunk` go into the `A` # (first) sentence. a_end = 1 if len(current_chunk) >= 2: a_end = rng.randint(1, len(current_chunk) - 1) tokens_a = [] for j in range(a_end): tokens_a.extend(current_chunk[j]) tokens_b = [] # Random next is_random_next = False if len(current_chunk) == 1 or \ (FLAGS.random_next_sentence and rng.random() < 0.5): is_random_next = True target_b_length = target_seq_length - len(tokens_a) # This should rarely go for more than one iteration for large # corpora. However, just to be careful, we try to make sure that # the random document is not the same as the document # we're processing. for _ in range(10): random_document_index = rng.randint(0, len(all_documents) - 1) if random_document_index != document_index: break random_document = all_documents[random_document_index] random_start = rng.randint(0, len(random_document) - 1) for j in range(random_start, len(random_document)): tokens_b.extend(random_document[j]) if len(tokens_b) >= target_b_length: break # We didn't actually use these segments so we "put them back" so # they don't go to waste. 
num_unused_segments = len(current_chunk) - a_end i -= num_unused_segments elif not FLAGS.random_next_sentence and rng.random() < 0.5: is_random_next = True for j in range(a_end, len(current_chunk)): tokens_b.extend(current_chunk[j]) # Note(mingdachen): in this case, we just swap tokens_a and tokens_b tokens_a, tokens_b = tokens_b, tokens_a # Actual next else: is_random_next = False for j in range(a_end, len(current_chunk)): tokens_b.extend(current_chunk[j]) truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) assert len(tokens_a) >= 1 assert len(tokens_b) >= 1 tokens = [] segment_ids = [] tokens.append("[CLS]") segment_ids.append(0) for token in tokens_a: tokens.append(token) segment_ids.append(0) tokens.append("[SEP]") segment_ids.append(0) for token in tokens_b: tokens.append(token) segment_ids.append(1) tokens.append("[SEP]") segment_ids.append(1) (tokens, masked_lm_positions, masked_lm_labels, token_boundary) = create_masked_lm_predictions( tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng) instance = TrainingInstance( tokens=tokens, segment_ids=segment_ids, is_random_next=is_random_next, token_boundary=token_boundary, masked_lm_positions=masked_lm_positions, masked_lm_labels=masked_lm_labels) instances.append(instance) current_chunk = [] current_length = 0 i += 1 return instances MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) def _is_start_piece_sp(piece): """Check if the current word piece is the starting piece (sentence piece).""" special_pieces = set(list('!"#$%&\"()*+,-./:;?@[\\]^_`{|}~')) special_pieces.add(u"€".encode("utf-8")) special_pieces.add(u"£".encode("utf-8")) # Note(mingdachen): # For foreign characters, we always treat them as a whole piece. 
english_chars = set(list("abcdefghijklmnopqrstuvwxyz")) if (six.ensure_str(piece).startswith("▁") or six.ensure_str(piece).startswith("<") or piece in special_pieces or not all([i.lower() in english_chars.union(special_pieces) for i in piece])): return True else: return False def _is_start_piece_bert(piece): """Check if the current word piece is the starting piece (BERT).""" # When a word has been split into # WordPieces, the first token does not have any marker and any subsequent # tokens are prefixed with ##. So whenever we see the ## token, we # append it to the previous set of word indexes. return not six.ensure_str(piece).startswith("##") def is_start_piece(piece): if FLAGS.spm_model_file: return _is_start_piece_sp(piece) else: return _is_start_piece_bert(piece) def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): """Creates the predictions for the masked LM objective.""" cand_indexes = [] # Note(mingdachen): We create a list for recording if the piece is # the starting piece of current token, where 1 means true, so that # on-the-fly whole word masking is possible. token_boundary = [0] * len(tokens) for (i, token) in enumerate(tokens): if token == "[CLS]" or token == "[SEP]": token_boundary[i] = 1 continue # Whole Word Masking means that we mask all of the wordpieces # corresponding to an original word. # # Note that Whole Word Masking does *not* change the training code # at all -- we still predict each WordPiece independently, softmaxed # over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and not is_start_piece(token)): cand_indexes[-1].append(i) else: cand_indexes.append([i]) if is_start_piece(token): token_boundary[i] = 1 output_tokens = list(tokens) masked_lm_positions = [] masked_lm_labels = [] if masked_lm_prob == 0: return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary) num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) # Note(mingdachen): # The probabilities below are proportional to 1/n, so before the optional # reversal shorter ngram sequences receive the higher probability. ngrams = np.arange(1, FLAGS.ngram + 1, dtype=np.int64) pvals = 1. / np.arange(1, FLAGS.ngram + 1) pvals /= pvals.sum(keepdims=True) if FLAGS.favor_shorter_ngram: pvals = pvals[::-1] ngram_indexes = [] for idx in range(len(cand_indexes)): ngram_index = [] for n in ngrams: ngram_index.append(cand_indexes[idx:idx+n]) ngram_indexes.append(ngram_index) rng.shuffle(ngram_indexes) masked_lms = [] covered_indexes = set() for cand_index_set in ngram_indexes: if len(masked_lms) >= num_to_predict: break if not cand_index_set: continue # Note(mingdachen): # Skip the current piece if it is covered in lm masking or previous ngrams. for index_set in cand_index_set[0]: for index in index_set: if index in covered_indexes: continue n = np.random.choice(ngrams[:len(cand_index_set)], p=pvals[:len(cand_index_set)] / pvals[:len(cand_index_set)].sum(keepdims=True)) index_set = sum(cand_index_set[n - 1], []) n -= 1 # Note(mingdachen): # Repeatedly looking for a candidate that does not exceed the # maximum number of predictions by trying shorter ngrams. while len(masked_lms) + len(index_set) > num_to_predict: if n == 0: break index_set = sum(cand_index_set[n - 1], []) n -= 1 # If adding a whole-word mask would exceed the maximum number of # predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict: continue is_any_index_covered = False for index in index_set: if index in covered_indexes: is_any_index_covered = True break if is_any_index_covered: continue for index in index_set: covered_indexes.add(index) masked_token = None # 80% of the time, replace with [MASK] if rng.random() < 0.8: masked_token = "[MASK]" else: # 10% of the time, keep original if rng.random() < 0.5: masked_token = tokens[index] # 10% of the time, replace with random word else: masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] output_tokens[index] = masked_token masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) assert len(masked_lms) <= num_to_predict rng.shuffle(ngram_indexes) select_indexes = set() if FLAGS.do_permutation: for cand_index_set in ngram_indexes: if len(select_indexes) >= num_to_predict: break if not cand_index_set: continue # Note(mingdachen): # Skip current piece if they are covered in lm masking or previous ngrams. for index_set in cand_index_set[0]: for index in index_set: if index in covered_indexes or index in select_indexes: continue n = np.random.choice(ngrams[:len(cand_index_set)], p=pvals[:len(cand_index_set)] / pvals[:len(cand_index_set)].sum(keepdims=True)) index_set = sum(cand_index_set[n - 1], []) n -= 1 while len(select_indexes) + len(index_set) > num_to_predict: if n == 0: break index_set = sum(cand_index_set[n - 1], []) n -= 1 # If adding a whole-word mask would exceed the maximum number of # predictions, then just skip this candidate. 
if len(select_indexes) + len(index_set) > num_to_predict: continue is_any_index_covered = False for index in index_set: if index in covered_indexes or index in select_indexes: is_any_index_covered = True break if is_any_index_covered: continue for index in index_set: select_indexes.add(index) assert len(select_indexes) <= num_to_predict select_indexes = sorted(select_indexes) permute_indexes = list(select_indexes) rng.shuffle(permute_indexes) orig_token = list(output_tokens) for src_i, tgt_i in zip(select_indexes, permute_indexes): output_tokens[src_i] = orig_token[tgt_i] masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i])) masked_lms = sorted(masked_lms, key=lambda x: x.index) for p in masked_lms: masked_lm_positions.append(p.index) masked_lm_labels.append(p.label) return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary) def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): """Truncates a pair of sequences to a maximum sequence length.""" while True: total_length = len(tokens_a) + len(tokens_b) if total_length <= max_num_tokens: break trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b assert len(trunc_tokens) >= 1 # We want to sometimes truncate from the front and sometimes from the # back to add more randomness and avoid biases. 
if rng.random() < 0.5: del trunc_tokens[0] else: trunc_tokens.pop() def main(_): tf.logging.set_verbosity(tf.logging.INFO) tokenizer = tokenization.FullTokenizer( vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case, spm_model_file=FLAGS.spm_model_file) input_files = [] for input_pattern in FLAGS.input_file.split(","): input_files.extend(tf.gfile.Glob(input_pattern)) tf.logging.info("*** Reading from input files ***") for input_file in input_files: tf.logging.info(" %s", input_file) rng = random.Random(FLAGS.random_seed) instances = create_training_instances( input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, rng) tf.logging.info("number of instances: %i", len(instances)) output_files = FLAGS.output_file.split(",") tf.logging.info("*** Writing to output files ***") for output_file in output_files: tf.logging.info(" %s", output_file) write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length, FLAGS.max_predictions_per_seq, output_files) if __name__ == "__main__": flags.mark_flag_as_required("input_file") flags.mark_flag_as_required("output_file") flags.mark_flag_as_required("vocab_file") tf.app.run() ================================================ FILE: data/news_zh_1.txt ================================================ 最后的南京老城该往何处去 城市化时代呼唤文化自觉 【概要】80后学者姚远出版《城市的自觉》一书 姚远出版《城市的自觉》 作者简介姚远,政治学博士,1981年出生于南京,1999年从金陵中学毕业后考入北京大学国际关系学院,负笈燕园十二载,获政治学博士学位。 现任教于南京大学政府管理学院。 在关系古都北京、南京等历史文化名城存废的历史关头,他锲而不舍地为抢救中华文明奔走呐喊。 2010年,他被中国文物保护基金会评为“中国文化遗产保护年度十大杰出人物”,当时的获奖评语是:一支?土耳其诗人纳齐姆·希克梅特曾深情地说:“人的一生有两样东西不会忘记,那就是母亲的面孔和城市的面貌。 ”然而,前不久南京再次发生颜料坊地块市级文保单位两进建筑被毁的事件。 故宫博物院院长、原国家文物局局长单霁翔近日在宁直言,南京城南再遭损毁令他心痛。 南京老城“路在何方”? 
2010年被中国文物保护基金会评为“中国文化遗产保护年度十大杰出人物”的80后学者、南京大学姚远老师所著的《城市的自觉》近日正式出版。 书中探索古城保护与复兴的建设性路径,值得南京的决策者们在颜料坊事件后再次深思。 江南时报记者黄勇疑问:城市化,是否迷失了文化自觉“目睹一座座古建筑的消失,行走在古城的废墟,想到梁思成说过的‘拆掉北京的一座城楼,就像割掉我的一块肉;扒掉北京的一段城墙,就像扒掉我的一层皮’,真是感同身受,我流泪了。 ”这是姚远最让记者为之动容的一句话,也是《城市的自觉》一书中的“魂”。 包括南京在内,中国大多数城市正处于大拆除的时代,成片的历史街区在“旧城改造”的大旗下被不断夷为平地。 有专家称,这场“休克疗法式”的“改造”,对中华文脉的影响之深、之巨、之不可逆,堪称中国城市史上“三千年未有之大变局”。 《城市的自觉》正是在这种背景下,由北京大学出版社于近日出版的。 书中,姚远以情理交融的文字,辅之以背景、南京古城珍贵的最后影像,如实记录了在北京梁思成故居和宣南、东四八条、钟鼓楼等历史街区,南京颜料坊、南捕厅、门东、门西等历史街区的最后时刻,为阻挡推土机而屡败屡战的历程。 同时,又理性剖析了与存续城市记忆密切相关的文化自觉、物权保护、民生改善、公众参与等议题,探索古城保护与复兴的建设性路径。 为何要保老城? 很多人认为陈旧的老街区、老房子应该为摩天大楼让位,造高速路、摩天楼是现代化,“保护老古董”是抱残守缺,姚远却不是这种看法:“一些决策者并不知城市遗产保护恰恰是‘后工业’、‘后现代’的思想,比前者的理念差不多领先了一个世纪。 ” 在他眼里,南京这座千年古城曾是“活”着的,老城里有最纯正的方言、最鲜活的民俗、最地道的小吃,简直是一座巨大的民俗博物馆。 “你可以在同老者的交谈中,听到一个个家族或老宅的兴衰故事。 这里的城与人,就是一本厚重的大书,它们用最生动的语言向你讲述不一样的‘城南旧事’。 ”面对许多古城不断遭到大拆大建、拆真建假、拆旧建新的厄运,姚远痛心地说,“我们的城市化,是否迷失了自我认同,是否失去了文化自觉的能力? 在城市化的文化自觉重建之前,我们还将继续付出多少代价? ”现状:老城南仅剩不到1平方公里南京城曾有十九个别称,如秦淮、白下、建邺、江宁等,建城史更是长达两千五百年。 但如今,除去明城墙以及一些重点文物以及七零八落的民国建筑之外,这个城市跟中国其他的城市看上去并无太多区别,鳞次栉比的高楼大厦,车水马龙的宽阔街道,川流不息的红男绿女……持续多年的旧城改造,已经让南京老城日益失去古朴的历史风貌。 秦淮河畔的老城南,是南京文化的发源地,是南京的根。 在2006年前,尽管南京诸多的“殿、庙、塔、桥”已在兵火和变乱中消失,但秦淮河畔的老城南依然保存了文物丰富、风貌完整的历史街区。 然而,2006年,南京风云突起,突击对颜料坊、安品街等历史街区实施“危旧房改造”,拆毁大量文物建筑。 2009年又是一轮“危改”,大大的“拆”字,再次涂上了门东、门西、南捕厅等多片老街区。 2010年至今,南京先后出台了《南京市历史文化名城保护条例》《南京历史文化名城保护规划》《南京老城南历史城区保护规划与城市设计》,以法规的高度,回应了社会各界的诉求,明确要求对老城的整体保护。 姚远和其他学者联名提出的建议,有40处被采纳进了最后的《条例》中。 姚远告诉江南时报记者,南京的传统旧城区——老城南仅剩不到1平方公里,尚不及50平方公里老城总面积的2%,整体保护势在必行。 但他并不认为整体保护意味着“冻结不动”,而是强调古民居、古街巷和宏伟的古建筑一样重要,它们是古都特有的城市肌理,低矮的民居衬托高大的城阙,形成轮廓丰富的城市格局。 如果消灭了它们,名胜古迹就变成无法交融联络的“孤岛”,古都的整体风貌则无从谈起。 “对于金陵古城濒危的最后这点种子,实行‘整体保护’已经没有任何讨价还价的余地。 ”《城市的自觉》一书中,姚远的声音振聋发聩。 方案:探索保护与整治的最大合力可惜的是,在专家学者与推土机的拉锯战中,前者基本还是处于下风的,即便是中央领导的几次批示,旧城改造的推土机依然我行我素,将一面面古墙碾在轮下。 颜料坊、牛市、门东等被“肢解”的老城南片区,如今多已竖起或正在建设房地产开发、商业项目。 2002年8月,姚远在南京颜料坊开始了古城保护的第一次拍摄。 如今牛市64号-颜料坊49号这座百年清代建筑却再遭破坏。 单霁翔近日在南大演讲中也表示,颜料坊再遭损毁令人心痛。 “我不认同南京老城南成片拆除,搬迁当地住户的改造方式。 简单地认为它的居住形式落后了,这种态度是消极的,没有给予作为代表地域特色的传统建筑的居住形式有尊严的呵护。 ”《城市的自觉》一书中也多次提及南京老城不能“只见物,不见人”。 
姚远强调,南京历史文化名城的保护,离不开对传统社区的活态保护。 老城南有丰富的民俗和古老的街区,是唇齿相依的一个整体。 拆去了老宅,迁走了居民,文化自然就成了无源之水、无本之木。 “国际上的成功经验表明,保护从来不是发展、民生、现代化的反义词。 ”姚远建议,老城区的整治,可以在政府的指导和协助下,以居民为主体,通过社区互助的“自我修缮”的方式来实施,将“旧城区改建”从拆迁模式下的行政关系转变为修缮模式下的民事关系,最大限度地调动各方面的积极性,形成保护与整治的最大合力。 措施:用行动让法律“站起来”经历了两次保卫战,姚远对于文物保护方面的法律条文早已如数家珍。 在他看来,“法治”和“参与”这两个关键词尤为重要。 姚远认为,政府的很多失误是因为政策制定的封闭性,推土机开到门口时才告知公众。 公民参与,就要求行政更加透明、公开。 “几次保护后制定的政策或者法律法规,也很重要。 因为未来只要有人参与去触动,政策或者法律法规就能‘站起来’,变成一套强有力的程序,约束政府行为。 ”“这些年古城保护的每一点进步,都离不开广泛的公众参与,都凝结着社会各界共同的努力。 ”姚远认为,在北京、南京等许多古城,一批志愿者、社会人士和民间团体,在古城命运的危急关头,已经显示出日益崛起的公众参与的巨大力量。 “关键要有人能够站出来。 第一个人站出来,就会有第二个人跟上,专家和媒体也会介入,事情就能在公开博弈中得到较为合理的解决。 我国目前民间的文保力量正在逐渐成长,公民参与将成为构建良性社会机制的重要力量。 ”姚远强调。 单霁翔对文化遗产保护中的公众参与也做出了高度评价。 他在《城市的自觉》的序中写道:“保护文化遗产绝不仅仅是各级政府和文物工作者的专利,只有广大民众真心地、持久地参与文化遗产保护,文化遗产才能得到最可靠的保障。 以姚远博士为代表的一批志愿者和社会人士,在我国文化遗产保护事业中已经显示出不可低估、无可替代的力量。 不是每一块石头,都能叫珠宝 对于很多人来说,矿石是长成这样的石头: 上图:铁矿石 上图:石 上图:煤矿石 上图:锡矿石如你所想象的那样,很多矿石都是又黑又丑,即使在野外遇到,也不会多看一眼的那种石头。 当然,也不是所有矿石都这么丑。 我们再看看下面这些矿石: 上图:赤铜 上图:钼铅矿 上图:方硼石 上图:自然硫 上图:云母这些矿石,能否让你感慨大自然的造化神奇?小伙伴们可能会想,这些漂亮的矿石,打磨以后就是漂亮的宝石啊,为什么我们不把他们加工成珠宝呢?这个是个好问题。 人类自古以来就没有停止过对美好事物的追求,凡漂亮的东西都可能被人们看上,成为制作饰品原料。 珠宝就是大自然赐予的美好的东西中的一种。 珠宝如果不美就不能成为珠宝,这种美或表现为绚丽的颜色,或表现为透明而洁净。 物以稀为贵,鸽血红级别的红宝石、矢车菊蓝级别的蓝宝石,每克拉价值上万美元,而某些颇美丽又可耐久的宝石(如白水晶),由于产量较多,开采较容易,其价格一直较低。 so,大家能明白了吧,不是每一块石头都能成为珠宝。 如果拥有珠宝,请务必珍惜。 目前1000+人已关注加入我们您看此文用· 秒,转发只需1秒呦~ 北京市黄埔同学会接待“踏寻中山足迹学习之旅”台湾参访团 光明网讯(通讯员苏民军记者任生心)日前,由台湾中国统一联盟桃竹分会成员组成的“踏寻中山足迹学习之旅”参访团一行21人来到北京参观访问。 在北京市黄埔同学会的精心安排下,在京期间,参访团拜谒了中山先生衣冠冢,参观了卢沟桥、抗战纪念馆、抗战名将纪念馆和宋庆龄故居等;“踏寻中山足迹学习之旅”参访团还将赴南京中山堂等地参访。 在抗战纪念馆,参访团成员们认真聆听讲解员的介绍,仔细观看每张图片资料,回顾国共两党团结抗战的往事,缅怀那些为民族独立而壮烈牺牲的英雄。 而后,参访团一行来到位于京西香山深处的孙中山先生衣冠冢拜谒,参访团团长李尚贤(台湾中国统一联盟总会第一副主席兼秘书长)发表了简短的感言后,全体成员在孙中山雕像前三鞠躬,向孙中山先生致敬,缅怀孙中山先生以“三民主义”为宗旨的革命的一生。 随后,参访团一行又来到2009年建成的北京香麓园抗战名将纪念馆,瞻仰了佟麟阁将军墓,他们还参观了宋庆龄故居。 鼎丰(08056.HK)向客户借出5000万人币 月息1.75厘 为期一年 鼎丰集团控股(08056.HK)+0.030(+1.345%)公布,同意将一笔5000万元人民币的款项委托予贷款银行,以供转借予客户,贷款期为十二个月,月息1.75厘。 (报价延迟最少十五分钟。 在青岛不买房,居然能拥有这么多东西! 
这段时间青岛房价扶摇直上闹得人心惶惶这不,青岛房市,又在国庆节火了一把 国庆5天内16城启动楼市限购一时之间楼市风云大转纵观9月份青岛一手房均价怎么也有一万三四了看完十三哥默默地回去工作了 按照一套房子100平米计算购买一套房子大概需要130万在青岛,买一套房子怎么也得需要130万如果这些钱不买房能在全世界各地买什么呢? 今天,小编就带大家(bai)感(ri)受(meng)一下在西班牙能买3.4个村庄 一位英国人,名叫尼尔·克里斯蒂,在西班牙农村西北部一个田园地区买下了一处村庄(阿鲁纳达),只花费了4.5万欧元(约合35.6万人民币)。 简直便宜到吐血,这点钱要是在青岛的豪宅区,恐怕厕所都买不了。 如果选的地方靠近旅游景区,稍微装修一下,变成一个度假村……妥妥的壕啊,画面太美,不敢想象……在爱尔兰差不多能买个小岛 Inishdooney岛,位于北爱尔兰西北部,售价14万英镑(约合139万人民币)。 约38万平方米的无人居住地有淡水池塘、天然溶洞和鹅卵石海滩,美翻了有木有! 一个小岛的钱,和青岛一个水泥格子的价格差不多。 不要拦着最懂妹,我要去爱尔兰做岛主! 在巴厘岛能买2座别墅 巴厘岛,蓝天、碧水、白云,美的像梦一样,而你知道吗,这座世界著名旅游岛一个小镇的别墅只要10.7万美元,也就是不到70万人民币,青岛买房那点钱都够买两栋别墅了。 在巴厘岛拥有两座别墅是什么概念? 发完文章小编就去买机票! 在美国能买1驾小飞机 美国塞斯纳C172R型,最大航程可达1270公里,飞机上具备GPS导航定位系统、自动驾驶、盲降设备等,价格大概在17万美元左右,也就是104万人民币。 在青岛买房的钱妥妥的够买一架飞机了。 直接移民去西班牙 一个以阳光和沙滩吸引着无数游客的国家,有着激情的足球和斗牛文化、独特的海鲜美食、发达的时装行业、热情火辣的西班牙女郎...... 直接去西班牙? 你以为我在搞笑? 西班牙有个买房移民的政策,在西班牙的指定区域购买当地售价在170万人民币以上的房产就可以办理多次往返签证了,然后你待够10年,就可以入西班牙国籍了。 买一大堆LV手袋 十三哥相信很多女孩应该都很喜欢LV手袋。 这款极具魅力的CHAIN LOUISE手袋价格为2.04万人民币。 随随便便买一堆! 带着爱人环游世界 微博上那对香港80后小夫妻历时308天花费16万人民币走遍了37国,你们还记得吗? 按照他们的行程,你几乎就能去环游世界了。 什么也不用想,痛痛快快环游地球一圈! 在澳大利亚当农场主 五卧室、三浴室的大房子,还有德尼利昆镇附近一块27英亩的农场。 只需要美元价格14.4万美元(≈96万人民币),是不是惊呆了! 哦,对了,澳大利亚还提供住房贷款业务哟! 十三哥要挣钱去澳大利亚买牧场! 在莫斯科买下1座别墅 莫斯科市中心双卧室、双浴室的豪华大别墅,你觉得多少钱? 千万别吃惊,美元价格在15.2万美元左右(≈100.1万人民币)。 虽然在这个城市生活总会有各种各样的压力我们必须十分努力才能看起来毫不费力但是我们永远保持一颗向上的心不气馁,好好加油! [海尔地产世纪公馆]新都心2期升级新品9月底推出 海尔地产世纪公馆二期规划8栋高层住宅,预计9月底推出,认筹中,交2.5万享99折优惠,预计均价17000-18000元/平。 户型面积区间89-162平,主力120-140平品质改善产品。 125-126平为套三,142-162平为套四。 海尔地产世纪公馆一户一价,以上价格仅供参考,所有在售户型价格以售楼处公布为准。 咨询电话:400-099-0099 转 27724[金隅和府]3大商圈环绕地铁房18000元 金隅和府一户一价,以下价格仅供参考,所有在售户型价格以售楼处公布为准。 金隅和府预计9月20日加推6#楼(24F)楼王,3个单元,1梯2户,户型面积为90平套二,122平、138平套三,团购交1万团购金、10万认筹金可以享受97折优惠,预计均价18000-26000元/平。 金隅和府位于镇江路12号,近邻山东路、延吉路、东西快速路等三横三纵交通网、未来享地铁M5之便利;CBD商圈、香港路商圈、台东商圈3大商圈环绕,居住生活便利。 直播拐点来临:未来直播APP开发还有哪些趋势? 
趋势一:巨头收割直播价值,依赖巨头扶持的直播平台存活几率更高尽管一线垂直领域已经被巨头的直播平台占领,但创业者依然还有机会。 未来在泛娱乐社交、游戏、美妆电商等核心领域必然会有几家直播平台具有突出优势,而这些具备突出优势的直播平台很可能会被BAT入股收购或者收编,因此如果能够获得巨头的资本输血与流量扶持,往往存活的几率会更大。 趋势二:直播平台从争抢网红到争抢明星资源明星+粉丝经济+直播平台,很可能会衍生出新型的整合营销方式。 即怎样通过可购买价值的内容设定,运营好与粉丝之间的感情沟通,让粉丝群体进行持续性参与并进行情感消费投入,直播平台与明星组合叠加的人气效应与非理性消费的频次也非常契合品牌商的需求。 因此,直播的未来趋势将从争抢网红资源到争抢明星资源。 这是直播平台孕育粉丝经济进而带来新型的情感消费与商业模式的要走的一条必要的路径。 而未来可能会有越来越多的品牌商更愿意尝试这种直播互动带来的品牌曝光机会与商业变现模式。 趋势三:从泛娱乐明星网红直播转入到二级垂直细分市场的专业直播泛娱乐直播内容属性上由于其单一、无聊的直播内容无法构成平台的核心竞争力,直播平台未来大趋势是从泛娱乐直播转入到内涵直播。 目前部分视频直播平台已针对财经、育儿、时尚、体育、美食等垂直领域的自频道开放直播权限,内容的差异化与垂直化可以为直播平台带来新的商业模式,平台也可以通过优质的直播内容,产生付费、会员、打赏以及直播购物等盈利模式。 因为目前缺乏真正有价值的直播,多数直播平台在内容供给侧是存在问题的,网红要提升自身与粉丝之间的黏性,显然需要差异化的内容,而从目前的欧美网红与直播内容的发展规律来看,更健康、更有价值与内涵的直播内容成为未来的发展趋势之一。 趋势四:网红孵化器批量生产网红 将走向专业化由于在网红包装、传播、变现等方面具备专业的运营能力,网红孵化器未来须具备 “经纪人+代运营+供应链+网红星探”等多重角色,向专业网红群聚捆绑者向提供专业化的服务与垂直领域专家型、特长型、个性型网红培养者与发现者这一定位转型。 借助在用户洞察、网红运营、电商管理方面的精良团队,需要打通粉丝营销和电商运营,并将网红、粉丝,平台、内容,品牌、供应链,进行有效链接及整合。 趋势五:C端直播洗牌 B端企业直播崛起带动专业的商务直播需求目前,各种企业的商务发布会、沙龙、座谈、讲座、渠道大会、教育培训等方面直播需求强烈,在企业进行移动视频直播的需求推动下,它们开始寻求低成本、快速的搭建属于自己的高清视频直播平台的模式,而企业搭建视频直播平台需要专业的技术能力的服务商来应对这种需求。 用户可以通过微信直接观看企业直播参与互动,让直播突破空间场地的限制,某种程度也代表直播产业链的一个接入的发展方向。 趋势六:解决直播用户体验与新媒体营销,移动直播服务商将迎来新的机会直播行业进入了各行各业均可参与,并将直播作为企业服务工具的直播+时代,而玩转直播+,从技术、营销、服务、内容,进而可以衍生出更多的直播服务盈利。 而对于解决直播体验背后的移动直播服务商,也将迎来新的机会。 趋势七:直播或成为企业的标配,可能为企业带来更多转化率当直播火爆起来的时候,人们要关注的不仅仅是行业能火爆多久,它的商业模式是否成熟,在洗牌节点来临与巨头羽翼覆盖下,自身还有没有机会,创业者与企业都应该从中寻找自己的机会与跨界领域的嫁接。 它不仅仅是内容和流量的变现工具,更应该是一种营销与商业理念的转变。 不久前,马化腾向青年创业者建议,要关注两个产业跨界的部分,因为将新技术用在两个产业跨界部分往往最有可能诞生创新的机会。 而企业营销如果能从垂直细分领域的切入并借助直播技术与趋势为已所用,往往也能获得新的机会,尽管任何基于行业趋势的预测都意味着不确定性,但抓住不确定性的机会,才能最终在新一轮风口下,把握企业转型与商业、营销模式创新的机会,迎来属于自己的时代。 欢迎互联网创业者加入杭州互联网创业QQ群:157936473直接加QQ或pc上点击加群项目开发咨询:0571-28030088 邓伟根北美硅谷行“捎回”一个MBA授课点 南都讯记者郭伟豪通讯员伍新宇6月7日至16日,佛山市委常委、南海区委书记、佛山高新区党工委书记兼管委会主任邓伟根率领由南海区和佛山高新区相关人员组成的经贸洽谈和友好交流代表团,对新加坡、美国和加拿大进行友好访问。 由于新加坡裕廊、美国硅谷与有“加拿大高科技之都”美誉的万锦市均以发达的高科技产业著名,皆是所在国的硅谷,邓伟根更称此行为“三谷”之行。 在新加坡,邓伟根一行与新加坡淡马锡控股公司相关负责人就双方进一步深化合作进行了深入的探讨。 交流中,新加坡国立大学(N U S)商学院杨贤院长表示有意在南海设立N U S的海外M B A授课点,双方拟于6月下旬就有关意向在南海签订合作协议。 
6月9日,邓伟根一行前往硅谷拜会了硅谷美华科技商会(S V C A C A )和华美半导体协会(C A SPA )。 SV C A C A和CA SPA将通过其广泛的会员和在硅谷等地的影响力,为佛高区、南高区在硅谷进行宣传推介,并积极把有意拓展中国市场的高科技项目推荐到南高区。 代表团一行还到访了南海区政府与万锦市政府联合举办了“南海区与万锦市经贸交流会”。 2012年12月,万锦市市长薛家平先生率团访问南海后,万锦市议会正式通过了为当地一道路命名“南海街”的议案,并于2013年9月举行道路命名仪式。 在本次交流中,邓伟根提议未来也在南海选址命名一条“万锦路”,此举也立即得到薛家平市长的认同。 对于“三谷”之行,邓伟根表示,南海将利用现有的南海乡亲和关系密切的协会等有利资源,计划在“三谷”建立南海和佛高区的海外联络处,学习和吸收海外高科技之都的先进经验,努力将已定位为“中国制造金谷”的佛高区南海核心园打造成为下一个“硅谷”,并争取早日实现佛高区挺进全国国家高新区20强的目标。 内地高中生将通篇学习《道德经》 摘要国内第一套自主研发的高中传统文化通识教材预计将于今年9月出版,四册分别为《论语》《孟子》《大学·中庸》和《道德经》。 2016年高考改革方案中,全国25个省高考要统一命题,并且增加分数后的语文考试,正在研究增加“中华优秀传统文化”之相关内容。 《道德经》成为高中传统文化教材。 法制晚报讯(记者 李文姬 )今天上午,记者从“十二五”教育部规划课题《传统文化与中小学生人格培养研究》总课题组了解到,国内第一套自主研发的高中传统文化通识教材预计将于今年9月出版,四册分别为《论语》《孟子》《大学·中庸》和《道德经》。 至此,课题组已完成了幼儿园、小学、初中、高中各阶段标准化传统文化教材的研发工作,高中国学教材将在各地开展成规模的教材试用工作。 中国国学文化艺术中心秘书长张健表示,目前各地高考改革的几个信号均指向国学,但考什么、怎么考又是一个难题。 专家建议,不应以文言文字词解释等传统形式考查,应关注考生如何消化吸收传统文化中的哲学素养和思想韬略。 教材各年级国学内容全覆盖据 “十二五”教育部规划课题《传统文化与中小学生人格培养研究》总课题组介绍,高中传统文化通识系列教材作为“十一五”、“十二五”两个阶段十年课题研究的重要成果之一,由中国国学文化艺术中心承担资源整合和编著。 去年,教育部印发了《完善中华优秀传统文化教育指导纲要》,要求在课程建设和课程标准修订中强化中华优秀传统文化内容。 在中小学德育、语文、历史等课程标准修订中,增加中华优秀传统文化的比重。 课题组秘书长张健表示,幼儿园、小学、初中、高中各阶段标准化传统文化教材的均已研发完成,明确提出以“青少年完美人格”为传统文化教育目标,教材知识相互关联,自成体系,并通过高中教材实现最终教学评价。 这是“十一五”“十二五”两个阶段十年课题研究的重要成果之一。 今年5月份之前,《高等教育传统文化教材》(12册)《全国行政领导干部国学教材》(10册)两套教材也将研发完毕。 内容高中教材含《论语》《道德经》此次即将出版的高中阶段传统文化通识教材共有4册,供高中一、二年级使用。 高一学习《论语》《孟子》,高二学习《大学·中庸》和《道德经》。 其中《道德经》为原文全本讲解,另外三册则是按主题归类讲解。 如《大学·中庸》一册,分为“慎独”“齐家”“格物致知”“中和”“为政”等章节。 据课题组专家介绍,这4册书并非孤立的高中教材,而是《中华优秀传统文化教育全国中小学实验教材》的高中部分。 全套教材包含小学、初中和高中三个阶段,经专家组反复研讨、论证,制定了“儒学养正、兵学相佑、道法自然、文化浸润”的课程结构,各阶段教学内容和深度循序渐进、系统科学。 事实上,小学高年级段已开始涉及《论语》《孟子》等儒学典籍,但仅以诵读和简单理解为主,到高中阶段,学生可在已有基础上更为深刻地领悟儒道经典的思想内涵,以达到融会贯通的程度。 此外,每一章节在讲解儒道核心精神的同时,还为学生提供了大量中西文化比较等拓展阅读素材。 针对公众关注的一个话题,即传统文化有望成为高考的新考点,课题组表示目前在研发高中传统文化教材的同时,就已开展了另一个重点子课题研究,即传统文化教学评价与考试模式研究。 张健强调高考改革的几个信号均指向国学,例如北京、上海等地公布的高考改革方案中,英语降分后其所降分数分给了语文,而且还更进一步明确指出了就是将分数转移给所增加的“传统文化考试内容”部分。 又如今年清华北大自主招生均招收国学特长生。 此外,近期公布的2016年高考改革方案中,全国25个省高考要统一命题,并且增加分数后的语文考试,正在研究增加“中华优秀传统文化”之相关内容。 
张健表示,传统文化成为高考的又一创新考点指日可待,但考什么、怎么考又是一个重大难题。 由于相关子课题研究还没有结束,课题组非行政机构只承担建议义务。 张健坦言,能否在高考语文中出现一个新的形式——政论或申论形式的传统文化论述题,这一方向应该是研究和创新的改革方向之一。 若2016年传统文化进入高考,最大的问题是很多高中生没有接触过传统文化课程,不具备相关知识储备和素养,国学文化是通过长期熏陶和涵养才能显现的,不是靠一朝一夕突击补课就能拥有的。 悬灸技术培训专家教你艾灸降血糖,为爸妈收好了! 近年来随着我国经济条件的改善和人们生活水平的提高,我国糖尿病的患病率也在逐年上升。 悬灸技术培训的创始人艾灸专家刘全军先生对糖尿病深有研究,接下来,学一学他是怎么用艾灸降血压的吧! 中医认为,糖尿病是气血、阴阳失调等多种原因引起的一种慢性疾病。 虽然分为上消、中消、下消,但是无论何种糖尿病 ,治疗的原则都是荣养阴液,清热润燥。 艾灸对控制血糖效果不错。 艾灸功效:调升元阳降血糖艾灸可以修复受损胰岛细胞,激活再生,逐步实现胰岛素的自给自足。 服药一天比一天少,身体一天比一天好,彻底摆脱终生服药! 还可以双向调节血糖,使血糖老老实实地锁定在正常的恒定值范围。 也可以改善组织供氧,对微血管病变导致的视物不清、眼底出血等视网膜病变及早期肾病病变及早期肾病病变有明显治疗与改善作用,改善病人消瘦无力、免疫力低下、低蛋白质血证及伤口不愈等现象。 艾灸取穴糖尿病艾灸过的穴位有,承浆中脘足三里关元曲骨三阴交、期门太冲下脘天枢气海膈俞膻中、胃俞,这么多穴位可根据患者当时的症状进行选取。 选取后艾灸,每10天为一个疗程,疗程间休息3-5天后继续第二轮的治疗,三个疗程基本可见到理想疗效。 这几个穴位都是具有补充人体元阳功能的大穴和调节脏腑功能的腧穴,从根上调节人体的元阳使阴阳达到新的平衡,五脏六腑尤其是肺、脾肾的功能恢复正常,糖尿病自然也就不药而愈了。 艾灸可以有效控制糖尿病 ,这在很多资料都有报导。 艾灸使病人的营养能得到有效的吸收和利用,从而提高人体的自身免疫功能和抗病防病能力,防止了系列并发症的发生,真正做到综合治疗,标本兼治。 艾灸对于常见病是具有广泛的适应性的。 希望大家把艾灸推广出去,让艾灸这个疗法能够更完善,造福更多的人。 熟食放在垃圾旁无照窝点被取缔 本报讯(记者李涛)又黑又脏的墙面、随意堆放的加工原料、处处弥漫的刺鼻味道。 昨天上午,东小口镇政府与城管、食药、公安等部门开展联合执法行动时,依法取缔了一个位于昌平区东小口镇半截塔村的非法熟食加工窝点。 昨天上午,执法人员对东小口镇半截塔村进行环境整治时,一家挂着“久久鸭”招牌的小店的店主显得有点紧张,还“顺手”把通向后院的门关上了。 执法人员觉得有些蹊跷,便要求到后院进行检查。 一进院子,执法人员就发现大量的熟食加工原料被随意摆放在地上,旁边就堆放着垃圾。 院内煤炉上的一口锅内正煮着的食物,发出刺鼻的味道。 执法队员介绍,在炉子一旁的笸箩里盛着制作好的熟食制品,但却没有任何遮盖,一阵风起,煤灰混着尘土就落在上面。 执法队员说:“走进院旁的小屋内,地上和墙上满是油污,脏乎乎的冰柜上堆放着一袋一袋的半成品,一个个用来盛放熟食制品的笸箩摞在生锈的铁架子上。 ”随后,执法人员仔细查找,没有发现任何消毒设施,调查得知从事加工的人员也没有取得加工熟食应需的健康证。 执法人员随后对店主进行询问,当执法人员要求出示营业执照及卫生许可证时,店主嘟囔了半天才坦白自己不具备任何手续。 执法人员当即对该非法生产窝点进行了取缔,对现场工作人员进行了宣传与教育,并依法没收了加工工具及食品。 ================================================ FILE: lamb_optimizer_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lint as: python2, python3
"""Functions and classes related to optimization (weight updates)."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import re
import six
import tensorflow as tf

# pylint: disable=g-direct-tensorflow-import
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import linalg_ops
from tensorflow.python.ops import math_ops
# pylint: enable=g-direct-tensorflow-import


class LAMBOptimizer(tf.train.Optimizer):
  """LAMB (Layer-wise Adaptive Moments optimizer for Batch training)."""
  # A new optimizer that includes correct L2 weight decay, adaptive
  # element-wise updating, and layer-wise justification. The LAMB optimizer
  # was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,
  # James Demmel, and Cho-Jui Hsieh in a paper titled as Reducing BERT
  # Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962)

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               exclude_from_layer_adaptation=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay
    # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the
    # arg is None.
    # TODO(jingli): validate if exclude_from_layer_adaptation is necessary.
    if exclude_from_layer_adaptation:
      self.exclude_from_layer_adaptation = exclude_from_layer_adaptation
    else:
      self.exclude_from_layer_adaptation = exclude_from_weight_decay

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """See base class."""
    assignments = []
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=six.ensure_str(param_name) + "/adam_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=six.ensure_str(param_name) + "/adam_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      ratio = 1.0
      if self._do_layer_adaptation(param_name):
        w_norm = linalg_ops.norm(param, ord=2)
        g_norm = linalg_ops.norm(update, ord=2)
        ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
            math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)

      update_with_lr = ratio * self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _do_layer_adaptation(self, param_name):
    """Whether to do layer-wise learning rate adaptation for `param_name`."""
    if self.exclude_from_layer_adaptation:
      for r in self.exclude_from_layer_adaptation:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
    if m is not None:
      param_name = m.group(1)
    return param_name

================================================
FILE: modeling.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The main BERT model and related functions."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import copy
import json
import math
import re
import numpy as np
import six
import tensorflow as tf
import bert_utils


class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").

  Example usage:

  ```python
  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

  # hidden_size must be a multiple of num_attention_heads.
  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024)

  model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  ```
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               scope=None):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids, but use the style of
        # factorized embedding parameterization from albert.
        # add by brightmart, 2019-09-28
        (self.embedding_output, self.embedding_table,
         self.embedding_table_2) = embedding_lookup_factorized(
             input_ids=input_ids,
             vocab_size=config.vocab_size,
             hidden_size=config.hidden_size,
             embedding_size=config.embedding_size,
             initializer_range=config.initializer_range,
             word_embedding_name="word_embeddings",
             use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        ln_type = config.ln_type
        print("ln_type:", ln_type)
        if ln_type == 'postln' or ln_type is None:
          # currently, base or large of albert used post-LN structure
          print("old structure of transformer.use: transformer_model,which use post-LN")
          self.all_encoder_layers = transformer_model(
              input_tensor=self.embedding_output,
              attention_mask=attention_mask,
              hidden_size=config.hidden_size,
              num_hidden_layers=config.num_hidden_layers,
              num_attention_heads=config.num_attention_heads,
              intermediate_size=config.intermediate_size,
              intermediate_act_fn=get_activation(config.hidden_act),
              hidden_dropout_prob=config.hidden_dropout_prob,
              attention_probs_dropout_prob=config.attention_probs_dropout_prob,
              initializer_range=config.initializer_range,
              do_return_all_layers=True)
        else:
          # xlarge or xxlarge of albert, used pre-LN structure
          print("new structure of transformer.use: prelln_transformer_model,which use pre-LN")
          # change by brightmart, 4th, oct, 2019. pre-Layer Normalization can
          # converge fast and better. check paper: ON LAYER NORMALIZATION IN THE
          # TRANSFORMER ARCHITECTURE
          self.all_encoder_layers = prelln_transformer_model(
              input_tensor=self.embedding_output,
              attention_mask=attention_mask,
              hidden_size=config.hidden_size,
              num_hidden_layers=config.num_hidden_layers,
              num_attention_heads=config.num_attention_heads,
              intermediate_size=config.intermediate_size,
              intermediate_act_fn=get_activation(config.hidden_act),
              hidden_dropout_prob=config.hidden_dropout_prob,
              attention_probs_dropout_prob=config.attention_probs_dropout_prob,
              initializer_range=config.initializer_range,
              do_return_all_layers=True,
              shared_type='all')  # do_return_all_layers=True

      self.sequence_output = self.all_encoder_layers[-1]  # [batch_size, seq_length, hidden_size]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

  def get_pooled_output(self):
    return self.pooled_output

  def get_sequence_output(self):
    """Gets final hidden layer of encoder.

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the final hidden of the transformer encoder.
    """
    return self.sequence_output

  def get_all_encoder_layers(self):
    return self.all_encoder_layers

  def get_embedding_output(self):
    """Gets output of the embedding lookup (i.e., input to the transformer).

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the output of the embedding layer, after summing the word
      embeddings with the positional embeddings and the token type embeddings,
      then performing layer normalization. This is the input to the transformer.
    """
    return self.embedding_output

  def get_embedding_table(self):
    return self.embedding_table

  def get_embedding_table_2(self):
    return self.embedding_table_2


def gelu(x):
  """Gaussian Error Linear Unit.

  This is a smoother version of the RELU.
  Original paper: https://arxiv.org/abs/1606.08415
  Args:
    x: float Tensor to perform activation.

  Returns:
    `x` with the GELU activation applied.
  """
  cdf = 0.5 * (1.0 + tf.tanh(
      (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
  return x * cdf


def get_activation(activation_string):
  """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.

  Args:
    activation_string: String name of the activation function.

  Returns:
    A Python function corresponding to the activation function. If
    `activation_string` is None, empty, or "linear", this will return None.
    If `activation_string` is not a string, it will return `activation_string`.

  Raises:
    ValueError: The `activation_string` does not correspond to a known
      activation.
  """

  # We assume that anything that's not a string is already an activation
  # function, so we just return it.
  if not isinstance(activation_string, six.string_types):
    return activation_string

  if not activation_string:
    return None

  act = activation_string.lower()
  if act == "linear":
    return None
  elif act == "relu":
    return tf.nn.relu
  elif act == "gelu":
    return gelu
  elif act == "tanh":
    return tf.tanh
  else:
    raise ValueError("Unsupported activation: %s" % act)


def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
  """Compute the union of the current variables and checkpoint variables."""
  assignment_map = {}
  initialized_variable_names = {}

  name_to_variable = collections.OrderedDict()
  for var in tvars:
    name = var.name
    m = re.match("^(.*):\\d+$", name)
    if m is not None:
      name = m.group(1)
    name_to_variable[name] = var

  init_vars = tf.train.list_variables(init_checkpoint)

  assignment_map = collections.OrderedDict()
  for x in init_vars:
    (name, var) = (x[0], x[1])
    if name not in name_to_variable:
      continue
    assignment_map[name] = name
    initialized_variable_names[name] = 1
    initialized_variable_names[name + ":0"] = 1

  return (assignment_map, initialized_variable_names)


def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: float Tensor.
    dropout_prob: Python float. The probability of dropping out a value (NOT of
      *keeping* a dimension as in `tf.nn.dropout`).

  Returns:
    A version of `input_tensor` with dropout applied.
  """
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output


def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)


def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  """Runs layer normalization followed by dropout."""
  output_tensor = layer_norm(input_tensor, name)
  output_tensor = dropout(output_tensor, dropout_prob)
  return output_tensor


def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)


def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])  # shape of input_ids is:[ batch_size, seq_length, 1]

  embedding_table = tf.get_variable(  # [vocab_size, embedding_size]
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # one rank. shape as (batch_size * sequence_length,)
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)  # one_hot_input_ids=[batch_size * sequence_length,vocab_size]
    output = tf.matmul(one_hot_input_ids, embedding_table)  # output=[batch_size * sequence_length,embedding_size]
  else:
    output = tf.gather(embedding_table, flat_input_ids)  # [vocab_size, embedding_size]*[batch_size * sequence_length,]--->[batch_size * sequence_length,embedding_size]

  input_shape = get_shape_list(input_ids)  # input_shape=[ batch_size, seq_length, 1]

  output = tf.reshape(output, input_shape[0:-1] + [input_shape[-1] * embedding_size])  # output=[batch_size,sequence_length,embedding_size]
  return (output, embedding_table)


def embedding_lookup_factorized(input_ids,  # Factorized embedding parameterization provided by albert
                                vocab_size,
                                hidden_size,
                                embedding_size=128,
                                initializer_range=0.02,
                                word_embedding_name="word_embeddings",
                                use_one_hot_embeddings=False):
  """Looks up word embeddings for an id tensor in ALBERT's factorized style.

  Factorizing the embedding matrix greatly reduces the number of embedding
  parameters. See the "Factorized embedding parameterization" section of the
  ALBERT paper.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    hidden_size: int. Width of the hidden space that the word embeddings are
      projected to.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    A tuple (output, embedding_table, project_variable), where output is a
    float Tensor of shape [batch_size, seq_length, hidden_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].

  # 1. first project one-hot vectors into a lower dimensional embedding space of size E
  print("embedding_lookup_factorized. factorized embedding parameterization is used.")
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])  # shape of input_ids is:[ batch_size, seq_length, 1]

  embedding_table = tf.get_variable(  # [vocab_size, embedding_size]
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # one rank. shape as (batch_size * sequence_length,)
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)  # one_hot_input_ids=[batch_size * sequence_length,vocab_size]
    output_middle = tf.matmul(one_hot_input_ids, embedding_table)  # output=[batch_size * sequence_length,embedding_size]
  else:
    output_middle = tf.gather(embedding_table, flat_input_ids)  # [vocab_size, embedding_size]*[batch_size * sequence_length,]--->[batch_size * sequence_length,embedding_size]

  # 2. project vector(output_middle) to the hidden space
  project_variable = tf.get_variable(  # [embedding_size, hidden_size]
      name=word_embedding_name+"_2",
      shape=[embedding_size, hidden_size],
      initializer=create_initializer(initializer_range))
  output = tf.matmul(output_middle, project_variable)  # ([batch_size * sequence_length, embedding_size] * [embedding_size, hidden_size])--->[batch_size * sequence_length, hidden_size]
  # reshape back to 3 rank
  input_shape = get_shape_list(input_ids)  # input_shape=[ batch_size, seq_length, 1]
  batch_size, sequence_length, _ = input_shape
  output = tf.reshape(output, (batch_size, sequence_length, hidden_size))  # output=[batch_size, sequence_length, hidden_size]
  return (output, embedding_table, project_variable)


def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model.
      This can be longer than the sequence length of input_tensor, but cannot
      be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output


def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.

  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].

  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding)
  # tokens so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask


def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention is done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform. attention_probs_dropout_prob: (optional) float. Dropout probability of the attention probabilities. initializer_range: float. Range of the weight initializer. do_return_2d_tensor: bool. If True, the output will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]. If False, the output will be of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. batch_size: (Optional) int. If the input is 2D, this might be the batch size of the 3D version of the `from_tensor` and `to_tensor`. from_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `from_tensor`. to_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `to_tensor`. Returns: float Tensor of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is true, this will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]). Raises: ValueError: Any of the arguments or tensor shapes are invalid. 
""" def transpose_for_scores(input_tensor, batch_size, num_attention_heads, seq_length, width): output_tensor = tf.reshape( input_tensor, [batch_size, seq_length, num_attention_heads, width]) output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3]) return output_tensor from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) if len(from_shape) != len(to_shape): raise ValueError( "The rank of `from_tensor` must match the rank of `to_tensor`.") if len(from_shape) == 3: batch_size = from_shape[0] from_seq_length = from_shape[1] to_seq_length = to_shape[1] elif len(from_shape) == 2: if (batch_size is None or from_seq_length is None or to_seq_length is None): raise ValueError( "When passing in rank 2 tensors to attention_layer, the values " "for `batch_size`, `from_seq_length`, and `to_seq_length` " "must all be specified.") # Scalar dimensions referenced here: # B = batch size (number of sequences) # F = `from_tensor` sequence length # T = `to_tensor` sequence length # N = `num_attention_heads` # H = `size_per_head` from_tensor_2d = reshape_to_matrix(from_tensor) to_tensor_2d = reshape_to_matrix(to_tensor) # `query_layer` = [B*F, N*H] query_layer = tf.layers.dense( from_tensor_2d, num_attention_heads * size_per_head, activation=query_act, name="query", kernel_initializer=create_initializer(initializer_range)) # `key_layer` = [B*T, N*H] key_layer = tf.layers.dense( to_tensor_2d, num_attention_heads * size_per_head, activation=key_act, name="key", kernel_initializer=create_initializer(initializer_range)) # `value_layer` = [B*T, N*H] value_layer = tf.layers.dense( to_tensor_2d, num_attention_heads * size_per_head, activation=value_act, name="value", kernel_initializer=create_initializer(initializer_range)) # `query_layer` = [B, N, F, H] query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads, from_seq_length, size_per_head) # `key_layer` = [B, N, T, H] key_layer = 
transpose_for_scores(key_layer, batch_size, num_attention_heads, to_seq_length, size_per_head) # Take the dot product between "query" and "key" to get the raw # attention scores. # `attention_scores` = [B, N, F, T] attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head))) if attention_mask is not None: # `attention_mask` = [B, 1, F, T] attention_mask = tf.expand_dims(attention_mask, axis=[1]) # Since attention_mask is 1.0 for positions we want to attend and 0.0 for # masked positions, this operation will create a tensor which is 0.0 for # positions we want to attend and -10000.0 for masked positions. adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 # Since we are adding it to the raw scores before the softmax, this is # effectively the same as removing these entirely. attention_scores += adder # Normalize the attention scores to probabilities. # `attention_probs` = [B, N, F, T] attention_probs = tf.nn.softmax(attention_scores) # This is actually dropping out entire tokens to attend to, which might # seem a bit unusual, but is taken from the original Transformer paper. 
attention_probs = dropout(attention_probs, attention_probs_dropout_prob) # `value_layer` = [B, T, N, H] value_layer = tf.reshape( value_layer, [batch_size, to_seq_length, num_attention_heads, size_per_head]) # `value_layer` = [B, N, T, H] value_layer = tf.transpose(value_layer, [0, 2, 1, 3]) # `context_layer` = [B, N, F, H] context_layer = tf.matmul(attention_probs, value_layer) # `context_layer` = [B, F, N, H] context_layer = tf.transpose(context_layer, [0, 2, 1, 3]) if do_return_2d_tensor: # `context_layer` = [B*F, N*H] context_layer = tf.reshape( context_layer, [batch_size * from_seq_length, num_attention_heads * size_per_head]) else: # `context_layer` = [B, F, N*H] context_layer = tf.reshape( context_layer, [batch_size, from_seq_length, num_attention_heads * size_per_head]) return context_layer def transformer_model(input_tensor, attention_mask=None, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, intermediate_act_fn=gelu, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, initializer_range=0.02, do_return_all_layers=False, share_parameter_across_layers=True): """Multi-headed, multi-layer Transformer from "Attention is All You Need". This is almost an exact implementation of the original Transformer encoder. See the original paper: https://arxiv.org/abs/1706.03762 Also see: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py Args: input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size]. attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length, seq_length], with 1 for positions that can be attended to and 0 in positions that should not be. hidden_size: int. Hidden size of the Transformer. num_hidden_layers: int. Number of layers (blocks) in the Transformer. num_attention_heads: int. Number of attention heads in the Transformer. intermediate_size: int. The size of the "intermediate" (a.k.a., feed forward) layer. intermediate_act_fn: function. 
The non-linear activation function to apply to the output of the intermediate/feed-forward layer. hidden_dropout_prob: float. Dropout probability for the hidden layers. attention_probs_dropout_prob: float. Dropout probability of the attention probabilities. initializer_range: float. Range of the initializer (stddev of truncated normal). do_return_all_layers: Whether to also return all layers or just the final layer. share_parameter_across_layers: bool. If True, every layer reuses the variables in the `layer_shared` scope; if False, each layer gets its own `layer_%d` scope. Returns: float Tensor of shape [batch_size, seq_length, hidden_size], the final hidden layer of the Transformer. Raises: ValueError: A Tensor shape or parameter is invalid. """ if hidden_size % num_attention_heads != 0: raise ValueError( "The hidden size (%d) is not a multiple of the number of attention " "heads (%d)" % (hidden_size, num_attention_heads)) attention_head_size = int(hidden_size / num_attention_heads) input_shape = get_shape_list(input_tensor, expected_rank=3) batch_size = input_shape[0] seq_length = input_shape[1] input_width = input_shape[2] # The Transformer performs sum residuals on all layers so the input needs # to be the same as the hidden size. if input_width != hidden_size: raise ValueError("The width of the input tensor (%d) != hidden size (%d)" % (input_width, hidden_size)) # We keep the representation as a 2D tensor to avoid re-shaping it back and # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on # the GPU/CPU but may not be free on the TPU, so we want to minimize them to # help the optimizer. prev_output = reshape_to_matrix(input_tensor) all_layer_outputs = [] for layer_idx in range(num_hidden_layers): if share_parameter_across_layers: name_variable_scope="layer_shared" else: name_variable_scope="layer_%d" % layer_idx # Share all parameters across layers (added by brightmart, 2019-09-28). Previously the scope name was always "layer_%d" % layer_idx.
with tf.variable_scope(name_variable_scope, reuse=True if (share_parameter_across_layers and layer_idx>0) else False): layer_input = prev_output with tf.variable_scope("attention"): attention_heads = [] with tf.variable_scope("self"): attention_head = attention_layer( from_tensor=layer_input, to_tensor=layer_input, attention_mask=attention_mask, num_attention_heads=num_attention_heads, size_per_head=attention_head_size, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range, do_return_2d_tensor=True, batch_size=batch_size, from_seq_length=seq_length, to_seq_length=seq_length) attention_heads.append(attention_head) attention_output = None if len(attention_heads) == 1: attention_output = attention_heads[0] else: # In the case where we have other sequences, we just concatenate # them to the self-attention head before the projection. attention_output = tf.concat(attention_heads, axis=-1) # Run a linear projection of `hidden_size` then add a residual # with `layer_input`. with tf.variable_scope("output"): attention_output = tf.layers.dense( attention_output, hidden_size, kernel_initializer=create_initializer(initializer_range)) attention_output = dropout(attention_output, hidden_dropout_prob) attention_output = layer_norm(attention_output + layer_input) # The activation is only applied to the "intermediate" hidden layer. with tf.variable_scope("intermediate"): intermediate_output = tf.layers.dense( attention_output, intermediate_size, activation=intermediate_act_fn, kernel_initializer=create_initializer(initializer_range)) # Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"): layer_output = tf.layers.dense( intermediate_output, hidden_size, kernel_initializer=create_initializer(initializer_range)) layer_output = dropout(layer_output, hidden_dropout_prob) layer_output = layer_norm(layer_output + attention_output) prev_output = layer_output all_layer_outputs.append(layer_output) if do_return_all_layers: final_outputs = [] for layer_output in all_layer_outputs: final_output = reshape_from_matrix(layer_output, input_shape) final_outputs.append(final_output) return final_outputs else: final_output = reshape_from_matrix(prev_output, input_shape) return final_output def get_shape_list(tensor, expected_rank=None, name=None): """Returns a list of the shape of tensor, preferring static dimensions. Args: tensor: A tf.Tensor object to find the shape of. expected_rank: (optional) int or list of ints. The expected rank of `tensor`. If this is specified and the `tensor` has a different rank, an exception will be thrown. name: Optional name of the tensor for the error message. Returns: A list of dimensions of the shape of tensor. All static dimensions will be returned as Python integers, and dynamic dimensions will be returned as tf.Tensor scalars. """ if name is None: name = tensor.name if expected_rank is not None: assert_rank(tensor, expected_rank, name) shape = tensor.shape.as_list() non_static_indexes = [] for (index, dim) in enumerate(shape): if dim is None: non_static_indexes.append(index) if not non_static_indexes: return shape dyn_shape = tf.shape(tensor) for index in non_static_indexes: shape[index] = dyn_shape[index] return shape def reshape_to_matrix(input_tensor): """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix).""" ndims = input_tensor.shape.ndims if ndims < 2: raise ValueError("Input tensor must have at least rank 2. 
Shape = %s" % (input_tensor.shape)) if ndims == 2: return input_tensor width = input_tensor.shape[-1] output_tensor = tf.reshape(input_tensor, [-1, width]) return output_tensor def reshape_from_matrix(output_tensor, orig_shape_list): """Reshapes a rank 2 tensor back to its original rank >= 2 tensor.""" if len(orig_shape_list) == 2: return output_tensor output_shape = get_shape_list(output_tensor) orig_dims = orig_shape_list[0:-1] width = output_shape[-1] return tf.reshape(output_tensor, orig_dims + [width]) def assert_rank(tensor, expected_rank, name=None): """Raises an exception if the tensor rank is not of the expected rank. Args: tensor: A tf.Tensor to check the rank of. expected_rank: Python integer or list of integers, expected rank. name: Optional name of the tensor for the error message. Raises: ValueError: If the expected shape doesn't match the actual shape. """ if name is None: name = tensor.name expected_rank_dict = {} if isinstance(expected_rank, six.integer_types): expected_rank_dict[expected_rank] = True else: for x in expected_rank: expected_rank_dict[x] = True actual_rank = tensor.shape.ndims if actual_rank not in expected_rank_dict: scope_name = tf.get_variable_scope().name raise ValueError( "For the tensor `%s` in scope `%s`, the actual rank " "`%d` (shape = %s) is not equal to the expected rank `%s`" % (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) def prelln_transformer_model(input_tensor, attention_mask=None, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, intermediate_act_fn=gelu, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, initializer_range=0.02, do_return_all_layers=False, shared_type='all', # None, adapter_fn=None): """Multi-headed, multi-layer Transformer from "Attention is All You Need". This is almost an exact implementation of the original Transformer encoder. 
See the original paper: https://arxiv.org/abs/1706.03762 Also see: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py Args: input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size]. attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length, seq_length], with 1 for positions that can be attended to and 0 in positions that should not be. hidden_size: int. Hidden size of the Transformer. num_hidden_layers: int. Number of layers (blocks) in the Transformer. num_attention_heads: int. Number of attention heads in the Transformer. intermediate_size: int. The size of the "intermediate" (a.k.a., feed forward) layer. intermediate_act_fn: function. The non-linear activation function to apply to the output of the intermediate/feed-forward layer. hidden_dropout_prob: float. Dropout probability for the hidden layers. attention_probs_dropout_prob: float. Dropout probability of the attention probabilities. initializer_range: float. Range of the initializer (stddev of truncated normal). do_return_all_layers: Whether to also return all layers or just the final layer. Returns: float Tensor of shape [batch_size, seq_length, hidden_size], the final hidden layer of the Transformer. Raises: ValueError: A Tensor shape or parameter is invalid. """ if hidden_size % num_attention_heads != 0: raise ValueError( "The hidden size (%d) is not a multiple of the number of attention " "heads (%d)" % (hidden_size, num_attention_heads)) attention_head_size = int(hidden_size / num_attention_heads) input_shape = bert_utils.get_shape_list(input_tensor, expected_rank=3) batch_size = input_shape[0] seq_length = input_shape[1] input_width = input_shape[2] # The Transformer performs sum residuals on all layers so the input needs # to be the same as the hidden size. 
if input_width != hidden_size: raise ValueError("The width of the input tensor (%d) != hidden size (%d)" % (input_width, hidden_size)) # We keep the representation as a 2D tensor to avoid re-shaping it back and # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on # the GPU/CPU but may not be free on the TPU, so we want to minimize them to # help the optimizer. prev_output = bert_utils.reshape_to_matrix(input_tensor) all_layer_outputs = [] def layer_scope(idx, shared_type): if shared_type == 'all': tmp = { "layer":"layer_shared", 'attention':'attention', 'intermediate':'intermediate', 'output':'output' } elif shared_type == 'attention': tmp = { "layer":"layer_shared", 'attention':'attention', 'intermediate':'intermediate_{}'.format(idx), 'output':'output_{}'.format(idx) } elif shared_type == 'ffn': tmp = { "layer":"layer_shared", 'attention':'attention_{}'.format(idx), 'intermediate':'intermediate', 'output':'output' } else: tmp = { "layer":"layer_{}".format(idx), 'attention':'attention', 'intermediate':'intermediate', 'output':'output' } return tmp all_layer_outputs = [] for layer_idx in range(num_hidden_layers): idx_scope = layer_scope(layer_idx, shared_type) with tf.variable_scope(idx_scope['layer'], reuse=tf.AUTO_REUSE): layer_input = prev_output with tf.variable_scope(idx_scope['attention'], reuse=tf.AUTO_REUSE): attention_heads = [] with tf.variable_scope("output", reuse=tf.AUTO_REUSE): layer_input_pre = layer_norm(layer_input) with tf.variable_scope("self"): attention_head = attention_layer( from_tensor=layer_input_pre, to_tensor=layer_input_pre, attention_mask=attention_mask, num_attention_heads=num_attention_heads, size_per_head=attention_head_size, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range, do_return_2d_tensor=True, batch_size=batch_size, from_seq_length=seq_length, to_seq_length=seq_length) attention_heads.append(attention_head) attention_output = None if len(attention_heads) == 1: 
attention_output = attention_heads[0] else: # In the case where we have other sequences, we just concatenate # them to the self-attention head before the projection. attention_output = tf.concat(attention_heads, axis=-1) # Run a linear projection of `hidden_size` then add a residual # with `layer_input`. with tf.variable_scope("output", reuse=tf.AUTO_REUSE): attention_output = tf.layers.dense( attention_output, hidden_size, kernel_initializer=create_initializer(initializer_range)) attention_output = dropout(attention_output, hidden_dropout_prob) # attention_output = layer_norm(attention_output + layer_input) attention_output = attention_output + layer_input with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE): attention_output_pre = layer_norm(attention_output) # The activation is only applied to the "intermediate" hidden layer. with tf.variable_scope(idx_scope['intermediate'], reuse=tf.AUTO_REUSE): intermediate_output = tf.layers.dense( attention_output_pre, intermediate_size, activation=intermediate_act_fn, kernel_initializer=create_initializer(initializer_range)) # Down-project back to `hidden_size` then add the residual. 
with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE): layer_output = tf.layers.dense( intermediate_output, hidden_size, kernel_initializer=create_initializer(initializer_range)) layer_output = dropout(layer_output, hidden_dropout_prob) # layer_output = layer_norm(layer_output + attention_output) layer_output = layer_output + attention_output prev_output = layer_output all_layer_outputs.append(layer_output) if do_return_all_layers: final_outputs = [] for layer_output in all_layer_outputs: final_output = bert_utils.reshape_from_matrix(layer_output, input_shape) final_outputs.append(final_output) return final_outputs else: final_output = bert_utils.reshape_from_matrix(prev_output, input_shape) return final_output ================================================ FILE: modeling_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 """The main ALBERT model and related functions. For a description of the algorithm, see https://arxiv.org/abs/1909.11942. """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import copy import json import math import re import numpy as np import six from six.moves import range import tensorflow as tf class AlbertConfig(object): """Configuration for `AlbertModel`. 
The default settings match the configuration of model `albert_xxlarge`. """ def __init__(self, vocab_size, embedding_size=128, hidden_size=4096, num_hidden_layers=12, num_hidden_groups=1, num_attention_heads=64, intermediate_size=16384, inner_group_num=1, down_scale_factor=1, hidden_act="gelu", hidden_dropout_prob=0, attention_probs_dropout_prob=0, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02): """Constructs AlbertConfig. Args: vocab_size: Vocabulary size of `input_ids` in `AlbertModel`. embedding_size: Size of the vocabulary embeddings. hidden_size: Size of the encoder layers and the pooler layer. num_hidden_layers: Number of hidden layers in the Transformer encoder. num_hidden_groups: Number of groups for the hidden layers; parameters in the same group are shared. num_attention_heads: Number of attention heads for each attention layer in the Transformer encoder. intermediate_size: The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. inner_group_num: int, number of inner repetitions of attention and ffn. down_scale_factor: float, the scale to apply. hidden_act: The non-linear activation function (function or string) in the encoder and pooler. hidden_dropout_prob: The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_probs_dropout_prob: The dropout ratio for the attention probabilities. max_position_embeddings: The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). type_vocab_size: The vocabulary size of the `token_type_ids` passed into `AlbertModel`. initializer_range: The stddev of the truncated_normal_initializer for initializing all weight matrices.
""" self.vocab_size = vocab_size self.embedding_size = embedding_size self.hidden_size = hidden_size self.num_hidden_layers = num_hidden_layers self.num_hidden_groups = num_hidden_groups self.num_attention_heads = num_attention_heads self.inner_group_num = inner_group_num self.down_scale_factor = down_scale_factor self.hidden_act = hidden_act self.intermediate_size = intermediate_size self.hidden_dropout_prob = hidden_dropout_prob self.attention_probs_dropout_prob = attention_probs_dropout_prob self.max_position_embeddings = max_position_embeddings self.type_vocab_size = type_vocab_size self.initializer_range = initializer_range @classmethod def from_dict(cls, json_object): """Constructs a `AlbertConfig` from a Python dictionary of parameters.""" config = AlbertConfig(vocab_size=None) for (key, value) in six.iteritems(json_object): config.__dict__[key] = value return config @classmethod def from_json_file(cls, json_file): """Constructs a `AlbertConfig` from a json file of parameters.""" with tf.gfile.GFile(json_file, "r") as reader: text = reader.read() return cls.from_dict(json.loads(text)) def to_dict(self): """Serializes this instance to a Python dictionary.""" output = copy.deepcopy(self.__dict__) return output def to_json_string(self): """Serializes this instance to a JSON string.""" return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" class AlbertModel(object): """BERT model ("Bidirectional Encoder Representations from Transformers"). 
Example usage: ```python # Already been converted from strings into ids input_ids = tf.constant([[31, 51, 99], [15, 5, 0]]) input_mask = tf.constant([[1, 1, 1], [1, 1, 0]]) token_type_ids = tf.constant([[0, 0, 1], [0, 1, 0]]) config = modeling.AlbertConfig(vocab_size=32000, hidden_size=512, num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024) model = modeling.AlbertModel(config=config, is_training=True, input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids) label_embeddings = tf.get_variable(...) pooled_output = model.get_pooled_output() logits = tf.matmul(pooled_output, label_embeddings) ... ``` """ def __init__(self, config, is_training, input_ids, input_mask=None, token_type_ids=None, use_one_hot_embeddings=False, scope=None): """Constructor for AlbertModel. Args: config: `AlbertConfig` instance. is_training: bool. True for training model, False for eval model. Controls whether dropout will be applied. input_ids: int32 Tensor of shape [batch_size, seq_length]. input_mask: (optional) int32 Tensor of shape [batch_size, seq_length]. token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. use_one_hot_embeddings: (optional) bool. Whether to use one-hot word embeddings or tf.embedding_lookup() for the word embeddings. scope: (optional) variable scope. Defaults to "bert". Raises: ValueError: The config is invalid or one of the input tensor shapes is invalid.
""" config = copy.deepcopy(config) if not is_training: config.hidden_dropout_prob = 0.0 config.attention_probs_dropout_prob = 0.0 input_shape = get_shape_list(input_ids, expected_rank=2) batch_size = input_shape[0] seq_length = input_shape[1] if input_mask is None: input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32) if token_type_ids is None: token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32) with tf.variable_scope(scope, default_name="bert"): with tf.variable_scope("embeddings"): # Perform embedding lookup on the word ids. (self.word_embedding_output, self.output_embedding_table) = embedding_lookup( input_ids=input_ids, vocab_size=config.vocab_size, embedding_size=config.embedding_size, initializer_range=config.initializer_range, word_embedding_name="word_embeddings", use_one_hot_embeddings=use_one_hot_embeddings) # Add positional embeddings and token type embeddings, then layer # normalize and perform dropout. self.embedding_output = embedding_postprocessor( input_tensor=self.word_embedding_output, use_token_type=True, token_type_ids=token_type_ids, token_type_vocab_size=config.type_vocab_size, token_type_embedding_name="token_type_embeddings", use_position_embeddings=True, position_embedding_name="position_embeddings", initializer_range=config.initializer_range, max_position_embeddings=config.max_position_embeddings, dropout_prob=config.hidden_dropout_prob) with tf.variable_scope("encoder"): # Run the stacked transformer. # `sequence_output` shape = [batch_size, seq_length, hidden_size]. 
self.all_encoder_layers = transformer_model( input_tensor=self.embedding_output, attention_mask=input_mask, hidden_size=config.hidden_size, num_hidden_layers=config.num_hidden_layers, num_hidden_groups=config.num_hidden_groups, num_attention_heads=config.num_attention_heads, intermediate_size=config.intermediate_size, inner_group_num=config.inner_group_num, intermediate_act_fn=get_activation(config.hidden_act), hidden_dropout_prob=config.hidden_dropout_prob, attention_probs_dropout_prob=config.attention_probs_dropout_prob, initializer_range=config.initializer_range, do_return_all_layers=True) self.sequence_output = self.all_encoder_layers[-1] # The "pooler" converts the encoded sequence tensor of shape # [batch_size, seq_length, hidden_size] to a tensor of shape # [batch_size, hidden_size]. This is necessary for segment-level # (or segment-pair-level) classification tasks where we need a fixed # dimensional representation of the segment. with tf.variable_scope("pooler"): # We "pool" the model by simply taking the hidden state corresponding # to the first token. We assume that this has been pre-trained first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1) self.pooled_output = tf.layers.dense( first_token_tensor, config.hidden_size, activation=tf.tanh, kernel_initializer=create_initializer(config.initializer_range)) def get_pooled_output(self): return self.pooled_output def get_sequence_output(self): """Gets final hidden layer of encoder. Returns: float Tensor of shape [batch_size, seq_length, hidden_size] corresponding to the final hidden of the transformer encoder. """ return self.sequence_output def get_all_encoder_layers(self): return self.all_encoder_layers def get_word_embedding_output(self): """Get output of the word(piece) embedding lookup. This is BEFORE positional embeddings and token type embeddings have been added. 
Returns: float Tensor of shape [batch_size, seq_length, embedding_size] corresponding to the output of the word(piece) embedding layer. """ return self.word_embedding_output def get_embedding_output(self): """Gets output of the embedding lookup (i.e., input to the transformer). Returns: float Tensor of shape [batch_size, seq_length, hidden_size] corresponding to the output of the embedding layer, after summing the word embeddings with the positional embeddings and the token type embeddings, then performing layer normalization. This is the input to the transformer. """ return self.embedding_output def get_embedding_table(self): return self.output_embedding_table def gelu(x): """Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415 Args: x: float Tensor to perform activation. Returns: `x` with the GELU activation applied. """ cdf = 0.5 * (1.0 + tf.tanh( (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))) return x * cdf def get_activation(activation_string): """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`. Args: activation_string: String name of the activation function. Returns: A Python function corresponding to the activation function. If `activation_string` is None, empty, or "linear", this will return None. If `activation_string` is not a string, it will return `activation_string`. Raises: ValueError: The `activation_string` does not correspond to a known activation. """ # We assume that anything that's not a string is already an activation # function, so we just return it.
if not isinstance(activation_string, six.string_types): return activation_string if not activation_string: return None act = activation_string.lower() if act == "linear": return None elif act == "relu": return tf.nn.relu elif act == "gelu": return gelu elif act == "tanh": return tf.tanh else: raise ValueError("Unsupported activation: %s" % act) def get_assignment_map_from_checkpoint(tvars, init_checkpoint, num_of_group=0): """Compute the union of the current variables and checkpoint variables.""" assignment_map = {} initialized_variable_names = {} name_to_variable = collections.OrderedDict() for var in tvars: name = var.name m = re.match("^(.*):\\d+$", name) if m is not None: name = m.group(1) name_to_variable[name] = var init_vars = tf.train.list_variables(init_checkpoint) init_vars_name = [name for (name, _) in init_vars] if num_of_group > 0: assignment_map = [] for gid in range(num_of_group): assignment_map.append(collections.OrderedDict()) else: assignment_map = collections.OrderedDict() for name in name_to_variable: if name in init_vars_name: tvar_name = name elif (re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name)) elif (re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name)) elif (re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name)) else: tf.logging.info("name %s does not get matched", name) continue tf.logging.info("name %s match to %s", name, tvar_name) if num_of_group > 0: group_matched = False for gid in range(1, num_of_group): if (("/group_" + str(gid) + "/" in name) or ("/ffn_" + str(gid) + "/" in name) or ("/attention_" + str(gid) + "/" in name)): group_matched = True tf.logging.info("%s belongs to 
%dth", name, gid) assignment_map[gid][tvar_name] = name if not group_matched: assignment_map[0][tvar_name] = name else: assignment_map[tvar_name] = name initialized_variable_names[name] = 1 initialized_variable_names[six.ensure_str(name) + ":0"] = 1 return (assignment_map, initialized_variable_names) def dropout(input_tensor, dropout_prob): """Perform dropout. Args: input_tensor: float Tensor. dropout_prob: Python float. The probability of dropping out a value (NOT of *keeping* a dimension as in `tf.nn.dropout`). Returns: A version of `input_tensor` with dropout applied. """ if dropout_prob is None or dropout_prob == 0.0: return input_tensor output = tf.nn.dropout(input_tensor, rate=dropout_prob) return output def layer_norm(input_tensor, name=None): """Run layer normalization on the last dimension of the tensor.""" return tf.contrib.layers.layer_norm( inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) def layer_norm_and_dropout(input_tensor, dropout_prob, name=None): """Runs layer normalization followed by dropout.""" output_tensor = layer_norm(input_tensor, name) output_tensor = dropout(output_tensor, dropout_prob) return output_tensor def create_initializer(initializer_range=0.02): """Creates a `truncated_normal_initializer` with the given range.""" return tf.truncated_normal_initializer(stddev=initializer_range) def get_timing_signal_1d_given_position(channels, position, min_timescale=1.0, max_timescale=1.0e4): """Get sinusoids of diff frequencies, with timing position given. Adapted from add_timing_signal_1d_given_position in //third_party/py/tensor2tensor/layers/common_attention.py Args: channels: scalar, size of timing embeddings to create. The number of different timescales is equal to channels / 2. 
position: a Tensor with shape [batch, seq_len] min_timescale: a float max_timescale: a float Returns: a Tensor of timing signals [batch, seq_len, channels] """ num_timescales = channels // 2 log_timescale_increment = ( math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1)) inv_timescales = min_timescale * tf.exp( tf.to_float(tf.range(num_timescales)) * -log_timescale_increment) scaled_time = ( tf.expand_dims(tf.to_float(position), 2) * tf.expand_dims( tf.expand_dims(inv_timescales, 0), 0)) signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=2) signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(channels, 2)]]) return signal def embedding_lookup(input_ids, vocab_size, embedding_size=128, initializer_range=0.02, word_embedding_name="word_embeddings", use_one_hot_embeddings=False): """Looks up words embeddings for id tensor. Args: input_ids: int32 Tensor of shape [batch_size, seq_length] containing word ids. vocab_size: int. Size of the embedding vocabulary. embedding_size: int. Width of the word embeddings. initializer_range: float. Embedding initialization range. word_embedding_name: string. Name of the embedding table. use_one_hot_embeddings: bool. If True, use one-hot method for word embeddings. If False, use `tf.nn.embedding_lookup()`. Returns: float Tensor of shape [batch_size, seq_length, embedding_size]. """ # This function assumes that the input is of shape [batch_size, seq_length, # num_inputs]. # # If the input is a 2D tensor of shape [batch_size, seq_length], we # reshape to [batch_size, seq_length, 1]. 
if input_ids.shape.ndims == 2: input_ids = tf.expand_dims(input_ids, axis=[-1]) embedding_table = tf.get_variable( name=word_embedding_name, shape=[vocab_size, embedding_size], initializer=create_initializer(initializer_range)) if use_one_hot_embeddings: flat_input_ids = tf.reshape(input_ids, [-1]) one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) output = tf.matmul(one_hot_input_ids, embedding_table) else: output = tf.nn.embedding_lookup(embedding_table, input_ids) input_shape = get_shape_list(input_ids) output = tf.reshape(output, input_shape[0:-1] + [input_shape[-1] * embedding_size]) return (output, embedding_table) def embedding_postprocessor(input_tensor, use_token_type=False, token_type_ids=None, token_type_vocab_size=16, token_type_embedding_name="token_type_embeddings", use_position_embeddings=True, position_embedding_name="position_embeddings", initializer_range=0.02, max_position_embeddings=512, dropout_prob=0.1): """Performs various post-processing on a word embedding tensor. Args: input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size]. use_token_type: bool. Whether to add embeddings for `token_type_ids`. token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. Must be specified if `use_token_type` is True. token_type_vocab_size: int. The vocabulary size of `token_type_ids`. token_type_embedding_name: string. The name of the embedding table variable for token type ids. use_position_embeddings: bool. Whether to add position embeddings for the position of each token in the sequence. position_embedding_name: string. The name of the embedding table variable for positional embeddings. initializer_range: float. Range of the weight initialization. max_position_embeddings: int. Maximum sequence length that might ever be used with this model. This can be longer than the sequence length of input_tensor, but cannot be shorter. dropout_prob: float. Dropout probability applied to the final output tensor. 
Returns: float tensor with same shape as `input_tensor`. Raises: ValueError: One of the tensor shapes or input values is invalid. """ input_shape = get_shape_list(input_tensor, expected_rank=3) batch_size = input_shape[0] seq_length = input_shape[1] width = input_shape[2] output = input_tensor if use_token_type: if token_type_ids is None: raise ValueError("`token_type_ids` must be specified if" "`use_token_type` is True.") token_type_table = tf.get_variable( name=token_type_embedding_name, shape=[token_type_vocab_size, width], initializer=create_initializer(initializer_range)) # This vocab will be small so we always do one-hot here, since it is always # faster for a small vocabulary. flat_token_type_ids = tf.reshape(token_type_ids, [-1]) one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) token_type_embeddings = tf.reshape(token_type_embeddings, [batch_size, seq_length, width]) output += token_type_embeddings if use_position_embeddings: assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) with tf.control_dependencies([assert_op]): full_position_embeddings = tf.get_variable( name=position_embedding_name, shape=[max_position_embeddings, width], initializer=create_initializer(initializer_range)) # Since the position embedding table is a learned variable, we create it # using a (long) sequence length `max_position_embeddings`. The actual # sequence length might be shorter than this, for faster training of # tasks that do not have long sequences. # # So `full_position_embeddings` is effectively an embedding table # for position [0, 1, 2, ..., max_position_embeddings-1], and the current # sequence has positions [0, 1, 2, ... seq_length-1], so we can just # perform a slice. 
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output


def dense_layer_3d(input_tensor,
                   num_attention_heads,
                   head_size,
                   initializer,
                   activation,
                   name=None):
  """A dense layer with 3D kernel.

  Args:
    input_tensor: float Tensor of shape [batch, seq_length, hidden_size].
    num_attention_heads: Number of attention heads.
    head_size: The size per attention head.
    initializer: Kernel initializer.
    activation: Activation function.
    name: The name scope of this layer.

  Returns:
    float Tensor of shape [batch, seq_length, num_attention_heads, head_size].
  """

  input_shape = get_shape_list(input_tensor)
  hidden_size = input_shape[2]

  with tf.variable_scope(name):
    w = tf.get_variable(
        name="kernel",
        shape=[hidden_size, num_attention_heads * head_size],
        initializer=initializer)
    w = tf.reshape(w, [hidden_size, num_attention_heads, head_size])
    b = tf.get_variable(
        name="bias",
        shape=[num_attention_heads * head_size],
        initializer=tf.zeros_initializer)
    b = tf.reshape(b, [num_attention_heads, head_size])
    ret = tf.einsum("BFH,HND->BFND", input_tensor, w)
    ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dense_layer_3d_proj(input_tensor,
                        hidden_size,
                        head_size,
                        initializer,
                        activation,
                        name=None):
  """A dense layer with 3D kernel for projection.

  Args:
    input_tensor: float Tensor of shape [batch, from_seq_length,
      num_attention_heads, size_per_head].
    hidden_size: The size of the output (hidden) dimension.
    head_size: The size per attention head.
    initializer: Kernel initializer.
    activation: Activation function.
    name: The name scope of this layer.

  Returns:
    float Tensor of shape [batch, from_seq_length, hidden_size].
  """
  input_shape = get_shape_list(input_tensor)
  num_attention_heads = input_shape[2]
  with tf.variable_scope(name):
    w = tf.get_variable(
        name="kernel",
        shape=[num_attention_heads * head_size, hidden_size],
        initializer=initializer)
    w = tf.reshape(w, [num_attention_heads, head_size, hidden_size])
    b = tf.get_variable(
        name="bias", shape=[hidden_size], initializer=tf.zeros_initializer)
    ret = tf.einsum("BFND,NDH->BFH", input_tensor, w)
    ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dense_layer_2d(input_tensor,
                   output_size,
                   initializer,
                   activation,
                   num_attention_heads=1,
                   name=None):
  """A dense layer with 2D kernel.

  Args:
    input_tensor: Float tensor with rank 3.
    output_size: The size of output dimension.
    initializer: Kernel initializer.
    activation: Activation function.
    num_attention_heads: Number of attention heads in the attention layer
      (unused).
    name: The name scope of this layer.

  Returns:
    float Tensor of shape [batch, seq_length, output_size].
  """
  del num_attention_heads  # unused
  input_shape = get_shape_list(input_tensor)
  hidden_size = input_shape[2]
  with tf.variable_scope(name):
    w = tf.get_variable(
        name="kernel",
        shape=[hidden_size, output_size],
        initializer=initializer)
    b = tf.get_variable(
        name="bias", shape=[output_size], initializer=tf.zeros_initializer)
    ret = tf.einsum("BFH,HO->BFO", input_tensor, w)
    ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dot_product_attention(q, k, v, bias, dropout_rate=0.0):
  """Dot-product attention.

  Args:
    q: Tensor with shape [..., length_q, depth_k].
    k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must
      match with q.
    v: Tensor with shape [..., length_kv, depth_v]. Leading dimensions must
      match with q.
    bias: (optional) mask Tensor with 1.0 for positions to attend to and 0.0
      for masked positions. Masked positions get -10000.0 added to their
      logits before the softmax. May be None.
    dropout_rate: a float. Dropout probability for the attention probabilities.

  Returns:
    Tensor with shape [..., length_q, depth_v].
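
Example (a minimal pure-Python sketch of the same arithmetic for one query
vector, not the TensorFlow code path; the helper name `attend_one_query` and
the toy inputs are hypothetical and not part of this module):

```python
import math


def attend_one_query(q, keys, values, mask):
  # q: one query vector; keys/values: one vector per key position;
  # mask: 1.0 where the query may attend, 0.0 where the position is masked.
  scale = 1.0 / math.sqrt(float(len(q)))
  # Scaled raw scores, with -10000.0 added on masked positions.
  logits = [
      scale * sum(qi * ki for qi, ki in zip(q, k)) + (1.0 - m) * -10000.0
      for k, m in zip(keys, mask)
  ]
  # Numerically stable softmax over the key positions.
  top = max(logits)
  exps = [math.exp(logit - top) for logit in logits]
  total = sum(exps)
  probs = [e / total for e in exps]
  # Weighted sum of the value vectors.
  depth_v = len(values[0])
  output = [sum(p * v[d] for p, v in zip(probs, values))
            for d in range(depth_v)]
  return probs, output


probs, output = attend_one_query(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    mask=[1.0, 1.0, 0.0])
```

The third key matches the query exactly, but because it is masked its
probability is effectively zero and the output mixes only the first two values.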
""" logits = tf.matmul(q, k, transpose_b=True) # [..., length_q, length_kv] logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1]))) if bias is not None: # `attention_mask` = [B, T] from_shape = get_shape_list(q) if len(from_shape) == 4: broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32) elif len(from_shape) == 5: # from_shape = [B, N, Block_num, block_size, depth]# broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], from_shape[3], 1], tf.float32) bias = tf.matmul(broadcast_ones, tf.cast(bias, tf.float32), transpose_b=True) # Since attention_mask is 1.0 for positions we want to attend and 0.0 for # masked positions, this operation will create a tensor which is 0.0 for # positions we want to attend and -10000.0 for masked positions. adder = (1.0 - bias) * -10000.0 # Since we are adding it to the raw scores before the softmax, this is # effectively the same as removing these entirely. logits += adder else: adder = 0.0 attention_probs = tf.nn.softmax(logits, name="attention_probs") attention_probs = dropout(attention_probs, dropout_rate) return tf.matmul(attention_probs, v) def attention_layer(from_tensor, to_tensor, attention_mask=None, num_attention_heads=1, query_act=None, key_act=None, value_act=None, attention_probs_dropout_prob=0.0, initializer_range=0.02, batch_size=None, from_seq_length=None, to_seq_length=None): """Performs multi-headed attention from `from_tensor` to `to_tensor`. Args: from_tensor: float Tensor of shape [batch_size, from_seq_length, from_width]. to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. attention_mask: (optional) int32 Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1. num_attention_heads: int. Number of attention heads. 
query_act: (optional) Activation function for the query transform. key_act: (optional) Activation function for the key transform. value_act: (optional) Activation function for the value transform. attention_probs_dropout_prob: (optional) float. Dropout probability of the attention probabilities. initializer_range: float. Range of the weight initializer. batch_size: (Optional) int. If the input is 2D, this might be the batch size of the 3D version of the `from_tensor` and `to_tensor`. from_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `from_tensor`. to_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `to_tensor`. Returns: float Tensor of shape [batch_size, from_seq_length, num_attention_heads, size_per_head]. Raises: ValueError: Any of the arguments or tensor shapes are invalid. """ from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) size_per_head = int(from_shape[2]/num_attention_heads) if len(from_shape) != len(to_shape): raise ValueError( "The rank of `from_tensor` must match the rank of `to_tensor`.") if len(from_shape) == 3: batch_size = from_shape[0] from_seq_length = from_shape[1] to_seq_length = to_shape[1] elif len(from_shape) == 2: if (batch_size is None or from_seq_length is None or to_seq_length is None): raise ValueError( "When passing in rank 2 tensors to attention_layer, the values " "for `batch_size`, `from_seq_length`, and `to_seq_length` " "must all be specified.") # Scalar dimensions referenced here: # B = batch size (number of sequences) # F = `from_tensor` sequence length # T = `to_tensor` sequence length # N = `num_attention_heads` # H = `size_per_head` # `query_layer` = [B, F, N, H] q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head, create_initializer(initializer_range), query_act, "query") # `key_layer` = [B, T, N, H] k = dense_layer_3d(to_tensor, 
num_attention_heads, size_per_head, create_initializer(initializer_range), key_act, "key") # `value_layer` = [B, T, N, H] v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head, create_initializer(initializer_range), value_act, "value") q = tf.transpose(q, [0, 2, 1, 3]) k = tf.transpose(k, [0, 2, 1, 3]) v = tf.transpose(v, [0, 2, 1, 3]) if attention_mask is not None: attention_mask = tf.reshape( attention_mask, [batch_size, 1, to_seq_length, 1]) # 'new_embeddings = [B, N, F, H]' new_embeddings = dot_product_attention(q, k, v, attention_mask, attention_probs_dropout_prob) return tf.transpose(new_embeddings, [0, 2, 1, 3]) def attention_ffn_block(layer_input, hidden_size=768, attention_mask=None, num_attention_heads=1, attention_head_size=64, attention_probs_dropout_prob=0.0, intermediate_size=3072, intermediate_act_fn=None, initializer_range=0.02, hidden_dropout_prob=0.0): """A network with attention-ffn as sub-block. Args: layer_input: float Tensor of shape [batch_size, from_seq_length, from_width]. hidden_size: (optional) int, size of hidden layer. attention_mask: (optional) int32 Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1. num_attention_heads: int. Number of attention heads. attention_head_size: int. Size of attention head. attention_probs_dropout_prob: float. dropout probability for attention_layer intermediate_size: int. Size of intermediate hidden layer. intermediate_act_fn: (optional) Activation function for the intermediate layer. initializer_range: float. Range of the weight initializer. hidden_dropout_prob: (optional) float. Dropout probability of the hidden layer. 
Returns: layer output """ with tf.variable_scope("attention_1"): with tf.variable_scope("self"): attention_output = attention_layer( from_tensor=layer_input, to_tensor=layer_input, attention_mask=attention_mask, num_attention_heads=num_attention_heads, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range) # Run a linear projection of `hidden_size` then add a residual # with `layer_input`. with tf.variable_scope("output"): attention_output = dense_layer_3d_proj( attention_output, hidden_size, attention_head_size, create_initializer(initializer_range), None, name="dense") attention_output = dropout(attention_output, hidden_dropout_prob) attention_output = layer_norm(attention_output + layer_input) with tf.variable_scope("ffn_1"): with tf.variable_scope("intermediate"): intermediate_output = dense_layer_2d( attention_output, intermediate_size, create_initializer(initializer_range), intermediate_act_fn, num_attention_heads=num_attention_heads, name="dense") with tf.variable_scope("output"): ffn_output = dense_layer_2d( intermediate_output, hidden_size, create_initializer(initializer_range), None, num_attention_heads=num_attention_heads, name="dense") ffn_output = dropout(ffn_output, hidden_dropout_prob) ffn_output = layer_norm(ffn_output + attention_output) return ffn_output def transformer_model(input_tensor, attention_mask=None, hidden_size=768, num_hidden_layers=12, num_hidden_groups=12, num_attention_heads=12, intermediate_size=3072, inner_group_num=1, intermediate_act_fn="gelu", hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, initializer_range=0.02, do_return_all_layers=False): """Multi-headed, multi-layer Transformer from "Attention is All You Need". This is almost an exact implementation of the original Transformer encoder. 
  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, input_width].
      If `input_width` differs from `hidden_size`, the input is first projected
      to `hidden_size` by a dense layer.
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length],
      with 1 for positions that can be attended to and 0 in positions that
      should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_hidden_groups: int. Number of groups for the hidden layers; parameters
      in the same group are shared.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    inner_group_num: int, number of inner repetitions of attention and ffn.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to return the outputs of all layers or just
      the final layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer. If `do_return_all_layers` is True, a list
    with one such Tensor per layer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
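
Example (a minimal sketch of how layers are assigned to parameter-sharing
groups, using the same expression as the loop in this function; the helper
name `layer_to_group` is hypothetical and not part of this module, and Python 3
true division is assumed):

```python
def layer_to_group(num_hidden_layers, num_hidden_groups):
  # Consecutive layers map to the same group index. Layers in one group are
  # built under the same "group_%d" variable scope, so they reuse one set of
  # variables.
  return [int(layer_idx / num_hidden_layers * num_hidden_groups)
          for layer_idx in range(num_hidden_layers)]


all_shared = layer_to_group(12, 1)   # every layer falls in group 0
four_groups = layer_to_group(12, 4)  # three consecutive layers per group
```

With `num_hidden_groups=1` all layers share one set of weights, which is the
cross-layer sharing described in the ALBERT paper.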
""" if hidden_size % num_attention_heads != 0: raise ValueError( "The hidden size (%d) is not a multiple of the number of attention " "heads (%d)" % (hidden_size, num_attention_heads)) attention_head_size = hidden_size // num_attention_heads input_shape = get_shape_list(input_tensor, expected_rank=3) input_width = input_shape[2] all_layer_outputs = [] if input_width != hidden_size: prev_output = dense_layer_2d( input_tensor, hidden_size, create_initializer(initializer_range), None, name="embedding_hidden_mapping_in") else: prev_output = input_tensor with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE): for layer_idx in range(num_hidden_layers): group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups) with tf.variable_scope("group_%d" % group_idx): with tf.name_scope("layer_%d" % layer_idx): layer_output = prev_output for inner_group_idx in range(inner_group_num): with tf.variable_scope("inner_group_%d" % inner_group_idx): layer_output = attention_ffn_block( layer_output, hidden_size, attention_mask, num_attention_heads, attention_head_size, attention_probs_dropout_prob, intermediate_size, intermediate_act_fn, initializer_range, hidden_dropout_prob) prev_output = layer_output all_layer_outputs.append(layer_output) if do_return_all_layers: return all_layer_outputs else: return all_layer_outputs[-1] def get_shape_list(tensor, expected_rank=None, name=None): """Returns a list of the shape of tensor, preferring static dimensions. Args: tensor: A tf.Tensor object to find the shape of. expected_rank: (optional) int. The expected rank of `tensor`. If this is specified and the `tensor` has a different rank, and exception will be thrown. name: Optional name of the tensor for the error message. Returns: A list of dimensions of the shape of tensor. All static dimensions will be returned as python integers, and dynamic dimensions will be returned as tf.Tensor scalars. 
""" if name is None: name = tensor.name if expected_rank is not None: assert_rank(tensor, expected_rank, name) shape = tensor.shape.as_list() non_static_indexes = [] for (index, dim) in enumerate(shape): if dim is None: non_static_indexes.append(index) if not non_static_indexes: return shape dyn_shape = tf.shape(tensor) for index in non_static_indexes: shape[index] = dyn_shape[index] return shape def reshape_to_matrix(input_tensor): """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix).""" ndims = input_tensor.shape.ndims if ndims < 2: raise ValueError("Input tensor must have at least rank 2. Shape = %s" % (input_tensor.shape)) if ndims == 2: return input_tensor width = input_tensor.shape[-1] output_tensor = tf.reshape(input_tensor, [-1, width]) return output_tensor def reshape_from_matrix(output_tensor, orig_shape_list): """Reshapes a rank 2 tensor back to its original rank >= 2 tensor.""" if len(orig_shape_list) == 2: return output_tensor output_shape = get_shape_list(output_tensor) orig_dims = orig_shape_list[0:-1] width = output_shape[-1] return tf.reshape(output_tensor, orig_dims + [width]) def assert_rank(tensor, expected_rank, name=None): """Raises an exception if the tensor rank is not of the expected rank. Args: tensor: A tf.Tensor to check the rank of. expected_rank: Python integer or list of integers, expected rank. name: Optional name of the tensor for the error message. Raises: ValueError: If the expected shape doesn't match the actual shape. 
""" if name is None: name = tensor.name expected_rank_dict = {} if isinstance(expected_rank, six.integer_types): expected_rank_dict[expected_rank] = True else: for x in expected_rank: expected_rank_dict[x] = True actual_rank = tensor.shape.ndims if actual_rank not in expected_rank_dict: scope_name = tf.get_variable_scope().name raise ValueError( "For the tensor `%s` in scope `%s`, the actual rank " "`%d` (shape = %s) is not equal to the expected rank `%s`" % (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) ================================================ FILE: modeling_google_fast.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 """The main ALBERT model and related functions. For a description of the algorithm, see https://arxiv.org/abs/1909.11942. """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import copy import json import math import re import numpy as np import six from six.moves import range import tensorflow as tf class AlbertConfig(object): """Configuration for `AlbertModel`. The default settings match the configuration of model `albert_xxlarge`. 
""" def __init__(self, vocab_size, embedding_size=128, hidden_size=4096, num_hidden_layers=12, num_hidden_groups=1, num_attention_heads=64, intermediate_size=16384, inner_group_num=1, down_scale_factor=1, hidden_act="gelu", hidden_dropout_prob=0, attention_probs_dropout_prob=0, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02): """Constructs AlbertConfig. Args: vocab_size: Vocabulary size of `inputs_ids` in `AlbertModel`. embedding_size: size of voc embeddings. hidden_size: Size of the encoder layers and the pooler layer. num_hidden_layers: Number of hidden layers in the Transformer encoder. num_hidden_groups: Number of group for the hidden layers, parameters in the same group are shared. num_attention_heads: Number of attention heads for each attention layer in the Transformer encoder. intermediate_size: The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. inner_group_num: int, number of inner repetition of attention and ffn. down_scale_factor: float, the scale to apply hidden_act: The non-linear activation function (function or string) in the encoder and pooler. hidden_dropout_prob: The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_probs_dropout_prob: The dropout ratio for the attention probabilities. max_position_embeddings: The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). type_vocab_size: The vocabulary size of the `token_type_ids` passed into `AlbertModel`. initializer_range: The stdev of the truncated_normal_initializer for initializing all weight matrices. 
""" self.vocab_size = vocab_size self.embedding_size = embedding_size self.hidden_size = hidden_size self.num_hidden_layers = num_hidden_layers self.num_hidden_groups = num_hidden_groups self.num_attention_heads = num_attention_heads self.inner_group_num = inner_group_num self.down_scale_factor = down_scale_factor self.hidden_act = hidden_act self.intermediate_size = intermediate_size self.hidden_dropout_prob = hidden_dropout_prob self.attention_probs_dropout_prob = attention_probs_dropout_prob self.max_position_embeddings = max_position_embeddings self.type_vocab_size = type_vocab_size self.initializer_range = initializer_range @classmethod def from_dict(cls, json_object): """Constructs a `AlbertConfig` from a Python dictionary of parameters.""" config = AlbertConfig(vocab_size=None) for (key, value) in six.iteritems(json_object): config.__dict__[key] = value return config @classmethod def from_json_file(cls, json_file): """Constructs a `AlbertConfig` from a json file of parameters.""" with tf.gfile.GFile(json_file, "r") as reader: text = reader.read() return cls.from_dict(json.loads(text)) def to_dict(self): """Serializes this instance to a Python dictionary.""" output = copy.deepcopy(self.__dict__) return output def to_json_string(self): """Serializes this instance to a JSON string.""" return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" class AlbertModel(object): """BERT model ("Bidirectional Encoder Representations from Transformers"). 
  Example usage:

  ```python
  # Already converted from strings into ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  # Values must be smaller than `type_vocab_size` (2 by default).
  token_type_ids = tf.constant([[0, 0, 1], [0, 1, 0]])

  # `hidden_size` must be a multiple of `num_attention_heads`.
  config = modeling.AlbertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024)

  model = modeling.AlbertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  ```
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               scope=None):
    """Constructor for AlbertModel.

    Args:
      config: `AlbertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
        Defaults to all ones.
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
        Defaults to all zeros.
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.nn.embedding_lookup() for the word embeddings.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
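
Example (a rough sketch of why `config.embedding_size` can be much smaller
than `config.hidden_size`; it assumes the encoder adds a dense mapping from
`embedding_size` to `hidden_size` when the two differ, as `transformer_model`
in modeling_google.py does. The helper name `embedding_params` and the value
`vocab_size=30000` are hypothetical; the other two values are the
`AlbertConfig` defaults):

```python
def embedding_params(vocab_size, embedding_size, hidden_size):
  # Word table of width embedding_size, plus the kernel and bias of the dense
  # layer that maps embedding_size up to hidden_size.
  factorized = (vocab_size * embedding_size
                + embedding_size * hidden_size + hidden_size)
  # A single word table of width hidden_size, with no mapping layer.
  unfactorized = vocab_size * hidden_size
  return factorized, unfactorized


factorized, unfactorized = embedding_params(
    vocab_size=30000, embedding_size=128, hidden_size=4096)
```

Under these assumptions the factorized word embeddings need a few million
parameters instead of over a hundred million.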
""" config = copy.deepcopy(config) if not is_training: config.hidden_dropout_prob = 0.0 config.attention_probs_dropout_prob = 0.0 input_shape = get_shape_list(input_ids, expected_rank=2) batch_size = input_shape[0] seq_length = input_shape[1] if input_mask is None: input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32) if token_type_ids is None: token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32) with tf.variable_scope(scope, default_name="bert"): with tf.variable_scope("embeddings"): # Perform embedding lookup on the word ids. (self.word_embedding_output, self.output_embedding_table) = embedding_lookup( input_ids=input_ids, vocab_size=config.vocab_size, embedding_size=config.embedding_size, initializer_range=config.initializer_range, word_embedding_name="word_embeddings", use_one_hot_embeddings=use_one_hot_embeddings) # Add positional embeddings and token type embeddings, then layer # normalize and perform dropout. self.embedding_output = embedding_postprocessor( input_tensor=self.word_embedding_output, use_token_type=True, token_type_ids=token_type_ids, token_type_vocab_size=config.type_vocab_size, token_type_embedding_name="token_type_embeddings", use_position_embeddings=True, position_embedding_name="position_embeddings", initializer_range=config.initializer_range, max_position_embeddings=config.max_position_embeddings, dropout_prob=config.hidden_dropout_prob) with tf.variable_scope("encoder"): # Run the stacked transformer. # `sequence_output` shape = [batch_size, seq_length, hidden_size]. 
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=input_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_hidden_groups=config.num_hidden_groups,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            inner_group_num=config.inner_group_num,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained.
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

  def get_pooled_output(self):
    return self.pooled_output

  def get_sequence_output(self):
    """Gets final hidden layer of encoder.

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the final hidden layer of the transformer encoder.
    """
    return self.sequence_output

  def get_all_encoder_layers(self):
    return self.all_encoder_layers

  def get_word_embedding_output(self):
    """Get output of the word(piece) embedding lookup.

    This is BEFORE positional embeddings and token type embeddings have been
    added.

    Returns:
      float Tensor of shape [batch_size, seq_length, embedding_size]
      corresponding to the output of the word(piece) embedding layer.
    """
    return self.word_embedding_output

  def get_embedding_output(self):
    """Gets output of the embedding lookup (i.e., input to the transformer).

    Returns:
      float Tensor of shape [batch_size, seq_length, embedding_size]
      corresponding to the output of the embedding layer, after summing the
      word embeddings with the positional embeddings and the token type
      embeddings, then performing layer normalization. This is the input to
      the transformer, which projects it to `hidden_size` when the two widths
      differ.
    """
    return self.embedding_output

  def get_embedding_table(self):
    return self.output_embedding_table


def gelu(x):
  """Gaussian Error Linear Unit.

  This is a smoother version of the ReLU, computed here with the tanh
  approximation.
  Original paper: https://arxiv.org/abs/1606.08415

  Args:
    x: float Tensor to perform activation.

  Returns:
    `x` with the GELU activation applied.
  """
  cdf = 0.5 * (1.0 + tf.tanh(
      (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
  return x * cdf


def get_activation(activation_string):
  """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.

  Args:
    activation_string: String name of the activation function.

  Returns:
    A Python function corresponding to the activation function. If
    `activation_string` is None, empty, or "linear", this will return None.
    If `activation_string` is not a string, it will return `activation_string`.

  Raises:
    ValueError: The `activation_string` does not correspond to a known
      activation.
  """

  # We assume that anything that's not a string is already an activation
  # function, so we just return it.
if not isinstance(activation_string, six.string_types): return activation_string if not activation_string: return None act = activation_string.lower() if act == "linear": return None elif act == "relu": return tf.nn.relu elif act == "gelu": return gelu elif act == "tanh": return tf.tanh elif act == "swish": return lambda x: x * tf.sigmoid(x) else: raise ValueError("Unsupported activation: %s" % act) def get_assignment_map_from_checkpoint(tvars, init_checkpoint, num_of_group=0): """Compute the union of the current variables and checkpoint variables.""" assignment_map = {} initialized_variable_names = {} name_to_variable = collections.OrderedDict() for var in tvars: name = var.name m = re.match("^(.*):\\d+$", name) if m is not None: name = m.group(1) name_to_variable[name] = var init_vars = tf.train.list_variables(init_checkpoint) init_vars_name = [name for (name, _) in init_vars] if num_of_group > 0: assignment_map = [] for gid in range(num_of_group): assignment_map.append(collections.OrderedDict()) else: assignment_map = collections.OrderedDict() for name in name_to_variable: if name in init_vars_name: tvar_name = name elif (re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/group_\d+/", "/group_0/", six.ensure_str(name)) elif (re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/ffn_\d+/", "/ffn_1/", six.ensure_str(name)) elif (re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name)) in init_vars_name and num_of_group > 1): tvar_name = re.sub(r"/attention_\d+/", "/attention_1/", six.ensure_str(name)) else: tf.logging.info("name %s does not get matched", name) continue tf.logging.info("name %s match to %s", name, tvar_name) if num_of_group > 0: group_matched = False for gid in range(1, num_of_group): if (("/group_" + str(gid) + "/" in name) or ("/ffn_" + str(gid) + "/" in name) or ("/attention_" + str(gid) + "/" in name)): 
group_matched = True tf.logging.info("%s belongs to %dth", name, gid) assignment_map[gid][tvar_name] = name if not group_matched: assignment_map[0][tvar_name] = name else: assignment_map[tvar_name] = name initialized_variable_names[name] = 1 initialized_variable_names[six.ensure_str(name) + ":0"] = 1 return (assignment_map, initialized_variable_names) def dropout(input_tensor, dropout_prob): """Perform dropout. Args: input_tensor: float Tensor. dropout_prob: Python float. The probability of dropping out a value (NOT of *keeping* a dimension as in `tf.nn.dropout`). Returns: A version of `input_tensor` with dropout applied. """ if dropout_prob is None or dropout_prob == 0.0: return input_tensor output = tf.nn.dropout(input_tensor, rate=dropout_prob) return output def layer_norm(input_tensor, name=None): """Run layer normalization on the last dimension of the tensor.""" return tf.contrib.layers.layer_norm( inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) def layer_norm_and_dropout(input_tensor, dropout_prob, name=None): """Runs layer normalization followed by dropout.""" output_tensor = layer_norm(input_tensor, name) output_tensor = dropout(output_tensor, dropout_prob) return output_tensor def create_initializer(initializer_range=0.02): """Creates a `truncated_normal_initializer` with the given range.""" return tf.truncated_normal_initializer(stddev=initializer_range) def get_timing_signal_1d_given_position(channels, position, min_timescale=1.0, max_timescale=1.0e4): """Get sinusoids of diff frequencies, with timing position given. Adapted from add_timing_signal_1d_given_position in //third_party/py/tensor2tensor/layers/common_attention.py Args: channels: scalar, size of timing embeddings to create. The number of different timescales is equal to channels / 2. 
position: a Tensor with shape [batch, seq_len] min_timescale: a float max_timescale: a float Returns: a Tensor of timing signals [batch, seq_len, channels] """ num_timescales = channels // 2 log_timescale_increment = ( math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1)) inv_timescales = min_timescale * tf.exp( tf.to_float(tf.range(num_timescales)) * -log_timescale_increment) scaled_time = ( tf.expand_dims(tf.to_float(position), 2) * tf.expand_dims( tf.expand_dims(inv_timescales, 0), 0)) signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=2) signal = tf.pad(signal, [[0, 0], [0, 0], [0, tf.mod(channels, 2)]]) return signal def embedding_lookup(input_ids, vocab_size, embedding_size=128, initializer_range=0.02, word_embedding_name="word_embeddings", use_one_hot_embeddings=False): """Looks up words embeddings for id tensor. Args: input_ids: int32 Tensor of shape [batch_size, seq_length] containing word ids. vocab_size: int. Size of the embedding vocabulary. embedding_size: int. Width of the word embeddings. initializer_range: float. Embedding initialization range. word_embedding_name: string. Name of the embedding table. use_one_hot_embeddings: bool. If True, use one-hot method for word embeddings. If False, use `tf.nn.embedding_lookup()`. Returns: float Tensor of shape [batch_size, seq_length, embedding_size]. """ # This function assumes that the input is of shape [batch_size, seq_length, # num_inputs]. # # If the input is a 2D tensor of shape [batch_size, seq_length], we # reshape to [batch_size, seq_length, 1]. 
if input_ids.shape.ndims == 2: input_ids = tf.expand_dims(input_ids, axis=[-1]) embedding_table = tf.get_variable( name=word_embedding_name, shape=[vocab_size, embedding_size], initializer=create_initializer(initializer_range)) if use_one_hot_embeddings: flat_input_ids = tf.reshape(input_ids, [-1]) one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) output = tf.matmul(one_hot_input_ids, embedding_table) else: output = tf.nn.embedding_lookup(embedding_table, input_ids) input_shape = get_shape_list(input_ids) output = tf.reshape(output, input_shape[0:-1] + [input_shape[-1] * embedding_size]) return (output, embedding_table) def embedding_postprocessor(input_tensor, use_token_type=False, token_type_ids=None, token_type_vocab_size=16, token_type_embedding_name="token_type_embeddings", use_position_embeddings=True, position_embedding_name="position_embeddings", initializer_range=0.02, max_position_embeddings=512, dropout_prob=0.1): """Performs various post-processing on a word embedding tensor. Args: input_tensor: float Tensor of shape [batch_size, seq_length, embedding_size]. use_token_type: bool. Whether to add embeddings for `token_type_ids`. token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. Must be specified if `use_token_type` is True. token_type_vocab_size: int. The vocabulary size of `token_type_ids`. token_type_embedding_name: string. The name of the embedding table variable for token type ids. use_position_embeddings: bool. Whether to add position embeddings for the position of each token in the sequence. position_embedding_name: string. The name of the embedding table variable for positional embeddings. initializer_range: float. Range of the weight initialization. max_position_embeddings: int. Maximum sequence length that might ever be used with this model. This can be longer than the sequence length of input_tensor, but cannot be shorter. dropout_prob: float. Dropout probability applied to the final output tensor. 
Returns: float tensor with same shape as `input_tensor`. Raises: ValueError: One of the tensor shapes or input values is invalid. """ input_shape = get_shape_list(input_tensor, expected_rank=3) batch_size = input_shape[0] seq_length = input_shape[1] width = input_shape[2] output = input_tensor if use_token_type: if token_type_ids is None: raise ValueError("`token_type_ids` must be specified if" "`use_token_type` is True.") token_type_table = tf.get_variable( name=token_type_embedding_name, shape=[token_type_vocab_size, width], initializer=create_initializer(initializer_range)) # This vocab will be small so we always do one-hot here, since it is always # faster for a small vocabulary. flat_token_type_ids = tf.reshape(token_type_ids, [-1]) one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) token_type_embeddings = tf.reshape(token_type_embeddings, [batch_size, seq_length, width]) output += token_type_embeddings if use_position_embeddings: assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) with tf.control_dependencies([assert_op]): full_position_embeddings = tf.get_variable( name=position_embedding_name, shape=[max_position_embeddings, width], initializer=create_initializer(initializer_range)) # Since the position embedding table is a learned variable, we create it # using a (long) sequence length `max_position_embeddings`. The actual # sequence length might be shorter than this, for faster training of # tasks that do not have long sequences. # # So `full_position_embeddings` is effectively an embedding table # for position [0, 1, 2, ..., max_position_embeddings-1], and the current # sequence has positions [0, 1, 2, ... seq_length-1], so we can just # perform a slice. 
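# --- Illustrative sketch (not part of the original module) -----------------
# The slice-and-broadcast step described in the comment above, written with
# plain Python lists instead of tensors: keep the first `seq_length` rows of
# the position table and add them to every sequence in the batch.
# `add_position_embeddings_sketch` is a hypothetical helper introduced only
# for illustration; it assumes every sequence in the batch has equal length.
def add_position_embeddings_sketch(batch, position_table):
  """Adds position_table[:seq_length] to each [seq_length, width] sequence."""
  seq_length = len(batch[0])
  if seq_length > len(position_table):
    raise ValueError("seq_length exceeds the size of the position table.")
  sliced = position_table[:seq_length]  # Rows for positions 0..seq_length-1.
  return [[[x + p for x, p in zip(token, position)]
           for token, position in zip(sequence, sliced)]
          for sequence in batch]
# --- End of sketch ----------------------------------------------------------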
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output


def dense_layer_3d(input_tensor,
                   num_attention_heads,
                   head_size,
                   initializer,
                   activation,
                   name=None):
  """A dense layer with 3D kernel.

  Args:
    input_tensor: float Tensor of shape [batch, seq_length, hidden_size].
    num_attention_heads: Number of attention heads.
    head_size: The size per attention head.
    initializer: Kernel initializer.
    activation: Activation function.
    name: The name scope of this layer.

  Returns:
    float logits Tensor.
  """

  input_shape = get_shape_list(input_tensor)
  hidden_size = input_shape[2]

  with tf.variable_scope(name):
    w = tf.get_variable(
        name="kernel",
        shape=[hidden_size, num_attention_heads * head_size],
        initializer=initializer)
    w = tf.reshape(w, [hidden_size, num_attention_heads, head_size])
    b = tf.get_variable(
        name="bias",
        shape=[num_attention_heads * head_size],
        initializer=tf.zeros_initializer)
    b = tf.reshape(b, [num_attention_heads, head_size])
    ret = tf.einsum("BFH,HND->BFND", input_tensor, w)
    ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dense_layer_3d_proj(input_tensor,
                        hidden_size,
                        head_size,
                        initializer,
                        activation,
                        name=None):
  """A dense layer with 3D kernel for projection.

  The number of attention heads is not an argument; it is read from the
  third dimension of `input_tensor`.

  Args:
    input_tensor: float Tensor of shape [batch, from_seq_length,
      num_attention_heads, size_per_head].
    hidden_size: The size of the output (hidden) dimension.
    head_size: The size per attention head.
    initializer: Kernel initializer.
    activation: Activation function.
    name: The name scope of this layer.

  Returns:
    float logits Tensor.
  """
  input_shape = get_shape_list(input_tensor)
  num_attention_heads = input_shape[2]
  with tf.variable_scope(name):
    w = tf.get_variable(
        name="kernel",
        shape=[num_attention_heads * head_size, hidden_size],
        initializer=initializer)
    w = tf.reshape(w, [num_attention_heads, head_size, hidden_size])
    b = tf.get_variable(
        name="bias", shape=[hidden_size], initializer=tf.zeros_initializer)
    ret = tf.einsum("BFND,NDH->BFH", input_tensor, w)
    ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dense_layer_2d(input_tensor,
                   output_size,
                   initializer,
                   activation,
                   num_attention_heads=1,
                   name=None,
                   num_groups=1):
  """A dense layer with 2D kernel.

  Args:
    input_tensor: Float tensor with rank 3.
    output_size: The size of output dimension.
    initializer: Kernel initializer.
    activation: Activation function.
    num_groups: Number of groups in the dense layer.
    num_attention_heads: Number of attention heads in the attention layer
      (unused).
    name: The name scope of this layer.

  Returns:
    float logits Tensor.
  """
  del num_attention_heads  # unused
  input_shape = get_shape_list(input_tensor)
  hidden_size = input_shape[2]
  if num_groups == 1:
    with tf.variable_scope(name):
      w = tf.get_variable(
          name="kernel",
          shape=[hidden_size, output_size],
          initializer=initializer)
      b = tf.get_variable(
          name="bias", shape=[output_size], initializer=tf.zeros_initializer)
      ret = tf.einsum("BFH,HO->BFO", input_tensor, w)
      ret += b
  else:
    assert hidden_size % num_groups == 0
    assert output_size % num_groups == 0
    with tf.variable_scope(name):
      w = tf.get_variable(
          name="kernel",
          shape=[hidden_size//num_groups, output_size//num_groups, num_groups],
          initializer=initializer)
      b = tf.get_variable(
          name="bias", shape=[output_size], initializer=tf.zeros_initializer)
      input_tensor = tf.reshape(
          input_tensor, input_shape[:2] + [hidden_size//num_groups, num_groups])
      ret = tf.einsum("BFHG,HOG->BFGO", input_tensor, w)
      ret = tf.reshape(ret, input_shape[:2] + [output_size])
      ret += b
  if activation is not None:
    return activation(ret)
  else:
    return ret


def dense_layer_2d_old(input_tensor,
                       output_size,
                       initializer,
                       activation,
                       num_attention_heads=1,
                       name=None,
                       num_groups=1):
  """A dense layer with 2D kernel. Adds a grouped fully connected variant.

  Args:
    input_tensor: Float tensor with rank 3,
      [batch_size, sequence_length, hidden_size].
    output_size: The size of output dimension.
    initializer: Kernel initializer.
    activation: Activation function.
    num_groups: Number of groups in the dense layer.
    num_attention_heads: Number of attention heads in the attention layer
      (unused).
    name: The name scope of this layer.

  Returns:
    float logits Tensor.
  """
  del num_attention_heads  # unused
  input_shape = get_shape_list(input_tensor)
  # print("#dense_layer_2d.1.input_shape of input_tensor:",input_shape) # e.g.
[2, 512, 768] = [ batch_size,sequence_length, hidden_size] hidden_size = input_shape[2] if num_groups == 1: with tf.variable_scope(name): w = tf.get_variable( name="kernel", shape=[hidden_size, output_size], initializer=initializer) b = tf.get_variable( name="bias", shape=[output_size], initializer=tf.zeros_initializer) ret = tf.einsum("BFH,HO->BFO", input_tensor, w) ret += b else: # e.g. input_shape = [2, 512, 768] = [ batch_size,sequence_length, hidden_size] assert hidden_size % num_groups == 0 assert output_size % num_groups == 0 # print("#dense_layer_2d.output_size:",output_size,";hidden_size:",hidden_size) # output_size = 3072; hidden_size = 768 with tf.variable_scope(name): w = tf.get_variable( name="kernel", shape=[num_groups, hidden_size//num_groups, output_size//num_groups], initializer=initializer) # print("#dense_layer_2d.2'w:",w.shape) # (16, 48, 192) b = tf.get_variable( name="bias", shape=[num_groups, output_size//num_groups], initializer=tf.zeros_initializer) # input_tensor = [ batch_size,sequence_length, hidden_size]. # input_shape[:2] + [hidden_size//num_groups, num_groups] = [batch_size, sequence_length, hidden_size/num_groups, num_groups] input_tensor = tf.reshape(input_tensor, input_shape[:2] + [hidden_size//num_groups, num_groups]) # print("#dense_layer_2d.2.input_shape of input_tensor:", input_tensor.shape) input_tensor = tf.transpose(input_tensor, [3, 0, 1, 2]) # [num_groups, batch_size, sequence_length, hidden_size/num_groups] # print("#dense_layer_2d.3.input_shape of input_tensor:", input_tensor.shape) # input_tensor=(16, 2, 512, 192) # input_tensor=[num_groups, batch_size, sequence_length, hidden_size/num_groups], w=[num_groups, hidden_size/num_groups, output_size/num_groups] ret = tf.einsum("GBFH,GHO->GBFO", input_tensor, w) # print("#dense_layer_2d.4. 
shape of ret:", ret.shape) # (16, 2, 512, 48) = [num_groups, batch_size, sequence_length ,output_size] b = tf.expand_dims(b, 1) b = tf.expand_dims(b, 1) # print("#dense_layer_2d.4.2.b:",b.shape) # (16, 1, 1, 48) ret += b ret = tf.transpose(ret, [1, 2, 0, 3]) # (2, 512, 16, 48) # print("#dense_layer_2d.5. shape of ret:", ret.shape) ret = tf.reshape(ret, input_shape[:2] + [output_size]) # [2, 512, 768] if activation is not None: return activation(ret) else: return ret def dot_product_attention(q, k, v, bias, dropout_rate=0.0): """Dot-product attention. Args: q: Tensor with shape [..., length_q, depth_k]. k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must match with q. v: Tensor with shape [..., length_kv, depth_v] Leading dimensions must match with q. bias: bias Tensor (see attention_bias()) dropout_rate: a float. Returns: Tensor with shape [..., length_q, depth_v]. """ logits = tf.matmul(q, k, transpose_b=True) # [..., length_q, length_kv] logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1]))) if bias is not None: # `attention_mask` = [B, T] from_shape = get_shape_list(q) if len(from_shape) == 4: broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32) elif len(from_shape) == 5: # from_shape = [B, N, Block_num, block_size, depth]# broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], from_shape[3], 1], tf.float32) bias = tf.matmul(broadcast_ones, tf.cast(bias, tf.float32), transpose_b=True) # Since attention_mask is 1.0 for positions we want to attend and 0.0 for # masked positions, this operation will create a tensor which is 0.0 for # positions we want to attend and -10000.0 for masked positions. adder = (1.0 - bias) * -10000.0 # Since we are adding it to the raw scores before the softmax, this is # effectively the same as removing these entirely. 
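# --- Illustrative sketch (not part of the original module) -----------------
# The additive-mask trick described in the comments above, in plain Python
# for a single row of scores: a 1/0 mask becomes a 0/-10000 "adder" that is
# summed with the raw scores, so masked positions receive (numerically) zero
# probability after the softmax. `masked_softmax_sketch` is a hypothetical
# helper introduced only for illustration.
import math


def masked_softmax_sketch(scores, mask):
  """Softmax over `scores` where positions with mask == 0 are suppressed."""
  adder = [(1.0 - m) * -10000.0 for m in mask]
  logits = [score + a for score, a in zip(scores, adder)]
  peak = max(logits)  # Subtract the max for numerical stability.
  exps = [math.exp(value - peak) for value in logits]
  total = sum(exps)
  return [e / total for e in exps]
# --- End of sketch ----------------------------------------------------------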
logits += adder else: adder = 0.0 attention_probs = tf.nn.softmax(logits, name="attention_probs") attention_probs = dropout(attention_probs, dropout_rate) return tf.matmul(attention_probs, v) def attention_layer(from_tensor, to_tensor, attention_mask=None, num_attention_heads=1, query_act=None, key_act=None, value_act=None, attention_probs_dropout_prob=0.0, initializer_range=0.02, batch_size=None, from_seq_length=None, to_seq_length=None): """Performs multi-headed attention from `from_tensor` to `to_tensor`. Args: from_tensor: float Tensor of shape [batch_size, from_seq_length, from_width]. to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. attention_mask: (optional) int32 Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1. num_attention_heads: int. Number of attention heads. query_act: (optional) Activation function for the query transform. key_act: (optional) Activation function for the key transform. value_act: (optional) Activation function for the value transform. attention_probs_dropout_prob: (optional) float. Dropout probability of the attention probabilities. initializer_range: float. Range of the weight initializer. batch_size: (Optional) int. If the input is 2D, this might be the batch size of the 3D version of the `from_tensor` and `to_tensor`. from_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `from_tensor`. to_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `to_tensor`. Returns: float Tensor of shape [batch_size, from_seq_length, num_attention_heads, size_per_head]. Raises: ValueError: Any of the arguments or tensor shapes are invalid. 
""" from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) size_per_head = int(from_shape[2]/num_attention_heads) if len(from_shape) != len(to_shape): raise ValueError( "The rank of `from_tensor` must match the rank of `to_tensor`.") if len(from_shape) == 3: batch_size = from_shape[0] from_seq_length = from_shape[1] to_seq_length = to_shape[1] elif len(from_shape) == 2: if (batch_size is None or from_seq_length is None or to_seq_length is None): raise ValueError( "When passing in rank 2 tensors to attention_layer, the values " "for `batch_size`, `from_seq_length`, and `to_seq_length` " "must all be specified.") # Scalar dimensions referenced here: # B = batch size (number of sequences) # F = `from_tensor` sequence length # T = `to_tensor` sequence length # N = `num_attention_heads` # H = `size_per_head` # `query_layer` = [B, F, N, H] q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head, create_initializer(initializer_range), query_act, "query") # `key_layer` = [B, T, N, H] k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head, create_initializer(initializer_range), key_act, "key") # `value_layer` = [B, T, N, H] v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head, create_initializer(initializer_range), value_act, "value") q = tf.transpose(q, [0, 2, 1, 3]) k = tf.transpose(k, [0, 2, 1, 3]) v = tf.transpose(v, [0, 2, 1, 3]) if attention_mask is not None: attention_mask = tf.reshape( attention_mask, [batch_size, 1, to_seq_length, 1]) # 'new_embeddings = [B, N, F, H]' new_embeddings = dot_product_attention(q, k, v, attention_mask, attention_probs_dropout_prob) return tf.transpose(new_embeddings, [0, 2, 1, 3]) def attention_ffn_block(layer_input, hidden_size=768, attention_mask=None, num_attention_heads=1, attention_head_size=64, attention_probs_dropout_prob=0.0, intermediate_size=3072, intermediate_act_fn=None, initializer_range=0.02, hidden_dropout_prob=0.0): """A 
network with attention-ffn as sub-block. Args: layer_input: float Tensor of shape [batch_size, from_seq_length, from_width]. hidden_size: (optional) int, size of hidden layer. attention_mask: (optional) int32 Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1. num_attention_heads: int. Number of attention heads. attention_head_size: int. Size of attention head. attention_probs_dropout_prob: float. dropout probability for attention_layer intermediate_size: int. Size of intermediate hidden layer. intermediate_act_fn: (optional) Activation function for the intermediate layer. initializer_range: float. Range of the weight initializer. hidden_dropout_prob: (optional) float. Dropout probability of the hidden layer. Returns: layer output """ with tf.variable_scope("attention_1"): with tf.variable_scope("self"): attention_output = attention_layer( from_tensor=layer_input, to_tensor=layer_input, attention_mask=attention_mask, num_attention_heads=num_attention_heads, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range) # Run a linear projection of `hidden_size` then add a residual # with `layer_input`. 
with tf.variable_scope("output"): attention_output = dense_layer_3d_proj( attention_output, hidden_size, attention_head_size, create_initializer(initializer_range), None, name="dense") attention_output = dropout(attention_output, hidden_dropout_prob) attention_output = layer_norm(attention_output + layer_input) with tf.variable_scope("ffn_1"): with tf.variable_scope("intermediate"): intermediate_output = dense_layer_2d( attention_output, intermediate_size, create_initializer(initializer_range), intermediate_act_fn, num_attention_heads=num_attention_heads, name="dense", num_groups=16) with tf.variable_scope("output"): ffn_output = dense_layer_2d( intermediate_output, hidden_size, create_initializer(initializer_range), None, num_attention_heads=num_attention_heads, name="dense", num_groups=16) ffn_output = dropout(ffn_output, hidden_dropout_prob) ffn_output = layer_norm(ffn_output + attention_output) return ffn_output def transformer_model(input_tensor, attention_mask=None, hidden_size=768, num_hidden_layers=12, num_hidden_groups=12, num_attention_heads=12, intermediate_size=3072, inner_group_num=1, intermediate_act_fn="gelu", hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, initializer_range=0.02, do_return_all_layers=False): """Multi-headed, multi-layer Transformer from "Attention is All You Need". This is almost an exact implementation of the original Transformer encoder. See the original paper: https://arxiv.org/abs/1706.03762 Also see: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py Args: input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size]. attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length, seq_length], with 1 for positions that can be attended to and 0 in positions that should not be. hidden_size: int. Hidden size of the Transformer. num_hidden_layers: int. Number of layers (blocks) in the Transformer. num_hidden_groups: int. 
Number of group for the hidden layers, parameters in the same group are shared. num_attention_heads: int. Number of attention heads in the Transformer. intermediate_size: int. The size of the "intermediate" (a.k.a., feed forward) layer. inner_group_num: int, number of inner repetition of attention and ffn. intermediate_act_fn: function. The non-linear activation function to apply to the output of the intermediate/feed-forward layer. hidden_dropout_prob: float. Dropout probability for the hidden layers. attention_probs_dropout_prob: float. Dropout probability of the attention probabilities. initializer_range: float. Range of the initializer (stddev of truncated normal). do_return_all_layers: Whether to also return all layers or just the final layer. Returns: float Tensor of shape [batch_size, seq_length, hidden_size], the final hidden layer of the Transformer. Raises: ValueError: A Tensor shape or parameter is invalid. """ if hidden_size % num_attention_heads != 0: raise ValueError( "The hidden size (%d) is not a multiple of the number of attention " "heads (%d)" % (hidden_size, num_attention_heads)) attention_head_size = hidden_size // num_attention_heads input_shape = get_shape_list(input_tensor, expected_rank=3) input_width = input_shape[2] all_layer_outputs = [] if input_width != hidden_size: prev_output = dense_layer_2d( input_tensor, hidden_size, create_initializer(initializer_range), None, name="embedding_hidden_mapping_in") else: prev_output = input_tensor with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE): for layer_idx in range(num_hidden_layers): group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups) with tf.variable_scope("group_%d" % group_idx): with tf.name_scope("layer_%d" % layer_idx): layer_output = prev_output for inner_group_idx in range(inner_group_num): with tf.variable_scope("inner_group_%d" % inner_group_idx): layer_output = attention_ffn_block( layer_output, hidden_size, attention_mask, num_attention_heads, 
                  attention_head_size, attention_probs_dropout_prob,
                  intermediate_size, intermediate_act_fn, initializer_range,
                  hidden_dropout_prob)
              prev_output = layer_output
              all_layer_outputs.append(layer_output)
  if do_return_all_layers:
    return all_layer_outputs
  else:
    return all_layer_outputs[-1]


def get_shape_list(tensor, expected_rank=None, name=None):
  """Returns a list of the shape of tensor, preferring static dimensions.

  Args:
    tensor: A tf.Tensor object to find the shape of.
    expected_rank: (optional) int or list of ints. The expected rank of
      `tensor`. If this is specified and the `tensor` has a different rank,
      an exception will be thrown.
    name: Optional name of the tensor for the error message.

  Returns:
    A list of dimensions of the shape of tensor. All static dimensions will
    be returned as python integers, and dynamic dimensions will be returned
    as tf.Tensor scalars.
  """
  if name is None:
    name = tensor.name

  if expected_rank is not None:
    assert_rank(tensor, expected_rank, name)

  shape = tensor.shape.as_list()

  non_static_indexes = []
  for (index, dim) in enumerate(shape):
    if dim is None:
      non_static_indexes.append(index)

  if not non_static_indexes:
    return shape

  dyn_shape = tf.shape(tensor)
  for index in non_static_indexes:
    shape[index] = dyn_shape[index]
  return shape


def reshape_to_matrix(input_tensor):
  """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
  ndims = input_tensor.shape.ndims
  if ndims < 2:
    raise ValueError("Input tensor must have at least rank 2.
Shape = %s" % (input_tensor.shape)) if ndims == 2: return input_tensor width = input_tensor.shape[-1] output_tensor = tf.reshape(input_tensor, [-1, width]) return output_tensor def reshape_from_matrix(output_tensor, orig_shape_list): """Reshapes a rank 2 tensor back to its original rank >= 2 tensor.""" if len(orig_shape_list) == 2: return output_tensor output_shape = get_shape_list(output_tensor) orig_dims = orig_shape_list[0:-1] width = output_shape[-1] return tf.reshape(output_tensor, orig_dims + [width]) def assert_rank(tensor, expected_rank, name=None): """Raises an exception if the tensor rank is not of the expected rank. Args: tensor: A tf.Tensor to check the rank of. expected_rank: Python integer or list of integers, expected rank. name: Optional name of the tensor for the error message. Raises: ValueError: If the expected shape doesn't match the actual shape. """ if name is None: name = tensor.name expected_rank_dict = {} if isinstance(expected_rank, six.integer_types): expected_rank_dict[expected_rank] = True else: for x in expected_rank: expected_rank_dict[x] = True actual_rank = tensor.shape.ndims if actual_rank not in expected_rank_dict: scope_name = tf.get_variable_scope().name raise ValueError( "For the tensor `%s` in scope `%s`, the actual rank " "`%d` (shape = %s) is not equal to the expected rank `%s`" % (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) ================================================ FILE: optimization.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Functions and classes related to optimization (weight updates).""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import re import tensorflow as tf def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): """Creates an optimizer training op.""" global_step = tf.train.get_or_create_global_step() learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) # Implements linear decay of the learning rate. learning_rate = tf.train.polynomial_decay( learning_rate, global_step, num_train_steps, end_learning_rate=0.0, power=1.0, cycle=False) # Implements linear warmup. I.e., if global_step < num_warmup_steps, the # learning rate will be `global_step/num_warmup_steps * init_lr`. if num_warmup_steps: global_steps_int = tf.cast(global_step, tf.int32) warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) global_steps_float = tf.cast(global_steps_int, tf.float32) warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) warmup_percent_done = global_steps_float / warmup_steps_float warmup_learning_rate = init_lr * warmup_percent_done is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) learning_rate = ( (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) # It is recommended that you use this optimizer for fine tuning, since this # is how the model was trained (note that the Adam m/v variables are NOT # loaded from init_checkpoint.) 
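# --- Illustrative sketch (not part of the original module) -----------------
# The learning-rate schedule assembled earlier in this function, restated in
# plain Python under the assumption that `tf.train.polynomial_decay` with
# power=1.0, end_learning_rate=0.0 and cycle=False clamps the step at
# `num_train_steps`. During warmup the rate grows linearly to `init_lr`;
# afterwards it follows the linear decay, which is measured from step 0 and
# not from the end of warmup. `lr_at_step_sketch` is a hypothetical helper
# introduced only for illustration.
def lr_at_step_sketch(step, init_lr, num_train_steps, num_warmup_steps):
  """Returns the learning rate that the schedule yields at `step`."""
  if num_warmup_steps and step < num_warmup_steps:
    return init_lr * float(step) / float(num_warmup_steps)
  clamped_step = min(step, num_train_steps)
  return init_lr * (1.0 - float(clamped_step) / float(num_train_steps))
# --- End of sketch ----------------------------------------------------------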
  optimizer = LAMBOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

  if use_tpu:
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)

  tvars = tf.trainable_variables()
  grads = tf.gradients(loss, tvars)

  # This is how the model was pre-trained.
  (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)

  train_op = optimizer.apply_gradients(
      zip(grads, tvars), global_step=global_step)

  # Normally the global step update is done inside of `apply_gradients`.
  # However, `LAMBOptimizer` (like `AdamWeightDecayOptimizer`) doesn't do
  # this. But if you use a different optimizer, you should probably take
  # this line out.
  new_global_step = global_step + 1
  train_op = tf.group(train_op, [global_step.assign(new_global_step)])
  return train_op


class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """A basic Adam optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    """Constructs an AdamWeightDecayOptimizer."""
    super(AdamWeightDecayOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """See base class."""
    assignments = []
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=param_name + "/adam_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=param_name + "/adam_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      update_with_lr = self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name


class LAMBOptimizer(tf.train.Optimizer):
  """
  LAMB optimizer.
  https://github.com/ymcui/LAMB_Optimizer_TF

  # IMPORTANT NOTE
  - This is NOT an official implementation.
  - The LAMB optimizer changed between arXiv v1 and v3.
  - We implement the v3 version (the latest version as of June 2019).
  - Our implementation is based on `AdamWeightDecayOptimizer` in BERT (provided by Google).

  # References
  - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
    https://arxiv.org/abs/1904.00962v3
  - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
    https://arxiv.org/abs/1810.04805

  # Parameters
  - There is nothing special, just the same as `AdamWeightDecayOptimizer`.
  """

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.01,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """See base class."""
    assignments = []
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=param_name + "/lamb_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=param_name + "/lamb_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      ############## BELOW ARE THE SPECIFIC PARTS FOR LAMB ##############

      # Note: there are two choices for the scaling function \phi(z):
      #   minmax:   \phi(z) = min(max(z, \gamma_l), \gamma_u)
      #   identity: \phi(z) = z
      # The authors do not specify \gamma_l and \gamma_u.
      # UPDATE: after asking the authors, they provided the code below.
      # ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
      #   math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)

      r1 = tf.sqrt(tf.reduce_sum(tf.square(param)))
      r2 = tf.sqrt(tf.reduce_sum(tf.square(update)))

      r = tf.where(tf.greater(r1, 0.0),
                   tf.where(tf.greater(r2, 0.0),
                            r1 / r2,
                            1.0),
                   1.0)

      eta = self.learning_rate * r

      update_with_lr = eta * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name


================================================
FILE: optimization_finetuning.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Functions and classes related to optimization (weight updates).""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import re import tensorflow as tf def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): """Creates an optimizer training op.""" global_step = tf.train.get_or_create_global_step() learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) # Implements linear decay of the learning rate. learning_rate = tf.train.polynomial_decay( learning_rate, global_step, num_train_steps, end_learning_rate=0.0, power=1.0, cycle=False) # Implements linear warmup. I.e., if global_step < num_warmup_steps, the # learning rate will be `global_step/num_warmup_steps * init_lr`. if num_warmup_steps: global_steps_int = tf.cast(global_step, tf.int32) warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) global_steps_float = tf.cast(global_steps_int, tf.float32) warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) warmup_percent_done = global_steps_float / warmup_steps_float warmup_learning_rate = init_lr * warmup_percent_done is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) learning_rate = ( (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) # It is recommended that you use this optimizer for fine tuning, since this # is how the model was trained (note that the Adam m/v variables are NOT # loaded from init_checkpoint.) 
optimizer = AdamWeightDecayOptimizer( learning_rate=learning_rate, weight_decay_rate=0.01, beta_1=0.9, beta_2=0.999, # 0.98 ONLY USED FOR PRETRAIN. MUST CHANGE AT FINE-TUNING 0.999, epsilon=1e-6, exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) if use_tpu: optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer) tvars = tf.trainable_variables() grads = tf.gradients(loss, tvars) # This is how the model was pre-trained. (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) train_op = optimizer.apply_gradients( zip(grads, tvars), global_step=global_step) # Normally the global step update is done inside of `apply_gradients`. # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use # a different optimizer, you should probably take this line out. new_global_step = global_step + 1 train_op = tf.group(train_op, [global_step.assign(new_global_step)]) return train_op class AdamWeightDecayOptimizer(tf.train.Optimizer): """A basic Adam optimizer that includes "correct" L2 weight decay.""" def __init__(self, learning_rate, weight_decay_rate=0.0, beta_1=0.9, beta_2=0.999, epsilon=1e-6, exclude_from_weight_decay=None, name="AdamWeightDecayOptimizer"): """Constructs a AdamWeightDecayOptimizer.""" super(AdamWeightDecayOptimizer, self).__init__(False, name) self.learning_rate = learning_rate self.weight_decay_rate = weight_decay_rate self.beta_1 = beta_1 self.beta_2 = beta_2 self.epsilon = epsilon self.exclude_from_weight_decay = exclude_from_weight_decay def apply_gradients(self, grads_and_vars, global_step=None, name=None): """See base class.""" assignments = [] for (grad, param) in grads_and_vars: if grad is None or param is None: continue param_name = self._get_variable_name(param.name) m = tf.get_variable( name=param_name + "/adam_m", shape=param.shape.as_list(), dtype=tf.float32, trainable=False, initializer=tf.zeros_initializer()) v = tf.get_variable( name=param_name + "/adam_v", shape=param.shape.as_list(), dtype=tf.float32, trainable=False, 
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      update_with_lr = self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name


================================================
FILE: optimization_google.py
================================================
# coding=utf-8
# Copyright 2019 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 """Functions and classes related to optimization (weight updates).""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import re import six from six.moves import zip import tensorflow as tf import lamb_optimizer_google as lamb_optimizer def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu, optimizer="adamw", poly_power=1.0, start_warmup_step=0): """Creates an optimizer training op.""" global_step = tf.train.get_or_create_global_step() learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) # Implements linear decay of the learning rate. learning_rate = tf.train.polynomial_decay( learning_rate, global_step, num_train_steps, end_learning_rate=0.0, power=poly_power, cycle=False) # Implements linear warmup. I.e., if global_step - start_warmup_step < # num_warmup_steps, the learning rate will be # `(global_step - start_warmup_step)/num_warmup_steps * init_lr`. 
  if num_warmup_steps:
    tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step)
                    + ", for " + str(num_warmup_steps) + " steps ++++++")
    global_steps_int = tf.cast(global_step, tf.int32)
    start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32)
    global_steps_int = global_steps_int - start_warm_int
    warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

    global_steps_float = tf.cast(global_steps_int, tf.float32)
    warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

    warmup_percent_done = global_steps_float / warmup_steps_float
    warmup_learning_rate = init_lr * warmup_percent_done

    is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
    learning_rate = (
        (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

  # It is OK to use this optimizer for finetuning, since this
  # is how the model was trained (note that the Adam m/v variables are NOT
  # loaded from init_checkpoint.)
  # It is OK to use AdamW for finetuning even if the model was trained with LAMB.
  # As reported in the BERT public GitHub repository, the learning rate for
  # SQuAD 1.1 finetuning is 3e-5, 4e-5 or 5e-5. For LAMB, users can use 3e-4,
  # 4e-4, or 5e-4 for a batch size of 64 in finetuning.
  if optimizer == "adamw":
    tf.logging.info("using adamw")
    optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
  elif optimizer == "lamb":
    tf.logging.info("using lamb")
    optimizer = lamb_optimizer.LAMBOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
  else:
    raise ValueError("Not supported optimizer: ", optimizer)

  if use_tpu:
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)

  tvars = tf.trainable_variables()
  grads = tf.gradients(loss, tvars)

  # This is how the model was pre-trained.
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) train_op = optimizer.apply_gradients( list(zip(grads, tvars)), global_step=global_step) # Normally the global step update is done inside of `apply_gradients`. # However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` do this. # But if you use a different optimizer, you should probably take this line # out. new_global_step = global_step + 1 train_op = tf.group(train_op, [global_step.assign(new_global_step)]) return train_op class AdamWeightDecayOptimizer(tf.train.Optimizer): """A basic Adam optimizer that includes "correct" L2 weight decay.""" def __init__(self, learning_rate, weight_decay_rate=0.0, beta_1=0.9, beta_2=0.999, epsilon=1e-6, exclude_from_weight_decay=None, name="AdamWeightDecayOptimizer"): """Constructs a AdamWeightDecayOptimizer.""" super(AdamWeightDecayOptimizer, self).__init__(False, name) self.learning_rate = learning_rate self.weight_decay_rate = weight_decay_rate self.beta_1 = beta_1 self.beta_2 = beta_2 self.epsilon = epsilon self.exclude_from_weight_decay = exclude_from_weight_decay def apply_gradients(self, grads_and_vars, global_step=None, name=None): """See base class.""" assignments = [] for (grad, param) in grads_and_vars: if grad is None or param is None: continue param_name = self._get_variable_name(param.name) m = tf.get_variable( name=six.ensure_str(param_name) + "/adam_m", shape=param.shape.as_list(), dtype=tf.float32, trainable=False, initializer=tf.zeros_initializer()) v = tf.get_variable( name=six.ensure_str(param_name) + "/adam_v", shape=param.shape.as_list(), dtype=tf.float32, trainable=False, initializer=tf.zeros_initializer()) # Standard Adam update. 
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      update_with_lr = self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
    if m is not None:
      param_name = m.group(1)
    return param_name


================================================
FILE: resources/create_pretraining_data_roberta.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Create masked LM/next sentence masked_lm TF examples for BERT.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import random import re import tokenization import tensorflow as tf import jieba flags = tf.flags FLAGS = flags.FLAGS flags.DEFINE_string("input_file", None, "Input raw text file (or comma-separated list of files).") flags.DEFINE_string( "output_file", None, "Output TF example file (or comma-separated list of files).") flags.DEFINE_string("vocab_file", None, "The vocabulary file that the BERT model was trained on.") flags.DEFINE_bool( "do_lower_case", True, "Whether to lower case the input text. 
Should be True for uncased " "models and False for cased models.") flags.DEFINE_bool( "do_whole_word_mask", False, "Whether to use whole word masking rather than per-WordPiece masking.") flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.") flags.DEFINE_integer("max_predictions_per_seq", 20, "Maximum number of masked LM predictions per sequence.") flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.") flags.DEFINE_integer( "dupe_factor", 10, "Number of times to duplicate the input data (with different masks).") flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.") flags.DEFINE_float( "short_seq_prob", 0.1, "Probability of creating sequences which are shorter than the " "maximum length.") class TrainingInstance(object): """A single training instance (sentence pair).""" def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next): self.tokens = tokens self.segment_ids = segment_ids self.is_random_next = is_random_next self.masked_lm_positions = masked_lm_positions self.masked_lm_labels = masked_lm_labels def __str__(self): s = "" s += "tokens: %s\n" % (" ".join( [tokenization.printable_text(x) for x in self.tokens])) s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids])) s += "is_random_next: %s\n" % self.is_random_next s += "masked_lm_positions: %s\n" % (" ".join( [str(x) for x in self.masked_lm_positions])) s += "masked_lm_labels: %s\n" % (" ".join( [tokenization.printable_text(x) for x in self.masked_lm_labels])) s += "\n" return s def __repr__(self): return self.__str__() def write_instance_to_example_files(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_files): """Create TF example files from `TrainingInstance`s.""" writers = [] for output_file in output_files: writers.append(tf.python_io.TFRecordWriter(output_file)) writer_index = 0 total_written = 0 for (inst_index, instance) in enumerate(instances): input_ids = 
tokenizer.convert_tokens_to_ids(instance.tokens) input_mask = [1] * len(input_ids) segment_ids = list(instance.segment_ids) assert len(input_ids) <= max_seq_length while len(input_ids) < max_seq_length: input_ids.append(0) input_mask.append(0) segment_ids.append(0) assert len(input_ids) == max_seq_length assert len(input_mask) == max_seq_length # print("length of segment_ids:",len(segment_ids),"max_seq_length:", max_seq_length) assert len(segment_ids) == max_seq_length masked_lm_positions = list(instance.masked_lm_positions) masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) masked_lm_weights = [1.0] * len(masked_lm_ids) while len(masked_lm_positions) < max_predictions_per_seq: masked_lm_positions.append(0) masked_lm_ids.append(0) masked_lm_weights.append(0.0) next_sentence_label = 1 if instance.is_random_next else 0 features = collections.OrderedDict() features["input_ids"] = create_int_feature(input_ids) features["input_mask"] = create_int_feature(input_mask) features["segment_ids"] = create_int_feature(segment_ids) features["masked_lm_positions"] = create_int_feature(masked_lm_positions) features["masked_lm_ids"] = create_int_feature(masked_lm_ids) features["masked_lm_weights"] = create_float_feature(masked_lm_weights) features["next_sentence_labels"] = create_int_feature([next_sentence_label]) tf_example = tf.train.Example(features=tf.train.Features(feature=features)) writers[writer_index].write(tf_example.SerializeToString()) writer_index = (writer_index + 1) % len(writers) total_written += 1 if inst_index < 20: tf.logging.info("*** Example ***") tf.logging.info("tokens: %s" % " ".join( [tokenization.printable_text(x) for x in instance.tokens])) for feature_name in features.keys(): feature = features[feature_name] values = [] if feature.int64_list.value: values = feature.int64_list.value elif feature.float_list.value: values = feature.float_list.value tf.logging.info( "%s: %s" % (feature_name, " ".join([str(x) for x in values]))) for 
writer in writers:
    writer.close()

  tf.logging.info("Wrote %d total instances", total_written)


def create_int_feature(values):
  feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return feature


def create_float_feature(values):
  feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
  return feature


def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]

  # Input file format:
  # (1) One sentence per line. These should ideally be actual sentences, not
  # entire paragraphs or arbitrary spans of text. (Because we use the
  # sentence boundaries for the "next sentence prediction" task).
  # (2) Blank lines between documents. Document boundaries are needed so
  # that the "next sentence prediction" task doesn't span between documents.
  print("create_training_instances.started...")
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline().replace("",""))# .replace("”","")) # strip the marker and the ” character from the line
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  print("create_training_instances.ended...")
  return instances


def _is_chinese_char(cp):
  """Checks whether CP is the codepoint of a CJK character."""
  # This defines a "chinese character" as
  # anything in the CJK Unicode block:
  #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
  #
  # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
  # despite its name. The modern Korean Hangul alphabet is a different block,
  # as is Japanese Hiragana and Katakana. Those alphabets are used to write
  # space-separated words, so they are not treated specially and are handled
  # like all of the other languages.
  if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
      (cp >= 0x3400 and cp <= 0x4DBF) or  #
      (cp >= 0x20000 and cp <= 0x2A6DF) or  #
      (cp >= 0x2A700 and cp <= 0x2B73F) or  #
      (cp >= 0x2B740 and cp <= 0x2B81F) or  #
      (cp >= 0x2B820 and cp <= 0x2CEAF) or
      (cp >= 0xF900 and cp <= 0xFAFF) or  #
      (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
    return True

  return False


def get_new_segment(segment):  # newly added function ####
  """
  Takes a sentence and returns a processed sentence.
  To support Chinese whole-word masking, the characters of a word that was split
  apart are tagged with a special marker ("##"), so that later processing steps
  know which characters belong to the same word.
  :param segment: a sentence
  :return: the processed sentence
  """
  seq_cws = jieba.lcut("".join(segment))
  seq_cws_dict = {x: 1 for x in seq_cws}
  new_segment = []
  i = 0
  while i < len(segment):
    if len(re.findall('[\u4E00-\u9FA5]', segment[i]))==0:  # not Chinese: append the token unchanged
      new_segment.append(segment[i])
      i += 1
      continue

    has_add = False
    for length in range(3,0,-1):
      if i+length>len(segment):
        continue
      if ''.join(segment[i:i+length]) in seq_cws_dict:
        new_segment.append(segment[i])
        for l in range(1, length):
          new_segment.append('##' + segment[i+l])
        i += length
        has_add = True
        break
    if not has_add:
      new_segment.append(segment[i])
      i += 1
  return new_segment


def get_raw_instance(document,max_sequence_length):  # newly added function. TODO need check again to ensure full use of data
  """
  Builds preliminary training instances: splits the whole passage into parts
  according to max_sequence_length and returns them as a list of processed instances.
  :param document: a whole passage
  :param max_sequence_length:
  :return: a list.
  each element is a sequence of text
  """
  max_sequence_length_allowed=max_sequence_length-2
  document = [seq for seq in document if len(seq)max_sequence_length_allowed/2: # /2
    result_list.append(curr_seq)

  # # Compute how many parts can be obtained in total
  # num_instance=int(len(big_list)/max_sequence_length_allowed)+1
  # print("num_instance:",num_instance)
  # # Split into parts and append them to the list
  # result_list=[]
  # for j in range(num_instance):
  #   index=j*max_sequence_length_allowed
  #   end_index=index+max_sequence_length_allowed if j!=num_instance-1 else -1
  #   result_list.append(big_list[index:end_index])
  return result_list


def create_instances_from_document(  # newly added function
    # Goal: follow RoBERTa's DOC-SENTENCES approach and drop the NSP task: take text contiguously from one document until the maximum length is reached. If text is taken from the next document, add a separator.
    # A document is a whole passage containing several sentences; each sentence is called a segment.
    # Given a document (a whole passage), generate some instances.
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.

  #target_seq_length = max_num_tokens
  #if rng.random() < short_seq_prob:
  #  target_seq_length = rng.randint(2, max_num_tokens)

  instances = []
  raw_text_list_list=get_raw_instance(document, max_seq_length) # A document is a whole passage containing several sentences; each sentence is called a segment.
  for j, raw_text_list in enumerate(raw_text_list_list):
    ####################################################################################################################
    raw_text_list = get_new_segment(raw_text_list) # Chinese whole-word masking based on word segmentation: add "##" where needed
    # 1. Set tokens and segment_ids
    is_random_next=True # this will not be used, so its value doesn't matter
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in raw_text_list:
      tokens.append(token)
      segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)
    ################################################################################################################
    # 2. Call the original method
    (tokens, masked_lm_positions,
     masked_lm_labels) = create_masked_lm_predictions(
         tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
    instance = TrainingInstance(
        tokens=tokens,
        segment_ids=segment_ids,
        is_random_next=is_random_next,
        masked_lm_positions=masked_lm_positions,
        masked_lm_labels=masked_lm_labels)
    instances.append(instance)

  return instances


def create_instances_from_document_original(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:
    target_seq_length = rng.randint(2, max_num_tokens)

  # We DON'T just concatenate all of the tokens from a document into a long
  # sequence and choose an arbitrary split point because this would make the
  # next sentence prediction task too easy. Instead, we split the input into
  # segments "A" and "B" based on the actual "sentences" provided by the user
  # input.
  instances = []
  current_chunk = []
  current_length = 0
  i = 0
  print("document_index:",document_index,"document:",type(document)," ;document:",document)  # A document is a whole passage containing multiple sentences; each sentence is called a segment.
  while i < len(document):
    segment = document[i]  # Take one part (possibly a whole paragraph).
    print("i:",i," ;segment:",segment)
    ####################################################################################################################
    segment = get_new_segment(segment)  # Whole-word masking for word-segmented Chinese: add "##" where needed.
    ###################################################################################################################
    current_chunk.append(segment)
    current_length += len(segment)
    print("#####condition:",i == len(document) - 1 or current_length >= target_seq_length)
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        a_end = 1
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)

        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Random next
        is_random_next = False
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          # This should rarely go for more than one iteration for large
          # corpora. However, just to be careful, we try to make sure that
          # the random document is not the same as the document
          # we're processing.
for _ in range(10): random_document_index = rng.randint(0, len(all_documents) - 1) if random_document_index != document_index: break random_document = all_documents[random_document_index] random_start = rng.randint(0, len(random_document) - 1) for j in range(random_start, len(random_document)): tokens_b.extend(random_document[j]) if len(tokens_b) >= target_b_length: break # We didn't actually use these segments so we "put them back" so # they don't go to waste. num_unused_segments = len(current_chunk) - a_end i -= num_unused_segments # Actual next else: is_random_next = False for j in range(a_end, len(current_chunk)): tokens_b.extend(current_chunk[j]) truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) assert len(tokens_a) >= 1 assert len(tokens_b) >= 1 tokens = [] segment_ids = [] tokens.append("[CLS]") segment_ids.append(0) for token in tokens_a: tokens.append(token) segment_ids.append(0) tokens.append("[SEP]") segment_ids.append(0) for token in tokens_b: tokens.append(token) segment_ids.append(1) tokens.append("[SEP]") segment_ids.append(1) (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng) instance = TrainingInstance( tokens=tokens, segment_ids=segment_ids, is_random_next=is_random_next, masked_lm_positions=masked_lm_positions, masked_lm_labels=masked_lm_labels) instances.append(instance) current_chunk = [] current_length = 0 i += 1 return instances MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): """Creates the predictions for the masked LM objective.""" cand_indexes = [] for (i, token) in enumerate(tokens): if token == "[CLS]" or token == "[SEP]": continue # Whole Word Masking means that if we mask all of the wordpieces # corresponding to an original word. 
    # When a word has been split into
    # WordPieces, the first token does not have any marker and any subsequent
    # tokens are prefixed with ##. So whenever we see the ## token, we
    # append it to the previous set of word indexes.
    #
    # Note that Whole Word Masking does *not* change the training code
    # at all -- we still predict each WordPiece independently, softmaxed
    # over the entire vocabulary.
    if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
        token.startswith("##")):
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])

  rng.shuffle(cand_indexes)

  output_tokens = [t[2:] if len(re.findall('##[\u4E00-\u9FA5]', t))>0 else t for t in tokens]  # Strip the "##" prefix from Chinese characters.

  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index_set in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    # If adding a whole-word mask would exceed the maximum number of
    # predictions, then just skip this candidate.
    if len(masked_lms) + len(index_set) > num_to_predict:
      continue
    is_any_index_covered = False
    for index in index_set:
      if index in covered_indexes:
        is_any_index_covered = True
        break
    if is_any_index_covered:
      continue
    for index in index_set:
      covered_indexes.add(index)

      masked_token = None
      # 80% of the time, replace with [MASK]
      if rng.random() < 0.8:
        masked_token = "[MASK]"
      else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
          masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index]))>0 else tokens[index]  # Strip the "##" prefix.
        # 10% of the time, replace with random word
        else:
          masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

      output_tokens[index] = masked_token

      masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
  assert len(masked_lms) <= num_to_predict
  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  #
tf.logging.info('%s' % (tokens)) # tf.logging.info('%s' % (output_tokens)) return (output_tokens, masked_lm_positions, masked_lm_labels) def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): """Truncates a pair of sequences to a maximum sequence length.""" while True: total_length = len(tokens_a) + len(tokens_b) if total_length <= max_num_tokens: break trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b assert len(trunc_tokens) >= 1 # We want to sometimes truncate from the front and sometimes from the # back to add more randomness and avoid biases. if rng.random() < 0.5: del trunc_tokens[0] else: trunc_tokens.pop() def main(_): tf.logging.set_verbosity(tf.logging.INFO) tokenizer = tokenization.FullTokenizer( vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) input_files = [] for input_pattern in FLAGS.input_file.split(","): input_files.extend(tf.gfile.Glob(input_pattern)) tf.logging.info("*** Reading from input files ***") for input_file in input_files: tf.logging.info(" %s", input_file) rng = random.Random(FLAGS.random_seed) instances = create_training_instances( input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, rng) output_files = FLAGS.output_file.split(",") tf.logging.info("*** Writing to output files ***") for output_file in output_files: tf.logging.info(" %s", output_file) write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length, FLAGS.max_predictions_per_seq, output_files) if __name__ == "__main__": flags.mark_flag_as_required("input_file") flags.mark_flag_as_required("output_file") flags.mark_flag_as_required("vocab_file") tf.app.run() ================================================ FILE: resources/shell_scripts/create_pretrain_data_batch_webtext.sh ================================================ #!/usr/bin/env bash echo $1,$2 BERT_BASE_DIR=./bert_config for((i=$1;i<=$2;i++)); do python3 
create_pretraining_data.py --do_whole_word_mask=True --input_file=gs://raw_text/web_text_zh_raw/web_text_zh_$i.txt \ --output_file=gs://albert_zh/tf_records/tf_web_text_zh_$i.tfrecord --vocab_file=$BERT_BASE_DIR/vocab.txt --do_lower_case=True \ --max_seq_length=512 --max_predictions_per_seq=76 --masked_lm_prob=0.15 done ================================================ FILE: run_classifier.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """BERT finetuning runner.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import csv import os import modeling import optimization_finetuning as optimization import tokenization import tensorflow as tf # from loss import bi_tempered_logistic_loss flags = tf.flags FLAGS = flags.FLAGS ## Required parameters flags.DEFINE_string( "data_dir", None, "The input data dir. Should contain the .tsv files (or other data files) " "for the task.") flags.DEFINE_string( "bert_config_file", None, "The config json file corresponding to the pre-trained BERT model. 
" "This specifies the model architecture.") flags.DEFINE_string("task_name", None, "The name of the task to train.") flags.DEFINE_string("vocab_file", None, "The vocabulary file that the BERT model was trained on.") flags.DEFINE_string( "output_dir", None, "The output directory where the model checkpoints will be written.") ## Other parameters flags.DEFINE_string( "init_checkpoint", None, "Initial checkpoint (usually from a pre-trained BERT model).") flags.DEFINE_bool( "do_lower_case", True, "Whether to lower case the input text. Should be True for uncased " "models and False for cased models.") flags.DEFINE_integer( "max_seq_length", 128, "The maximum total input sequence length after WordPiece tokenization. " "Sequences longer than this will be truncated, and sequences shorter " "than this will be padded.") flags.DEFINE_bool("do_train", False, "Whether to run training.") flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") flags.DEFINE_bool( "do_predict", False, "Whether to run the model in inference mode on the test set.") flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") flags.DEFINE_float("num_train_epochs", 3.0, "Total number of training epochs to perform.") flags.DEFINE_float( "warmup_proportion", 0.1, "Proportion of training to perform linear learning rate warmup for. " "E.g., 0.1 = 10% of training.") flags.DEFINE_integer("save_checkpoints_steps", 1000, "How often to save the model checkpoint.") flags.DEFINE_integer("iterations_per_loop", 1000, "How many steps to make in each estimator call.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") tf.flags.DEFINE_string( "tpu_name", None, "The Cloud TPU to use for training. 
This should be either the name "
    "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
    "url.")

tf.flags.DEFINE_string(
    "tpu_zone", None,
    "[Optional] GCE zone where the Cloud TPU is located. If not "
    "specified, we will attempt to automatically detect the GCE project from "
    "metadata.")

tf.flags.DEFINE_string(
    "gcp_project", None,
    "[Optional] Project name for the Cloud TPU-enabled project. If not "
    "specified, we will attempt to automatically detect the GCE project from "
    "metadata.")

tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")

flags.DEFINE_integer(
    "num_tpu_cores", 8,
    "Only used if `use_tpu` is True. Total number of TPU cores to use.")


class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs an InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Must only be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label


class PaddingInputExample(object):
  """Fake example so the num input examples is a multiple of the batch size.

  When running eval/predict on the TPU, we need to pad the number of examples
  to be a multiple of the batch size, because the TPU requires a fixed batch
  size. The alternative is to drop the last batch, which is bad because it means
  the entire output data won't be generated.

  We use this class instead of `None` because treating `None` as padding
  batches could cause silent errors.
""" class InputFeatures(object): """A single set of features of data.""" def __init__(self, input_ids, input_mask, segment_ids, label_id, is_real_example=True): self.input_ids = input_ids self.input_mask = input_mask self.segment_ids = segment_ids self.label_id = label_id self.is_real_example = is_real_example class DataProcessor(object): """Base class for data converters for sequence classification data sets.""" def get_train_examples(self, data_dir): """Gets a collection of `InputExample`s for the train set.""" raise NotImplementedError() def get_dev_examples(self, data_dir): """Gets a collection of `InputExample`s for the dev set.""" raise NotImplementedError() def get_test_examples(self, data_dir): """Gets a collection of `InputExample`s for prediction.""" raise NotImplementedError() def get_labels(self): """Gets the list of labels for this data set.""" raise NotImplementedError() @classmethod def _read_tsv(cls, input_file, quotechar=None): """Reads a tab separated value file.""" with tf.gfile.Open(input_file, "r") as f: reader = csv.reader(f, delimiter="\t", quotechar=quotechar) lines = [] for line in reader: lines.append(line) return lines def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer): """Converts a single `InputExample` into a single `InputFeatures`.""" if isinstance(example, PaddingInputExample): return InputFeatures( input_ids=[0] * max_seq_length, input_mask=[0] * max_seq_length, segment_ids=[0] * max_seq_length, label_id=0, is_real_example=False) label_map = {} for (i, label) in enumerate(label_list): label_map[label] = i tokens_a = tokenizer.tokenize(example.text_a) tokens_b = None if example.text_b: tokens_b = tokenizer.tokenize(example.text_b) if tokens_b: # Modifies `tokens_a` and `tokens_b` in place so that the total # length is less than the specified length. 
# Account for [CLS], [SEP], [SEP] with "- 3" _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) else: # Account for [CLS] and [SEP] with "- 2" if len(tokens_a) > max_seq_length - 2: tokens_a = tokens_a[0:(max_seq_length - 2)] # The convention in BERT is: # (a) For sequence pairs: # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 # (b) For single sequences: # tokens: [CLS] the dog is hairy . [SEP] # type_ids: 0 0 0 0 0 0 0 # # Where "type_ids" are used to indicate whether this is the first # sequence or the second sequence. The embedding vectors for `type=0` and # `type=1` were learned during pre-training and are added to the wordpiece # embedding vector (and position vector). This is not *strictly* necessary # since the [SEP] token unambiguously separates the sequences, but it makes # it easier for the model to learn the concept of sequences. # # For classification tasks, the first vector (corresponding to [CLS]) is # used as the "sentence vector". Note that this only makes sense because # the entire model is fine-tuned. tokens = [] segment_ids = [] tokens.append("[CLS]") segment_ids.append(0) for token in tokens_a: tokens.append(token) segment_ids.append(0) tokens.append("[SEP]") segment_ids.append(0) if tokens_b: for token in tokens_b: tokens.append(token) segment_ids.append(1) tokens.append("[SEP]") segment_ids.append(1) input_ids = tokenizer.convert_tokens_to_ids(tokens) # The mask has 1 for real tokens and 0 for padding tokens. Only real # tokens are attended to. input_mask = [1] * len(input_ids) # Zero-pad up to the sequence length. 
while len(input_ids) < max_seq_length: input_ids.append(0) input_mask.append(0) segment_ids.append(0) assert len(input_ids) == max_seq_length assert len(input_mask) == max_seq_length assert len(segment_ids) == max_seq_length label_id = label_map[example.label] if ex_index < 5: tf.logging.info("*** Example ***") tf.logging.info("guid: %s" % (example.guid)) tf.logging.info("tokens: %s" % " ".join( [tokenization.printable_text(x) for x in tokens])) tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) feature = InputFeatures( input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id, is_real_example=True) return feature def file_based_convert_examples_to_features( examples, label_list, max_seq_length, tokenizer, output_file): """Convert a set of `InputExample`s to a TFRecord file.""" writer = tf.python_io.TFRecordWriter(output_file) for (ex_index, example) in enumerate(examples): if ex_index % 10000 == 0: tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer) def create_int_feature(values): f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) return f features = collections.OrderedDict() features["input_ids"] = create_int_feature(feature.input_ids) features["input_mask"] = create_int_feature(feature.input_mask) features["segment_ids"] = create_int_feature(feature.segment_ids) features["label_ids"] = create_int_feature([feature.label_id]) features["is_real_example"] = create_int_feature( [int(feature.is_real_example)]) tf_example = tf.train.Example(features=tf.train.Features(feature=features)) writer.write(tf_example.SerializeToString()) writer.close() def 
file_based_input_fn_builder(input_file, seq_length, is_training, drop_remainder): """Creates an `input_fn` closure to be passed to TPUEstimator.""" name_to_features = { "input_ids": tf.FixedLenFeature([seq_length], tf.int64), "input_mask": tf.FixedLenFeature([seq_length], tf.int64), "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), "label_ids": tf.FixedLenFeature([], tf.int64), "is_real_example": tf.FixedLenFeature([], tf.int64), } def _decode_record(record, name_to_features): """Decodes a record to a TensorFlow example.""" example = tf.parse_single_example(record, name_to_features) # tf.Example only supports tf.int64, but the TPU only supports tf.int32. # So cast all int64 to int32. for name in list(example.keys()): t = example[name] if t.dtype == tf.int64: t = tf.to_int32(t) example[name] = t return example def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] # For training, we want a lot of parallel reading and shuffling. # For eval, we want no shuffling and parallel reading doesn't matter. d = tf.data.TFRecordDataset(input_file) if is_training: d = d.repeat() d = d.shuffle(buffer_size=100) d = d.apply( tf.contrib.data.map_and_batch( lambda record: _decode_record(record, name_to_features), batch_size=batch_size, drop_remainder=drop_remainder)) return d return input_fn def _truncate_seq_pair(tokens_a, tokens_b, max_length): """Truncates a sequence pair in place to the maximum length.""" # This is a simple heuristic which will always truncate the longer sequence # one token at a time. This makes more sense than truncating an equal percent # of tokens from each, since if one sequence is very short then each token # that's truncated likely contains more information than a longer sequence. 
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_length:
      break
    if len(tokens_a) > len(tokens_b):
      tokens_a.pop()
    else:
      tokens_b.pop()


def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  # In the demo, we are doing a simple classification task on the entire
  # segment.
  #
  # If you want to use the token-level output, use model.get_sequence_output()
  # instead.
  output_layer = model.get_pooled_output()

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    ln_type = bert_config.ln_type
    if ln_type == 'preln':
      # Added by brightmart, 10-06: with pre-LN we need an additional layer
      # normalization on the output, as suggested in the paper "On Layer
      # Normalization in the Transformer Architecture".
      print("ln_type is preln. 
add LN layer.") output_layer=layer_norm(output_layer) else: print("ln_type is postln or other,do nothing.") if is_training: # I.e., 0.1 dropout output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) logits = tf.matmul(output_layer, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) probabilities = tf.nn.softmax(logits, axis=-1) log_probs = tf.nn.log_softmax(logits, axis=-1) one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) # todo 08-29 try temp-loss ###############bi_tempered_logistic_loss############################################################################ # print("##cross entropy loss is used...."); tf.logging.info("##cross entropy loss is used....") # t1=0.9 #t1=0.90 # t2=1.05 #t2=1.05 # per_example_loss=bi_tempered_logistic_loss(log_probs,one_hot_labels,t1,t2,label_smoothing=0.1,num_iters=5) # TODO label_smoothing=0.0 #tf.logging.info("per_example_loss:"+str(per_example_loss.shape)) ##############bi_tempered_logistic_loss############################################################################# loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, logits, probabilities) def layer_norm(input_tensor, name=None): """Run layer normalization on the last dimension of the tensor.""" return tf.contrib.layers.layer_norm( inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, num_train_steps, num_warmup_steps, use_tpu, use_one_hot_embeddings): """Returns `model_fn` closure for TPUEstimator.""" def model_fn(features, labels, mode, params): # pylint: disable=unused-argument """The `model_fn` for TPUEstimator.""" tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] 
segment_ids = features["segment_ids"] label_ids = features["label_ids"] is_real_example = None if "is_real_example" in features: is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) else: is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) is_training = (mode == tf.estimator.ModeKeys.TRAIN) (total_loss, per_example_loss, logits, probabilities) = create_model( bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, num_labels, use_one_hot_embeddings) tvars = tf.trainable_variables() initialized_variable_names = {} scaffold_fn = None if init_checkpoint: (assignment_map, initialized_variable_names ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) if use_tpu: def tpu_scaffold(): tf.train.init_from_checkpoint(init_checkpoint, assignment_map) return tf.train.Scaffold() scaffold_fn = tpu_scaffold else: tf.train.init_from_checkpoint(init_checkpoint, assignment_map) tf.logging.info("**** Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, init_string) output_spec = None if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn) elif mode == tf.estimator.ModeKeys.EVAL: def metric_fn(per_example_loss, label_ids, logits, is_real_example): predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) accuracy = tf.metrics.accuracy( labels=label_ids, predictions=predictions, weights=is_real_example) loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) return { "eval_accuracy": accuracy, "eval_loss": loss, } eval_metrics = (metric_fn, [per_example_loss, label_ids, logits, is_real_example]) output_spec = 
tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn) else: output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, predictions={"probabilities": probabilities}, scaffold_fn=scaffold_fn) return output_spec return model_fn # This function is not used by this file but is still used by the Colab and # people who depend on it. def input_fn_builder(features, seq_length, is_training, drop_remainder): """Creates an `input_fn` closure to be passed to TPUEstimator.""" all_input_ids = [] all_input_mask = [] all_segment_ids = [] all_label_ids = [] for feature in features: all_input_ids.append(feature.input_ids) all_input_mask.append(feature.input_mask) all_segment_ids.append(feature.segment_ids) all_label_ids.append(feature.label_id) def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] num_examples = len(features) # This is for demo purposes and does NOT scale to large data sets. We do # not use Dataset.from_generator() because that uses tf.py_func which is # not TPU compatible. The right way to load data is with TFRecordReader. d = tf.data.Dataset.from_tensor_slices({ "input_ids": tf.constant( all_input_ids, shape=[num_examples, seq_length], dtype=tf.int32), "input_mask": tf.constant( all_input_mask, shape=[num_examples, seq_length], dtype=tf.int32), "segment_ids": tf.constant( all_segment_ids, shape=[num_examples, seq_length], dtype=tf.int32), "label_ids": tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), }) if is_training: d = d.repeat() d = d.shuffle(buffer_size=100) d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) return d return input_fn class LCQMCPairClassificationProcessor(DataProcessor): # TODO NEED CHANGE2 """Processor for the internal data set. 
sentence pair classification""" def __init__(self): self.language = "zh" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train.txt")), "train") # dev_0827.tsv def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test.txt")), "test") def get_labels(self): """See base class.""" return ["0", "1"] #return ["-1","0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] print("length of lines:",len(lines)) for (i, line) in enumerate(lines): #print('#i:',i,line) if i == 0: continue guid = "%s-%s" % (set_type, i) try: label = tokenization.convert_to_unicode(line[2]) text_a = tokenization.convert_to_unicode(line[0]) text_b = tokenization.convert_to_unicode(line[1]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) except Exception: print('###error.i:', i, line) return examples class SentencePairClassificationProcessor(DataProcessor): """Processor for the internal data set. 
sentence pair classification""" def __init__(self): self.language = "zh" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train_0827.tsv")), "train") # dev_0827.tsv def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "dev_0827.tsv")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test_0827.tsv")), "test") def get_labels(self): """See base class.""" return ["0", "1"] #return ["-1","0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] print("length of lines:",len(lines)) for (i, line) in enumerate(lines): #print('#i:',i,line) if i == 0: continue guid = "%s-%s" % (set_type, i) try: label = tokenization.convert_to_unicode(line[0]) text_a = tokenization.convert_to_unicode(line[1]) text_b = tokenization.convert_to_unicode(line[2]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) except Exception: print('###error.i:', i, line) return examples # This function is not used by this file but is still used by the Colab and # people who depend on it. 
def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer):
  """Convert a set of `InputExample`s to a list of `InputFeatures`."""

  features = []
  for (ex_index, example) in enumerate(examples):
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer)

    features.append(feature)
  return features


def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  # Maps the lower-cased --task_name value to its data processor class.
  processors = {
      "sentence_pair": SentencePairClassificationProcessor,
      "lcqmc_pair": LCQMCPairClassificationProcessor,
      "lcqmc": LCQMCPairClassificationProcessor
  }

  tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
                                                FLAGS.init_checkpoint)

  if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
    raise ValueError(
        "At least one of `do_train`, `do_eval` or `do_predict` must be True.")

  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

  if FLAGS.max_seq_length > bert_config.max_position_embeddings:
    raise ValueError(
        "Cannot use sequence length %d because the BERT model "
        "was only trained up to sequence length %d" %
        (FLAGS.max_seq_length, bert_config.max_position_embeddings))

  tf.gfile.MakeDirs(FLAGS.output_dir)

  task_name = FLAGS.task_name.lower()

  if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))

  processor = processors[task_name]()

  label_list = processor.get_labels()

  tokenizer = tokenization.FullTokenizer(
      vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  # Cloud TPU: Invalid TPU configuration, ensure ClusterResolver is passed to tpu.
  print("###tpu_cluster_resolver:", tpu_cluster_resolver)
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.output_dir,
      save_checkpoints_steps=FLAGS.save_checkpoints_steps,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_tpu_cores,
          per_host_input_for_training=is_per_host))

  train_examples = None
  num_train_steps = None
  num_warmup_steps = None
  if FLAGS.do_train:
    train_examples = processor.get_train_examples(FLAGS.data_dir)  # TODO
    print("###length of total train_examples:", len(train_examples))
    num_train_steps = int(
        len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
    num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)

  model_fn = model_fn_builder(
      bert_config=bert_config,
      num_labels=len(label_list),
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_tpu)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size,
      predict_batch_size=FLAGS.predict_batch_size)

  if FLAGS.do_train:
    train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
    train_file_exists = os.path.exists(train_file)
    print("###train_file_exists:", train_file_exists, " ;train_file:", train_file)
    if not train_file_exists:
      # If the tf_record file does not exist yet, convert it from the raw
      # text file. TODO
      file_based_convert_examples_to_features(
          train_examples, label_list, FLAGS.max_seq_length, tokenizer,
          train_file)
    tf.logging.info("***** Running training *****")
    tf.logging.info(" Num examples = %d", len(train_examples))
    tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
    tf.logging.info(" Num steps = %d", num_train_steps)
    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        seq_length=FLAGS.max_seq_length,
        is_training=True,
        drop_remainder=True)
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

  if FLAGS.do_eval:
    eval_examples = processor.get_dev_examples(FLAGS.data_dir)
    num_actual_eval_examples = len(eval_examples)
    if FLAGS.use_tpu:
      # TPU requires a fixed batch size for all batches, therefore the number
      # of examples must be a multiple of the batch size, or else examples
      # will get dropped. So we pad with fake examples which are ignored
      # later on. These do NOT count towards the metric (all tf.metrics
      # support a per-instance weight, and these get a weight of 0.0).
      while len(eval_examples) % FLAGS.eval_batch_size != 0:
        eval_examples.append(PaddingInputExample())

    eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
    file_based_convert_examples_to_features(
        eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)

    tf.logging.info("***** Running evaluation *****")
    tf.logging.info(" Num examples = %d (%d actual, %d padding)",
                    len(eval_examples), num_actual_eval_examples,
                    len(eval_examples) - num_actual_eval_examples)
    tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)

    # This tells the estimator to run through the entire set.
    eval_steps = None
    # However, if running eval on the TPU, you will need to specify the
    # number of steps.
    if FLAGS.use_tpu:
      assert len(eval_examples) % FLAGS.eval_batch_size == 0
      eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)

    eval_drop_remainder = True if FLAGS.use_tpu else False
    eval_input_fn = file_based_input_fn_builder(
        input_file=eval_file,
        seq_length=FLAGS.max_seq_length,
        is_training=False,
        drop_remainder=eval_drop_remainder)

    #######################################################################################################################
    # Evaluate all checkpoints; you can then use the checkpoint with the best dev accuracy.
    steps_and_files = []
    filenames = tf.gfile.ListDirectory(FLAGS.output_dir)
    for filename in filenames:
      if filename.endswith(".index"):
        ckpt_name = filename[:-6]
        cur_filename = os.path.join(FLAGS.output_dir, ckpt_name)
        global_step = int(cur_filename.split("-")[-1])
        tf.logging.info("Add {} to eval list.".format(cur_filename))
        steps_and_files.append([global_step, cur_filename])
    steps_and_files = sorted(steps_and_files, key=lambda x: x[0])

    output_eval_file = os.path.join(FLAGS.data_dir, "eval_results_albert_zh.txt")
    print("output_eval_file:", output_eval_file)
    tf.logging.info("output_eval_file:" + output_eval_file)
    with tf.gfile.GFile(output_eval_file, "w") as writer:
      for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]):
        result = estimator.evaluate(input_fn=eval_input_fn,
                                    steps=eval_steps,
                                    checkpoint_path=filename)

        tf.logging.info("***** Eval results %s *****" % (filename))
        writer.write("***** Eval results %s *****\n" % (filename))
        for key in sorted(result.keys()):
          tf.logging.info(" %s = %s", key, str(result[key]))
          writer.write("%s = %s\n" % (key, str(result[key])))
    #######################################################################################################################

    #result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    #
    #output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
    #with tf.gfile.GFile(output_eval_file, "w") as writer:
    #  tf.logging.info("***** Eval results *****")
    #  for key in sorted(result.keys()):
    #    tf.logging.info(" %s = %s", key, str(result[key]))
    #    writer.write("%s = %s\n" % (key, str(result[key])))

  if FLAGS.do_predict:
    predict_examples = processor.get_test_examples(FLAGS.data_dir)
    num_actual_predict_examples = len(predict_examples)
    if FLAGS.use_tpu:
      # TPU requires a fixed batch size for all batches, therefore the number
      # of examples must be a multiple of the batch size, or else examples
      # will get dropped. So we pad with fake examples which are ignored
      # later on.
      while len(predict_examples) % FLAGS.predict_batch_size != 0:
        predict_examples.append(PaddingInputExample())

    predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
    file_based_convert_examples_to_features(predict_examples, label_list,
                                            FLAGS.max_seq_length, tokenizer,
                                            predict_file)

    tf.logging.info("***** Running prediction*****")
    tf.logging.info(" Num examples = %d (%d actual, %d padding)",
                    len(predict_examples), num_actual_predict_examples,
                    len(predict_examples) - num_actual_predict_examples)
    tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)

    predict_drop_remainder = True if FLAGS.use_tpu else False
    predict_input_fn = file_based_input_fn_builder(
        input_file=predict_file,
        seq_length=FLAGS.max_seq_length,
        is_training=False,
        drop_remainder=predict_drop_remainder)

    result = estimator.predict(input_fn=predict_input_fn)

    output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
    with tf.gfile.GFile(output_predict_file, "w") as writer:
      num_written_lines = 0
      tf.logging.info("***** Predict results *****")
      for (i, prediction) in enumerate(result):
        probabilities = prediction["probabilities"]
        if i >= num_actual_predict_examples:
          break
        output_line = "\t".join(
            str(class_probability)
            for class_probability in probabilities) + "\n"
        writer.write(output_line)
        num_written_lines += 1
    assert num_written_lines == num_actual_predict_examples


if __name__ == "__main__":
  flags.mark_flag_as_required("data_dir")
  flags.mark_flag_as_required("task_name")
  flags.mark_flag_as_required("vocab_file")
  flags.mark_flag_as_required("bert_config_file")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()

================================================
FILE: run_classifier_clue.py
================================================
# -*- coding: utf-8 -*-
# @Author: bo.shi
# @Date: 2019-11-04 09:56:36
# @Last Modified by: bo.shi
# @Last Modified time: 2019-12-04 14:29:04
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import os  # used by main(); imported explicitly rather than relying on the star import below
import modeling
import optimization_finetuning as optimization
import tokenization
import tensorflow as tf
# from loss import bi_tempered_logistic_loss
import sys
sys.path.append('..')
from classifier_utils import *

flags = tf.flags

FLAGS = flags.FLAGS

# Required parameters
flags.DEFINE_string(
    "data_dir", None,
    "The input data dir. Should contain the .tsv files (or other data files) "
    "for the task.")

flags.DEFINE_string(
    "bert_config_file", None,
    "The config json file corresponding to the pre-trained BERT model. "
    "This specifies the model architecture.")

flags.DEFINE_string("task_name", None, "The name of the task to train.")

flags.DEFINE_string("vocab_file", None,
                    "The vocabulary file that the BERT model was trained on.")

flags.DEFINE_string(
    "output_dir", None,
    "The output directory where the model checkpoints will be written.")

# Other parameters

flags.DEFINE_string(
    "init_checkpoint", None,
    "Initial checkpoint (usually from a pre-trained BERT model).")

flags.DEFINE_bool(
    "do_lower_case", True,
    "Whether to lower case the input text. Should be True for uncased "
    "models and False for cased models.")

flags.DEFINE_integer(
    "max_seq_length", 128,
    "The maximum total input sequence length after WordPiece tokenization. "
    "Sequences longer than this will be truncated, and sequences shorter "
    "than this will be padded.")

flags.DEFINE_bool("do_train", False, "Whether to run training.")

flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")

flags.DEFINE_bool(
    "do_predict", False,
    "Whether to run the model in inference mode on the test set.")

flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")

flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")

flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")

flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")

flags.DEFINE_float("num_train_epochs", 3.0,
                   "Total number of training epochs to perform.")

flags.DEFINE_float(
    "warmup_proportion", 0.1,
    "Proportion of training to perform linear learning rate warmup for. "
    "E.g., 0.1 = 10% of training.")

flags.DEFINE_integer("save_checkpoints_steps", 1000,
                     "How often to save the model checkpoint.")

flags.DEFINE_integer("iterations_per_loop", 1000,
                     "How many steps to make in each estimator call.")

flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")

tf.flags.DEFINE_string(
    "tpu_name", None,
    "The Cloud TPU to use for training. This should be either the name "
    "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
    "url.")

tf.flags.DEFINE_string(
    "tpu_zone", None,
    "[Optional] GCE zone where the Cloud TPU is located in. If not "
    "specified, we will attempt to automatically detect the GCE zone from "
    "metadata.")

tf.flags.DEFINE_string(
    "gcp_project", None,
    "[Optional] Project name for the Cloud TPU-enabled project. If not "
    "specified, we will attempt to automatically detect the GCE project from "
    "metadata.")

tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")

flags.DEFINE_integer(
    "num_tpu_cores", 8,
    "Only used if `use_tpu` is True. Total number of TPU cores to use.")


class InputFeatures(object):
  """A single set of features of data."""

  def __init__(self,
               input_ids,
               input_mask,
               segment_ids,
               label_id,
               is_real_example=True):
    self.input_ids = input_ids
    self.input_mask = input_mask
    self.segment_ids = segment_ids
    self.label_id = label_id
    self.is_real_example = is_real_example


def convert_single_example_for_inews(ex_index, tokens_a, tokens_b, label_map,
                                     max_seq_length, tokenizer, example):
  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)

  return feature


def convert_example_list_for_inews(ex_index, example, label_list, max_seq_length,
                                   tokenizer):
  """Converts a single `InputExample` into a list of `InputFeatures`."""

  if isinstance(example, PaddingInputExample):
    return [InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)]

  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)
  # Room left for `tokens_b` once `tokens_a`, [CLS] and two [SEP] are placed.
  must_len = len(tokens_a) + 3
  extra_len = max_seq_length - must_len
  feature_list = []
  if example.text_b and extra_len > 0:
    # Split `tokens_b` into consecutive windows of at most `extra_len` tokens
    # and emit one feature per window.
    extra_num = int((len(tokens_b) - 1) / extra_len) + 1
    for num in range(extra_num):
      max_len = min((num + 1) * extra_len, len(tokens_b))
      tokens_b_sub = tokens_b[num * extra_len: max_len]
      feature = convert_single_example_for_inews(
          ex_index, tokens_a, tokens_b_sub, label_map, max_seq_length,
          tokenizer, example)
      feature_list.append(feature)
  else:
    feature = convert_single_example_for_inews(
        ex_index, tokens_a, tokens_b, label_map, max_seq_length, tokenizer,
        example)
    feature_list.append(feature)
  return feature_list


def file_based_convert_examples_to_features_for_inews(
    examples, label_list, max_seq_length, tokenizer, output_file):
  """Convert a set of `InputExample`s to a TFRecord file."""

  writer = tf.python_io.TFRecordWriter(output_file)
  num_example = 0
  for (ex_index, example) in enumerate(examples):
    if ex_index % 1000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature_list = convert_example_list_for_inews(ex_index, example, label_list,
                                                  max_seq_length, tokenizer)
    num_example += len(feature_list)

    def create_int_feature(values):
      f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
      return f

    features = collections.OrderedDict()
    for feature in feature_list:
      features["input_ids"] = create_int_feature(feature.input_ids)
      features["input_mask"] = create_int_feature(feature.input_mask)
      features["segment_ids"] = create_int_feature(feature.segment_ids)
      features["label_ids"] = create_int_feature([feature.label_id])
      features["is_real_example"] = create_int_feature(
          [int(feature.is_real_example)])

      # One record is written per feature in the list.
      tf_example = tf.train.Example(features=tf.train.Features(feature=features))
      writer.write(tf_example.SerializeToString())
  tf.logging.info("feature num: %s", num_example)
  writer.close()


def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""

  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)
  return feature


def file_based_convert_examples_to_features(
    examples, label_list, max_seq_length, tokenizer, output_file):
  """Convert a set of `InputExample`s to a TFRecord file."""

  writer = tf.python_io.TFRecordWriter(output_file)

  for (ex_index, example) in enumerate(examples):
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer)

    def create_int_feature(values):
      f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
      return f

    features = collections.OrderedDict()
    features["input_ids"] = create_int_feature(feature.input_ids)
    features["input_mask"] = create_int_feature(feature.input_mask)
    features["segment_ids"] = create_int_feature(feature.segment_ids)
    features["label_ids"] = create_int_feature([feature.label_id])
    features["is_real_example"] = create_int_feature(
        [int(feature.is_real_example)])

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tf_example.SerializeToString())
  writer.close()


def file_based_input_fn_builder(input_file, seq_length, is_training,
                                drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  name_to_features = {
      "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
      "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
      "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
      "label_ids": tf.FixedLenFeature([], tf.int64),
      "is_real_example": tf.FixedLenFeature([], tf.int64),
  }

  def _decode_record(record, name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.parse_single_example(record, name_to_features)

    # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
    # So cast all int64 to int32.
    for name in list(example.keys()):
      t = example[name]
      if t.dtype == tf.int64:
        t = tf.to_int32(t)
      example[name] = t

    return example

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    # For training, we want a lot of parallel reading and shuffling.
    # For eval, we want no shuffling and parallel reading doesn't matter.
    d = tf.data.TFRecordDataset(input_file)
    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder))

    return d

  return input_fn


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  """Truncates a sequence pair in place to the maximum length."""

  # This is a simple heuristic which will always truncate the longer sequence
  # one token at a time. This makes more sense than truncating an equal percent
  # of tokens from each, since if one sequence is very short then each token
  # that's truncated likely contains more information than a longer sequence.
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_length:
      break
    if len(tokens_a) > len(tokens_b):
      tokens_a.pop()
    else:
      tokens_b.pop()


def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  # In the demo, we are doing a simple classification task on the entire
  # segment.
  #
  # If you want to use the token-level output, use model.get_sequence_output()
  # instead.
  output_layer = model.get_pooled_output()

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    ln_type = bert_config.ln_type
    if ln_type == 'preln':
      # Added by brightmart, 10-06: with pre-LN we need an additional layer
      # normalization here, as suggested in the paper "On Layer Normalization
      # in the Transformer Architecture".
      print("ln_type is preln. add LN layer.")
      output_layer = layer_norm(output_layer)
    else:
      print("ln_type is postln or other,do nothing.")

    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)  # todo 08-29 try temp-loss
    ###############bi_tempered_logistic_loss############################################################################
    # print("##cross entropy loss is used...."); tf.logging.info("##cross entropy loss is used....")
    # t1=0.9 #t1=0.90
    # t2=1.05 #t2=1.05
    # per_example_loss=bi_tempered_logistic_loss(log_probs,one_hot_labels,t1,t2,label_smoothing=0.1,num_iters=5) # TODO label_smoothing=0.0
    # tf.logging.info("per_example_loss:"+str(per_example_loss.shape))
    ##############bi_tempered_logistic_loss#############################################################################

    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)


def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)


def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
  """Returns `model_fn` closure for TPUEstimator."""

  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for TPUEstimator."""

    tf.logging.info("*** Features ***")
    for name in sorted(features.keys()):
      tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]
    is_real_example = None
    if "is_real_example" in features:
      is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
    else:
      is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    (total_loss, per_example_loss, logits, probabilities) = create_model(
        bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
        num_labels, use_one_hot_embeddings)

    tvars = tf.trainable_variables()
    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      (assignment_map, initialized_variable_names
      ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
      if use_tpu:

        def tpu_scaffold():
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
          return tf.train.Scaffold()

        scaffold_fn = tpu_scaffold
      else:
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

    output_spec = None
    if mode == tf.estimator.ModeKeys.TRAIN:

      train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          train_op=train_op,
          scaffold_fn=scaffold_fn)
    elif mode == tf.estimator.ModeKeys.EVAL:

      def metric_fn(per_example_loss, label_ids, logits, is_real_example):
        predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
        accuracy = tf.metrics.accuracy(
            labels=label_ids, predictions=predictions, weights=is_real_example)
        loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
        return {
            "eval_accuracy": accuracy,
            "eval_loss": loss,
        }

      eval_metrics = (metric_fn,
                      [per_example_loss, label_ids, logits, is_real_example])
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          eval_metrics=eval_metrics,
          scaffold_fn=scaffold_fn)
    else:
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          predictions={"probabilities": probabilities},
          scaffold_fn=scaffold_fn)
    return output_spec

  return model_fn


# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_id)

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    num_examples = len(features)

    # This is for demo purposes and does NOT scale to large data sets. We do
    # not use Dataset.from_generator() because that uses tf.py_func which is
    # not TPU compatible. The right way to load data is with TFRecordReader.
    d = tf.data.Dataset.from_tensor_slices({
        "input_ids":
            tf.constant(
                all_input_ids, shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_mask":
            tf.constant(
                all_input_mask,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "segment_ids":
            tf.constant(
                all_segment_ids,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
    })

    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
    return d

  return input_fn


# This function is not used by this file but is still used by the Colab and
# people who depend on it.
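# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original file. `metric_fn` above
# passes `is_real_example` as the `weights` argument so that padding
# examples (weight 0.0) do not count towards accuracy. The plain-Python
# function below is a rough analogue of that weighted accuracy, written
# under the assumption that it amounts to sum(w * correct) / sum(w).
# `weighted_accuracy` is a hypothetical helper, not a TensorFlow API.
# ---------------------------------------------------------------------------
def weighted_accuracy(labels, predictions, weights):
  """Returns the accuracy over examples, each scaled by its weight."""
  total = float(sum(weights))
  if total == 0.0:
    return 0.0  # nothing but padding
  correct = sum(w for l, p, w in zip(labels, predictions, weights) if l == p)
  return correct / total


# Three real examples (two predicted correctly) plus one padding example
# whose prediction is ignored because its weight is 0.0.
sketch_accuracy = weighted_accuracy(
    labels=[1, 0, 1, 0], predictions=[1, 0, 0, 0], weights=[1.0, 1.0, 1.0, 0.0])
# --------------------------- end of sketch ---------------------------------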
def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer):
  """Convert a set of `InputExample`s to a list of `InputFeatures`."""

  features = []
  for (ex_index, example) in enumerate(examples):
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer)

    features.append(feature)
  return features


def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "xnli": XnliProcessor,
      "tnews": TnewsProcessor,
      "afqmc": AFQMCProcessor,
      "iflytek": iFLYTEKDataProcessor,
      "copa": COPAProcessor,
      "cmnli": CMNLIProcessor,
      "wsc": WSCProcessor,
      "csl": CslProcessor,
  }

  tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
                                                FLAGS.init_checkpoint)

  if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
    raise ValueError(
        "At least one of `do_train`, `do_eval` or `do_predict` must be True.")

  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

  if FLAGS.max_seq_length > bert_config.max_position_embeddings:
    raise ValueError(
        "Cannot use sequence length %d because the BERT model "
        "was only trained up to sequence length %d" %
        (FLAGS.max_seq_length, bert_config.max_position_embeddings))

  tf.gfile.MakeDirs(FLAGS.output_dir)

  task_name = FLAGS.task_name.lower()

  if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))

  processor = processors[task_name]()

  label_list = processor.get_labels()

  tokenizer = tokenization.FullTokenizer(
      vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  # Cloud TPU: Invalid TPU configuration, ensure ClusterResolver is passed to tpu.
print("###tpu_cluster_resolver:", tpu_cluster_resolver) run_config = tf.contrib.tpu.RunConfig( cluster=tpu_cluster_resolver, master=FLAGS.master, model_dir=FLAGS.output_dir, save_checkpoints_steps=FLAGS.save_checkpoints_steps, tpu_config=tf.contrib.tpu.TPUConfig( iterations_per_loop=FLAGS.iterations_per_loop, num_shards=FLAGS.num_tpu_cores, per_host_input_for_training=is_per_host)) train_examples = None num_train_steps = None num_warmup_steps = None if FLAGS.do_train: train_examples = processor.get_train_examples(FLAGS.data_dir) # TODO print("###length of total train_examples:", len(train_examples)) num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) model_fn = model_fn_builder( bert_config=bert_config, num_labels=len(label_list), init_checkpoint=FLAGS.init_checkpoint, learning_rate=FLAGS.learning_rate, num_train_steps=num_train_steps, num_warmup_steps=num_warmup_steps, use_tpu=FLAGS.use_tpu, use_one_hot_embeddings=FLAGS.use_tpu) # If TPU is not available, this will fall back to normal Estimator on CPU # or GPU. estimator = tf.contrib.tpu.TPUEstimator( use_tpu=FLAGS.use_tpu, model_fn=model_fn, config=run_config, train_batch_size=FLAGS.train_batch_size, eval_batch_size=FLAGS.eval_batch_size, predict_batch_size=FLAGS.predict_batch_size) if FLAGS.do_train: train_file = os.path.join(FLAGS.output_dir, "train.tf_record") train_file_exists = os.path.exists(train_file) print("###train_file_exists:", train_file_exists, " ;train_file:", train_file) if not train_file_exists: # if tf_record file not exist, convert from raw text file. 
# TODO if task_name == "inews": file_based_convert_examples_to_features_for_inews( train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) else: file_based_convert_examples_to_features( train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) tf.logging.info("***** Running training *****") tf.logging.info(" Num examples = %d", len(train_examples)) tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) tf.logging.info(" Num steps = %d", num_train_steps) train_input_fn = file_based_input_fn_builder( input_file=train_file, seq_length=FLAGS.max_seq_length, is_training=True, drop_remainder=True) estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) if FLAGS.do_eval: # dev dataset eval_examples = processor.get_dev_examples(FLAGS.data_dir) num_actual_eval_examples = len(eval_examples) if FLAGS.use_tpu: # TPU requires a fixed batch size for all batches, therefore the number # of examples must be a multiple of the batch size, or else examples # will get dropped. So we pad with fake examples which are ignored # later on. These do NOT count towards the metric (all tf.metrics # support a per-instance weight, and these get a weight of 0.0). while len(eval_examples) % FLAGS.eval_batch_size != 0: eval_examples.append(PaddingInputExample()) eval_file = os.path.join(FLAGS.output_dir, "dev.tf_record") if task_name == "inews": file_based_convert_examples_to_features_for_inews( eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) else: file_based_convert_examples_to_features( eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) tf.logging.info("***** Running evaluation *****") tf.logging.info(" Num examples = %d (%d actual, %d padding)", len(eval_examples), num_actual_eval_examples, len(eval_examples) - num_actual_eval_examples) tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) # This tells the estimator to run through the entire set. 
    eval_steps = None
    # However, if running eval on the TPU, you will need to specify the
    # number of steps.
    if FLAGS.use_tpu:
      assert len(eval_examples) % FLAGS.eval_batch_size == 0
      eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)

    eval_drop_remainder = True if FLAGS.use_tpu else False
    eval_input_fn = file_based_input_fn_builder(
        input_file=eval_file,
        seq_length=FLAGS.max_seq_length,
        is_training=False,
        drop_remainder=eval_drop_remainder)

    #######################################################################################################################
    # Evaluate all checkpoints; you can use the checkpoint with the best dev accuracy.
    steps_and_files = []
    filenames = tf.gfile.ListDirectory(FLAGS.output_dir)
    for filename in filenames:
      if filename.endswith(".index"):
        # Strip the ".index" suffix to get the checkpoint prefix.
        ckpt_name = filename[:-6]
        cur_filename = os.path.join(FLAGS.output_dir, ckpt_name)
        global_step = int(cur_filename.split("-")[-1])
        tf.logging.info("Add {} to eval list.".format(cur_filename))
        steps_and_files.append([global_step, cur_filename])
    steps_and_files = sorted(steps_and_files, key=lambda x: x[0])

    output_eval_file = os.path.join(FLAGS.data_dir, "dev_results_albert_zh.txt")
    print("output_eval_file:", output_eval_file)
    tf.logging.info("output_eval_file:" + output_eval_file)
    with tf.gfile.GFile(output_eval_file, "w") as writer:
      for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]):
        result = estimator.evaluate(input_fn=eval_input_fn,
                                    steps=eval_steps, checkpoint_path=filename)

        tf.logging.info("***** Eval results %s *****" % (filename))
        writer.write("***** Eval results %s *****\n" % (filename))
        for key in sorted(result.keys()):
          tf.logging.info(" %s = %s", key, str(result[key]))
          writer.write("%s = %s\n" % (key, str(result[key])))
    #######################################################################################################################

    # result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    #
    # output_eval_file = os.path.join(FLAGS.output_dir,
"dev_results_albert_zh.txt") # with tf.gfile.GFile(output_eval_file, "w") as writer: # tf.logging.info("***** Eval results *****") # for key in sorted(result.keys()): # tf.logging.info(" %s = %s", key, str(result[key])) # writer.write("%s = %s\n" % (key, str(result[key]))) if FLAGS.do_predict: predict_examples = processor.get_test_examples(FLAGS.data_dir) num_actual_predict_examples = len(predict_examples) if FLAGS.use_tpu: # TPU requires a fixed batch size for all batches, therefore the number # of examples must be a multiple of the batch size, or else examples # will get dropped. So we pad with fake examples which are ignored # later on. while len(predict_examples) % FLAGS.predict_batch_size != 0: predict_examples.append(PaddingInputExample()) predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") if task_name == "inews": file_based_convert_examples_to_features_for_inews(predict_examples, label_list, FLAGS.max_seq_length, tokenizer, predict_file) else: file_based_convert_examples_to_features(predict_examples, label_list, FLAGS.max_seq_length, tokenizer, predict_file) tf.logging.info("***** Running prediction*****") tf.logging.info(" Num examples = %d (%d actual, %d padding)", len(predict_examples), num_actual_predict_examples, len(predict_examples) - num_actual_predict_examples) tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) predict_drop_remainder = True if FLAGS.use_tpu else False predict_input_fn = file_based_input_fn_builder( input_file=predict_file, seq_length=FLAGS.max_seq_length, is_training=False, drop_remainder=predict_drop_remainder) result = estimator.predict(input_fn=predict_input_fn) index2label_map = {} for (i, label) in enumerate(label_list): index2label_map[i] = label output_predict_file_label_name = task_name + "_predict.json" output_predict_file_label = os.path.join(FLAGS.output_dir, output_predict_file_label_name) output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") with 
tf.gfile.GFile(output_predict_file_label, "w") as writer_label:
      with tf.gfile.GFile(output_predict_file, "w") as writer:
        num_written_lines = 0
        tf.logging.info("***** Predict results *****")
        for (i, prediction) in enumerate(result):
          probabilities = prediction["probabilities"]
          label_index = probabilities.argmax(0)
          # Stop once the padding examples (appended for TPU) are reached.
          if i >= num_actual_predict_examples:
            break
          output_line = "\t".join(
              str(class_probability)
              for class_probability in probabilities) + "\n"
          test_label_dict = {}
          test_label_dict["id"] = i
          test_label_dict["label"] = str(index2label_map[label_index])
          if task_name == "tnews":
            test_label_dict["label_desc"] = ""
          writer.write(output_line)
          json.dump(test_label_dict, writer_label)
          writer_label.write("\n")
          num_written_lines += 1
    assert num_written_lines == num_actual_predict_examples


if __name__ == "__main__":
  flags.mark_flag_as_required("data_dir")
  flags.mark_flag_as_required("task_name")
  flags.mark_flag_as_required("vocab_file")
  flags.mark_flag_as_required("bert_config_file")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()

================================================
FILE: run_classifier_clue.sh
================================================
#!/usr/bin/env bash
# @Author: bo.shi
# @Date: 2020-03-15 16:11:00
# @Last Modified by: bo.shi
# @Last Modified time: 2020-04-02 17:54:05

export CUDA_VISIBLE_DEVICES="0"

CURRENT_DIR=$(cd -P -- "$(dirname -- "$0")" && pwd -P)
CLUE_DATA_DIR=$CURRENT_DIR/CLUEdataset
ALBERT_TINY_DIR=$CURRENT_DIR/albert_tiny

download_data(){
  TASK_NAME=$1
  if [ ! -d $CLUE_DATA_DIR ]; then
    mkdir -p $CLUE_DATA_DIR
    echo "makedir $CLUE_DATA_DIR"
  fi
  cd $CLUE_DATA_DIR
  if [ ! -d ${TASK_NAME} ]; then
    mkdir $TASK_NAME
    echo "make dataset dir $CLUE_DATA_DIR/$TASK_NAME"
  fi
  cd $TASK_NAME
  if [ ! -f "train.json" ] || [ ! -f "dev.json" ] || [ !
-f "test.json" ]; then rm * wget https://storage.googleapis.com/cluebenchmark/tasks/${TASK_NAME}_public.zip unzip ${TASK_NAME}_public.zip rm ${TASK_NAME}_public.zip else echo "data exists" fi echo "Finish download dataset." } download_model(){ if [ ! -d $ALBERT_TINY_DIR ]; then mkdir -p $ALBERT_TINY_DIR echo "makedir $ALBERT_TINY_DIR" fi cd $ALBERT_TINY_DIR if [ ! -f "albert_config_tiny.json" ] || [ ! -f "vocab.txt" ] || [ ! -f "checkpoint" ] || [ ! -f "albert_model.ckpt.index" ] || [ ! -f "albert_model.ckpt.meta" ] || [ ! -f "albert_model.ckpt.data-00000-of-00001" ]; then rm * wget -c https://storage.googleapis.com/albert_zh/albert_tiny_489k.zip unzip albert_tiny_489k.zip rm albert_tiny_489k.zip else echo "model exists" fi echo "Finish download model." } run_task() { TASK_NAME=$1 download_data $TASK_NAME download_model $MODEL_NAME DATA_DIR=$CLUE_DATA_DIR/${TASK_NAME} PREV_TRAINED_MODEL_DIR=$ALBERT_TINY_DIR MAX_SEQ_LENGTH=$2 TRAIN_BATCH_SIZE=$3 LEARNING_RATE=$4 NUM_TRAIN_EPOCHS=$5 SAVE_CHECKPOINTS_STEPS=$6 OUTPUT_DIR=$CURRENT_DIR/${TASK_NAME}_output/ COMMON_ARGS=" --task_name=$TASK_NAME \ --data_dir=$DATA_DIR \ --vocab_file=$PREV_TRAINED_MODEL_DIR/vocab.txt \ --bert_config_file=$PREV_TRAINED_MODEL_DIR/albert_config_tiny.json \ --init_checkpoint=$PREV_TRAINED_MODEL_DIR/albert_model.ckpt \ --max_seq_length=$MAX_SEQ_LENGTH \ --train_batch_size=$TRAIN_BATCH_SIZE \ --learning_rate=$LEARNING_RATE \ --num_train_epochs=$NUM_TRAIN_EPOCHS \ --save_checkpoints_steps=$SAVE_CHECKPOINTS_STEPS \ --output_dir=$OUTPUT_DIR \ --keep_checkpoint_max=0 \ " cd $CURRENT_DIR echo "Start running..." python run_classifier_clue.py \ $COMMON_ARGS \ --do_train=true \ --do_eval=false \ --do_predict=false echo "Start predict..." 
  python run_classifier_clue.py \
    $COMMON_ARGS \
    --do_train=false \
    --do_eval=true \
    --do_predict=true
}

##command##task_name##max_seq_length##train_batch_size##learning_rate##num_train_epochs##save_checkpoints_steps
run_task afqmc 128 16 2e-5 3 300
run_task cmnli 128 64 3e-5 2 300
run_task csl 128 16 1e-5 5 100
run_task iflytek 128 32 2e-5 3 300
run_task tnews 128 16 2e-5 3 300
run_task wsc 128 16 1e-5 10 10

================================================
FILE: run_classifier_lcqmc.sh
================================================
#!/usr/bin/env bash
# @Author: bo.shi, https://github.com/chineseGLUE/chineseGLUE
# @Date: 2019-11-04 09:56:36
# @Last Modified by: bright
# @Last Modified time: 2019-11-10 09:00:00

TASK_NAME="lcqmc"
MODEL_NAME="albert_tiny_zh"
CURRENT_DIR=$(cd -P -- "$(dirname -- "$0")" && pwd -P)
export CUDA_VISIBLE_DEVICES="0"
export ALBERT_CONFIG_DIR=$CURRENT_DIR/albert_config
export ALBERT_PRETRAINED_MODELS_DIR=$CURRENT_DIR/prev_trained_model
export ALBERT_TINY_DIR=$ALBERT_PRETRAINED_MODELS_DIR/$MODEL_NAME

#mkdir chineseGLUEdatasets
export GLUE_DATA_DIR=$CURRENT_DIR/chineseGLUEdatasets

# download and unzip dataset
if [ ! -d $GLUE_DATA_DIR ]; then
  mkdir -p $GLUE_DATA_DIR
  echo "makedir $GLUE_DATA_DIR"
fi
cd $GLUE_DATA_DIR
if [ ! -d $TASK_NAME ]; then
  mkdir $TASK_NAME
  echo "makedir $GLUE_DATA_DIR/$TASK_NAME"
fi
cd $TASK_NAME
echo "Please try again if the data is not downloaded successfully."
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/train.txt
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/dev.txt
wget -c https://raw.githubusercontent.com/pengming617/text_matching/master/data/test.txt
echo "Finish download dataset."

# download model
if [ ! -d $ALBERT_TINY_DIR ]; then
  mkdir -p $ALBERT_TINY_DIR
  echo "makedir $ALBERT_TINY_DIR"
fi
cd $ALBERT_TINY_DIR
if [ ! -f "albert_config_tiny.json" ] || [ ! -f "vocab.txt" ] || [ ! -f "checkpoint" ] || [ !
-f "albert_model.ckpt.index" ] || [ ! -f "albert_model.ckpt.meta" ] || [ ! -f "albert_model.ckpt.data-00000-of-00001" ]; then rm * wget https://storage.googleapis.com/albert_zh/albert_tiny_489k.zip unzip albert_tiny_489k.zip rm albert_tiny_489k.zip else echo "model exists" fi echo "Finish download model." # run task cd $CURRENT_DIR echo "Start running..." python run_classifier.py \ --task_name=$TASK_NAME \ --do_train=true \ --do_eval=true \ --data_dir=$GLUE_DATA_DIR/$TASK_NAME \ --vocab_file=$ALBERT_CONFIG_DIR/vocab.txt \ --bert_config_file=$ALBERT_CONFIG_DIR/albert_config_tiny.json \ --init_checkpoint=$ALBERT_TINY_DIR/albert_model.ckpt \ --max_seq_length=128 \ --train_batch_size=64 \ --learning_rate=1e-4 \ --num_train_epochs=5.0 \ --output_dir=$CURRENT_DIR/${TASK_NAME}_output/ ================================================ FILE: run_classifier_sp_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
# Lint as: python2, python3 """BERT finetuning runner with sentence piece tokenization.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import csv import os import six from six.moves import zip import tensorflow as tf import modeling_google as modeling import optimization_google as optimization import tokenization_google as tokenization flags = tf.flags FLAGS = flags.FLAGS ## Required parameters flags.DEFINE_string( "data_dir", None, "The input data dir. Should contain the .tsv files (or other data files) " "for the task.") flags.DEFINE_string( "albert_config_file", None, "The config json file corresponding to the pre-trained ALBERT model. " "This specifies the model architecture.") flags.DEFINE_string("task_name", None, "The name of the task to train.") flags.DEFINE_string( "vocab_file", None, "The vocabulary file that the ALBERT model was trained on.") flags.DEFINE_string("spm_model_file", None, "The model file for sentence piece tokenization.") flags.DEFINE_string( "output_dir", None, "The output directory where the model checkpoints will be written.") ## Other parameters flags.DEFINE_string( "init_checkpoint", None, "Initial checkpoint (usually from a pre-trained ALBERT model).") flags.DEFINE_bool( "use_pooled_output", True, "Whether to use the CLS token outputs") flags.DEFINE_bool( "do_lower_case", True, "Whether to lower case the input text. Should be True for uncased " "models and False for cased models.") flags.DEFINE_integer( "max_seq_length", 512, "The maximum total input sequence length after WordPiece tokenization. 
" "Sequences longer than this will be truncated, and sequences shorter " "than this will be padded.") flags.DEFINE_bool("do_train", False, "Whether to run training.") flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") flags.DEFINE_bool( "do_predict", False, "Whether to run the model in inference mode on the test set.") flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") flags.DEFINE_float("num_train_epochs", 3.0, "Total number of training epochs to perform.") flags.DEFINE_float( "warmup_proportion", 0.1, "Proportion of training to perform linear learning rate warmup for. " "E.g., 0.1 = 10% of training.") flags.DEFINE_integer("save_checkpoints_steps", 1000, "How often to save the model checkpoint.") flags.DEFINE_integer("iterations_per_loop", 1000, "How many steps to make in each estimator call.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") tf.flags.DEFINE_string( "tpu_name", None, "The Cloud TPU to use for training. This should be either the name " "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " "url.") tf.flags.DEFINE_string( "tpu_zone", None, "[Optional] GCE zone where the Cloud TPU is located in. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string( "gcp_project", None, "[Optional] Project name for the Cloud TPU-enabled project. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") flags.DEFINE_integer( "num_tpu_cores", 8, "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")


class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs an InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label


class PaddingInputExample(object):
  """Fake example so the num input examples is a multiple of the batch size.

  When running eval/predict on the TPU, we need to pad the number of examples
  to be a multiple of the batch size, because the TPU requires a fixed batch
  size. The alternative is to drop the last batch, which is bad because it means
  the entire output data won't be generated.

  We use this class instead of `None` because treating `None` as padding
  batches could cause silent errors.
""" class InputFeatures(object): """A single set of features of data.""" def __init__(self, input_ids, input_mask, segment_ids, label_id, is_real_example=True): self.input_ids = input_ids self.input_mask = input_mask self.segment_ids = segment_ids self.label_id = label_id self.is_real_example = is_real_example class DataProcessor(object): """Base class for data converters for sequence classification data sets.""" def get_train_examples(self, data_dir): """Gets a collection of `InputExample`s for the train set.""" raise NotImplementedError() def get_dev_examples(self, data_dir): """Gets a collection of `InputExample`s for the dev set.""" raise NotImplementedError() def get_test_examples(self, data_dir): """Gets a collection of `InputExample`s for prediction.""" raise NotImplementedError() def get_labels(self): """Gets the list of labels for this data set.""" raise NotImplementedError() @classmethod def _read_tsv(cls, input_file, quotechar=None): """Reads a tab separated value file.""" with tf.gfile.Open(input_file, "r") as f: reader = csv.reader(f, delimiter="\t", quotechar=quotechar) lines = [] for line in reader: lines.append(line) return lines class XnliProcessor(DataProcessor): """Processor for the XNLI data set.""" def __init__(self): self.language = "zh" def get_train_examples(self, data_dir): """See base class.""" lines = self._read_tsv( os.path.join(data_dir, "multinli", "multinli.train.%s.tsv" % self.language)) examples = [] for (i, line) in enumerate(lines): if i == 0: continue guid = "train-%d" % (i) text_a = tokenization.convert_to_unicode(line[0]) text_b = tokenization.convert_to_unicode(line[1]) label = tokenization.convert_to_unicode(line[2]) if label == tokenization.convert_to_unicode("contradictory"): label = tokenization.convert_to_unicode("contradiction") examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples def get_dev_examples(self, data_dir): """See base class.""" lines = 
self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) examples = [] for (i, line) in enumerate(lines): if i == 0: continue guid = "dev-%d" % (i) language = tokenization.convert_to_unicode(line[0]) if language != tokenization.convert_to_unicode(self.language): continue text_a = tokenization.convert_to_unicode(line[6]) text_b = tokenization.convert_to_unicode(line[7]) label = tokenization.convert_to_unicode(line[1]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples def get_labels(self): """See base class.""" return ["contradiction", "entailment", "neutral"] class MnliProcessor(DataProcessor): """Processor for the MultiNLI data set (GLUE version).""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), "dev_matched") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") def get_labels(self): """See base class.""" return ["contradiction", "entailment", "neutral"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): if i == 0: continue # Note(mingdachen): We will rely on this guid for GLUE submission. 
guid = tokenization.preprocess_text(line[0], lower=FLAGS.do_lower_case) text_a = tokenization.preprocess_text(line[8], lower=FLAGS.do_lower_case) text_b = tokenization.preprocess_text(line[9], lower=FLAGS.do_lower_case) if set_type == "test": label = "contradiction" else: label = tokenization.preprocess_text(line[-1]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples class LCQMCPairClassificationProcessor(DataProcessor): """Processor for the internal data set. sentence pair classification""" def __init__(self): self.language = "zh" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train.txt")), "train") # dev_0827.tsv def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test.txt")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test.txt")), "test") def get_labels(self): """See base class.""" return ["0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] print("length of lines:",len(lines)) for (i, line) in enumerate(lines): if i == 0: continue guid = "%s-%s" % (set_type, i) try: label = tokenization.convert_to_unicode(line[2]) text_a = tokenization.convert_to_unicode(line[0]) text_b = tokenization.convert_to_unicode(line[1]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) except Exception: print('###error.i:', i, line) return examples class MrpcProcessor(DataProcessor): """Processor for the MRPC data set (GLUE version).""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( 
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") def get_labels(self): """See base class.""" return ["0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): if i == 0: continue guid = "%s-%s" % (set_type, i) text_a = tokenization.preprocess_text(line[3], lower=FLAGS.do_lower_case) text_b = tokenization.preprocess_text(line[4], lower=FLAGS.do_lower_case) if set_type == "test": guid = line[0] label = "0" else: label = tokenization.preprocess_text(line[0]) examples.append( InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples class ColaProcessor(DataProcessor): """Processor for the CoLA data set (GLUE version).""" def get_train_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") def get_dev_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") def get_test_examples(self, data_dir): """See base class.""" return self._create_examples( self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") def get_labels(self): """See base class.""" return ["0", "1"] def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] for (i, line) in enumerate(lines): # Only the test set has a header if set_type == "test" and i == 0: continue guid = "%s-%s" % (set_type, i) if set_type == "test": guid = line[0] text_a = tokenization.preprocess_text( line[1], lower=FLAGS.do_lower_case) label = "0" else: text_a = tokenization.preprocess_text( line[3], lower=FLAGS.do_lower_case) label = tokenization.preprocess_text(line[1]) examples.append( InputExample(guid=guid, text_a=text_a, 
text_b=None, label=label)) return examples def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer): """Converts a single `InputExample` into a single `InputFeatures`.""" if isinstance(example, PaddingInputExample): return InputFeatures( input_ids=[0] * max_seq_length, input_mask=[0] * max_seq_length, segment_ids=[0] * max_seq_length, label_id=0, is_real_example=False) label_map = {} for (i, label) in enumerate(label_list): label_map[label] = i tokens_a = tokenizer.tokenize(example.text_a) tokens_b = None if example.text_b: tokens_b = tokenizer.tokenize(example.text_b) if tokens_b: # Modifies `tokens_a` and `tokens_b` in place so that the total # length is less than the specified length. # Account for [CLS], [SEP], [SEP] with "- 3" _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) else: # Account for [CLS] and [SEP] with "- 2" if len(tokens_a) > max_seq_length - 2: tokens_a = tokens_a[0:(max_seq_length - 2)] # The convention in ALBERT is: # (a) For sequence pairs: # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 # (b) For single sequences: # tokens: [CLS] the dog is hairy . [SEP] # type_ids: 0 0 0 0 0 0 0 # # Where "type_ids" are used to indicate whether this is the first # sequence or the second sequence. The embedding vectors for `type=0` and # `type=1` were learned during pre-training and are added to the wordpiece # embedding vector (and position vector). This is not *strictly* necessary # since the [SEP] token unambiguously separates the sequences, but it makes # it easier for the model to learn the concept of sequences. # # For classification tasks, the first vector (corresponding to [CLS]) is # used as the "sentence vector". Note that this only makes sense because # the entire model is fine-tuned. 
tokens = [] segment_ids = [] tokens.append("[CLS]") segment_ids.append(0) for token in tokens_a: tokens.append(token) segment_ids.append(0) tokens.append("[SEP]") segment_ids.append(0) if tokens_b: for token in tokens_b: tokens.append(token) segment_ids.append(1) tokens.append("[SEP]") segment_ids.append(1) input_ids = tokenizer.convert_tokens_to_ids(tokens) # The mask has 1 for real tokens and 0 for padding tokens. Only real # tokens are attended to. input_mask = [1] * len(input_ids) # Zero-pad up to the sequence length. while len(input_ids) < max_seq_length: input_ids.append(0) input_mask.append(0) segment_ids.append(0) assert len(input_ids) == max_seq_length assert len(input_mask) == max_seq_length assert len(segment_ids) == max_seq_length label_id = label_map[example.label] if ex_index < 5: tf.logging.info("*** Example ***") tf.logging.info("guid: %s" % (example.guid)) tf.logging.info("tokens: %s" % " ".join( [tokenization.printable_text(x) for x in tokens])) tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) feature = InputFeatures( input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id, is_real_example=True) return feature def file_based_convert_examples_to_features( examples, label_list, max_seq_length, tokenizer, output_file): """Convert a set of `InputExample`s to a TFRecord file.""" writer = tf.python_io.TFRecordWriter(output_file) for (ex_index, example) in enumerate(examples): if ex_index % 10000 == 0: tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer) def create_int_feature(values): f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) return f features = 
collections.OrderedDict() features["input_ids"] = create_int_feature(feature.input_ids) features["input_mask"] = create_int_feature(feature.input_mask) features["segment_ids"] = create_int_feature(feature.segment_ids) features["label_ids"] = create_int_feature([feature.label_id]) features["is_real_example"] = create_int_feature( [int(feature.is_real_example)]) tf_example = tf.train.Example(features=tf.train.Features(feature=features)) writer.write(tf_example.SerializeToString()) writer.close() def file_based_input_fn_builder(input_file, seq_length, is_training, drop_remainder): """Creates an `input_fn` closure to be passed to TPUEstimator.""" name_to_features = { "input_ids": tf.FixedLenFeature([seq_length], tf.int64), "input_mask": tf.FixedLenFeature([seq_length], tf.int64), "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), "label_ids": tf.FixedLenFeature([], tf.int64), "is_real_example": tf.FixedLenFeature([], tf.int64), } def _decode_record(record, name_to_features): """Decodes a record to a TensorFlow example.""" example = tf.parse_single_example(record, name_to_features) # tf.Example only supports tf.int64, but the TPU only supports tf.int32. # So cast all int64 to int32. for name in list(example.keys()): t = example[name] if t.dtype == tf.int64: t = tf.to_int32(t) example[name] = t return example def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] # For training, we want a lot of parallel reading and shuffling. # For eval, we want no shuffling and parallel reading doesn't matter. 
d = tf.data.TFRecordDataset(input_file) if is_training: d = d.repeat() d = d.shuffle(buffer_size=100) d = d.apply( tf.contrib.data.map_and_batch( lambda record: _decode_record(record, name_to_features), batch_size=batch_size, drop_remainder=drop_remainder)) return d return input_fn def _truncate_seq_pair(tokens_a, tokens_b, max_length): """Truncates a sequence pair in place to the maximum length.""" # This is a simple heuristic which will always truncate the longer sequence # one token at a time. This makes more sense than truncating an equal percent # of tokens from each, since if one sequence is very short then each token # that's truncated likely contains more information than a longer sequence. while True: total_length = len(tokens_a) + len(tokens_b) if total_length <= max_length: break if len(tokens_a) > len(tokens_b): tokens_a.pop() else: tokens_b.pop() def create_model(albert_config, is_training, input_ids, input_mask, segment_ids, labels, num_labels, use_one_hot_embeddings): """Creates a classification model.""" model = modeling.AlbertModel( config=albert_config, is_training=is_training, input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids, use_one_hot_embeddings=use_one_hot_embeddings) # In the demo, we are doing a simple classification task on the entire # segment. # # If you want to use the token-level output, use model.get_sequence_output() # instead. 
if FLAGS.use_pooled_output: tf.logging.info("using pooled output") output_layer = model.get_pooled_output() else: tf.logging.info("using meaned output") output_layer = tf.reduce_mean(model.get_sequence_output(), axis=1) hidden_size = output_layer.shape[-1].value output_weights = tf.get_variable( "output_weights", [num_labels, hidden_size], initializer=tf.truncated_normal_initializer(stddev=0.02)) output_bias = tf.get_variable( "output_bias", [num_labels], initializer=tf.zeros_initializer()) with tf.variable_scope("loss"): if is_training: # I.e., 0.1 dropout output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) logits = tf.matmul(output_layer, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) probabilities = tf.nn.softmax(logits, axis=-1) log_probs = tf.nn.log_softmax(logits, axis=-1) one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, probabilities, predictions) def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate, num_train_steps, num_warmup_steps, use_tpu, use_one_hot_embeddings): """Returns `model_fn` closure for TPUEstimator.""" def model_fn(features, labels, mode, params): # pylint: disable=unused-argument """The `model_fn` for TPUEstimator.""" tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] segment_ids = features["segment_ids"] label_ids = features["label_ids"] is_real_example = None if "is_real_example" in features: is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) else: is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) is_training = (mode == tf.estimator.ModeKeys.TRAIN) 
(total_loss, per_example_loss, probabilities, predictions) = \ create_model(albert_config, is_training, input_ids, input_mask, segment_ids, label_ids, num_labels, use_one_hot_embeddings) tvars = tf.trainable_variables() initialized_variable_names = {} scaffold_fn = None if init_checkpoint: (assignment_map, initialized_variable_names ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) if use_tpu: def tpu_scaffold(): tf.train.init_from_checkpoint(init_checkpoint, assignment_map) return tf.train.Scaffold() scaffold_fn = tpu_scaffold else: tf.train.init_from_checkpoint(init_checkpoint, assignment_map) tf.logging.info("**** Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, init_string) output_spec = None if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn) elif mode == tf.estimator.ModeKeys.EVAL: def metric_fn(per_example_loss, label_ids, predictions, is_real_example): accuracy = tf.metrics.accuracy( labels=label_ids, predictions=predictions, weights=is_real_example) loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) return { "eval_accuracy": accuracy, "eval_loss": loss, } eval_metrics = (metric_fn, [per_example_loss, label_ids, predictions, is_real_example]) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn) else: output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, predictions={"probabilities": probabilities, "predictions": predictions}, scaffold_fn=scaffold_fn) return output_spec return model_fn # This function is not used by this file but is still used by the Colab and # people 
who depend on it. def input_fn_builder(features, seq_length, is_training, drop_remainder): """Creates an `input_fn` closure to be passed to TPUEstimator.""" all_input_ids = [] all_input_mask = [] all_segment_ids = [] all_label_ids = [] for feature in features: all_input_ids.append(feature.input_ids) all_input_mask.append(feature.input_mask) all_segment_ids.append(feature.segment_ids) all_label_ids.append(feature.label_id) def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] num_examples = len(features) # This is for demo purposes and does NOT scale to large data sets. We do # not use Dataset.from_generator() because that uses tf.py_func which is # not TPU compatible. The right way to load data is with TFRecordReader. d = tf.data.Dataset.from_tensor_slices({ "input_ids": tf.constant( all_input_ids, shape=[num_examples, seq_length], dtype=tf.int32), "input_mask": tf.constant( all_input_mask, shape=[num_examples, seq_length], dtype=tf.int32), "segment_ids": tf.constant( all_segment_ids, shape=[num_examples, seq_length], dtype=tf.int32), "label_ids": tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), }) if is_training: d = d.repeat() d = d.shuffle(buffer_size=100) d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) return d return input_fn # This function is not used by this file but is still used by the Colab and # people who depend on it. 
def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer): """Convert a set of `InputExample`s to a list of `InputFeatures`.""" features = [] for (ex_index, example) in enumerate(examples): if ex_index % 10000 == 0: tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer) features.append(feature) return features def main(_): tf.logging.set_verbosity(tf.logging.INFO) processors = { "cola": ColaProcessor, "mnli": MnliProcessor, "mrpc": MrpcProcessor, "xnli": XnliProcessor, "lcqmc_pair": LCQMCPairClassificationProcessor } tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, FLAGS.init_checkpoint) if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: raise ValueError( "At least one of `do_train`, `do_eval` or `do_predict` must be True.") albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file) if FLAGS.max_seq_length > albert_config.max_position_embeddings: raise ValueError( "Cannot use sequence length %d because the ALBERT model " "was only trained up to sequence length %d" % (FLAGS.max_seq_length, albert_config.max_position_embeddings)) tf.gfile.MakeDirs(FLAGS.output_dir) task_name = FLAGS.task_name.lower() if task_name not in processors: raise ValueError("Task not found: %s" % (task_name)) processor = processors[task_name]() label_list = processor.get_labels() tokenizer = tokenization.FullTokenizer( vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case, spm_model_file=FLAGS.spm_model_file) tpu_cluster_resolver = None if FLAGS.use_tpu and FLAGS.tpu_name: tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 run_config = tf.contrib.tpu.RunConfig( cluster=tpu_cluster_resolver, master=FLAGS.master, model_dir=FLAGS.output_dir, 
save_checkpoints_steps=FLAGS.save_checkpoints_steps, tpu_config=tf.contrib.tpu.TPUConfig( iterations_per_loop=FLAGS.iterations_per_loop, num_shards=FLAGS.num_tpu_cores, per_host_input_for_training=is_per_host)) train_examples = None num_train_steps = None num_warmup_steps = None if FLAGS.do_train: train_examples = processor.get_train_examples(FLAGS.data_dir) num_train_steps = int( len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) model_fn = model_fn_builder( albert_config=albert_config, num_labels=len(label_list), init_checkpoint=FLAGS.init_checkpoint, learning_rate=FLAGS.learning_rate, num_train_steps=num_train_steps, num_warmup_steps=num_warmup_steps, use_tpu=FLAGS.use_tpu, use_one_hot_embeddings=FLAGS.use_tpu) # If TPU is not available, this will fall back to normal Estimator on CPU # or GPU. estimator = tf.contrib.tpu.TPUEstimator( use_tpu=FLAGS.use_tpu, model_fn=model_fn, config=run_config, train_batch_size=FLAGS.train_batch_size, eval_batch_size=FLAGS.eval_batch_size, predict_batch_size=FLAGS.predict_batch_size) if FLAGS.do_train: train_file = os.path.join(FLAGS.output_dir, "train.tf_record") file_based_convert_examples_to_features( train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) tf.logging.info("***** Running training *****") tf.logging.info(" Num examples = %d", len(train_examples)) tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) tf.logging.info(" Num steps = %d", num_train_steps) train_input_fn = file_based_input_fn_builder( input_file=train_file, seq_length=FLAGS.max_seq_length, is_training=True, drop_remainder=True) estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) if FLAGS.do_eval: eval_examples = processor.get_dev_examples(FLAGS.data_dir) num_actual_eval_examples = len(eval_examples) if FLAGS.use_tpu: # TPU requires a fixed batch size for all batches, therefore the number # of examples must be a multiple of the 
batch size, or else examples # will get dropped. So we pad with fake examples which are ignored # later on. These do NOT count towards the metric (all tf.metrics # support a per-instance weight, and these get a weight of 0.0). while len(eval_examples) % FLAGS.eval_batch_size != 0: eval_examples.append(PaddingInputExample()) eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") file_based_convert_examples_to_features( eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) tf.logging.info("***** Running evaluation *****") tf.logging.info(" Num examples = %d (%d actual, %d padding)", len(eval_examples), num_actual_eval_examples, len(eval_examples) - num_actual_eval_examples) tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) # This tells the estimator to run through the entire set. eval_steps = None # However, if running eval on the TPU, you will need to specify the # number of steps. if FLAGS.use_tpu: assert len(eval_examples) % FLAGS.eval_batch_size == 0 eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) eval_drop_remainder = True if FLAGS.use_tpu else False eval_input_fn = file_based_input_fn_builder( input_file=eval_file, seq_length=FLAGS.max_seq_length, is_training=False, drop_remainder=eval_drop_remainder) ####################################################################################################################### # evaluate all checkpoints; you can use the checkpoint with the best dev accuracy steps_and_files = [] filenames = tf.gfile.ListDirectory(FLAGS.output_dir) for filename in filenames: if filename.endswith(".index"): ckpt_name = filename[:-6] cur_filename = os.path.join(FLAGS.output_dir, ckpt_name) global_step = int(cur_filename.split("-")[-1]) tf.logging.info("Add {} to eval list.".format(cur_filename)) steps_and_files.append([global_step, cur_filename]) steps_and_files = sorted(steps_and_files, key=lambda x: x[0]) output_eval_file = os.path.join(FLAGS.data_dir, "eval_results_albert_zh.txt") 
print("output_eval_file:",output_eval_file) tf.logging.info("output_eval_file:"+output_eval_file) with tf.gfile.GFile(output_eval_file, "w") as writer: for global_step, filename in sorted(steps_and_files, key=lambda x: x[0]): result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps, checkpoint_path=filename) tf.logging.info("***** Eval results %s *****" % (filename)) writer.write("***** Eval results %s *****\n" % (filename)) for key in sorted(result.keys()): tf.logging.info(" %s = %s", key, str(result[key])) writer.write("%s = %s\n" % (key, str(result[key]))) ####################################################################################################################### # result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) # output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") # with tf.gfile.GFile(output_eval_file, "w") as writer: # tf.logging.info("***** Eval results *****") # for key in sorted(result.keys()): # tf.logging.info(" %s = %s", key, str(result[key])) # writer.write("%s = %s\n" % (key, str(result[key]))) if FLAGS.do_predict: predict_examples = processor.get_test_examples(FLAGS.data_dir) num_actual_predict_examples = len(predict_examples) if FLAGS.use_tpu: # TPU requires a fixed batch size for all batches, therefore the number # of examples must be a multiple of the batch size, or else examples # will get dropped. So we pad with fake examples which are ignored # later on. 
while len(predict_examples) % FLAGS.predict_batch_size != 0: predict_examples.append(PaddingInputExample()) predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") file_based_convert_examples_to_features(predict_examples, label_list, FLAGS.max_seq_length, tokenizer, predict_file) tf.logging.info("***** Running prediction*****") tf.logging.info(" Num examples = %d (%d actual, %d padding)", len(predict_examples), num_actual_predict_examples, len(predict_examples) - num_actual_predict_examples) tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) predict_drop_remainder = True if FLAGS.use_tpu else False predict_input_fn = file_based_input_fn_builder( input_file=predict_file, seq_length=FLAGS.max_seq_length, is_training=False, drop_remainder=predict_drop_remainder) result = estimator.predict(input_fn=predict_input_fn) output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") output_submit_file = os.path.join(FLAGS.output_dir, "submit_results.tsv") with tf.gfile.GFile(output_predict_file, "w") as pred_writer,\ tf.gfile.GFile(output_submit_file, "w") as sub_writer: num_written_lines = 0 tf.logging.info("***** Predict results *****") for (i, (example, prediction)) in\ enumerate(zip(predict_examples, result)): probabilities = prediction["probabilities"] if i >= num_actual_predict_examples: break output_line = "\t".join( str(class_probability) for class_probability in probabilities) + "\n" pred_writer.write(output_line) actual_label = label_list[int(prediction["predictions"])] sub_writer.write( six.ensure_str(example.guid) + "\t" + actual_label + "\n") num_written_lines += 1 assert num_written_lines == num_actual_predict_examples if __name__ == "__main__": flags.mark_flag_as_required("data_dir") flags.mark_flag_as_required("task_name") flags.mark_flag_as_required("vocab_file") flags.mark_flag_as_required("albert_config_file") flags.mark_flag_as_required("output_dir") tf.app.run() ================================================ FILE: 
run_pretraining.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Run masked LM/next sentence masked_lm pre-training for BERT.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import os import modeling import optimization import tensorflow as tf flags = tf.flags FLAGS = flags.FLAGS ## Required parameters flags.DEFINE_string( "bert_config_file", None, "The config json file corresponding to the pre-trained BERT model. " "This specifies the model architecture.") flags.DEFINE_string( "input_file", None, "Input TF example files (can be a glob or comma separated).") flags.DEFINE_string( "output_dir", None, "The output directory where the model checkpoints will be written.") ## Other parameters flags.DEFINE_string( "init_checkpoint", None, "Initial checkpoint (usually from a pre-trained BERT model).") flags.DEFINE_integer( "max_seq_length", 128, "The maximum total input sequence length after WordPiece tokenization. " "Sequences longer than this will be truncated, and sequences shorter " "than this will be padded. Must match data generation.") flags.DEFINE_integer( "max_predictions_per_seq", 20, "Maximum number of masked LM predictions per sequence. 
" "Must match data generation.") flags.DEFINE_bool("do_train", False, "Whether to run training.") flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.") flags.DEFINE_integer("num_warmup_steps", 10000, "Number of warmup steps.") flags.DEFINE_integer("save_checkpoints_steps", 1000, "How often to save the model checkpoint.") flags.DEFINE_integer("iterations_per_loop", 1000, "How many steps to make in each estimator call.") flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") tf.flags.DEFINE_string( "tpu_name", None, "The Cloud TPU to use for training. This should be either the name " "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " "url.") tf.flags.DEFINE_string( "tpu_zone", None, "[Optional] GCE zone where the Cloud TPU is located in. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string( "gcp_project", None, "[Optional] Project name for the Cloud TPU-enabled project. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") flags.DEFINE_integer( "num_tpu_cores", 8, "Only used if `use_tpu` is True. 
Total number of TPU cores to use.") def model_fn_builder(bert_config, init_checkpoint, learning_rate, num_train_steps, num_warmup_steps, use_tpu, use_one_hot_embeddings): """Returns `model_fn` closure for TPUEstimator.""" def model_fn(features, labels, mode, params): # pylint: disable=unused-argument """The `model_fn` for TPUEstimator.""" tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] segment_ids = features["segment_ids"] masked_lm_positions = features["masked_lm_positions"] masked_lm_ids = features["masked_lm_ids"] masked_lm_weights = features["masked_lm_weights"] next_sentence_labels = features["next_sentence_labels"] is_training = (mode == tf.estimator.ModeKeys.TRAIN) model = modeling.BertModel( config=bert_config, is_training=is_training, input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids, use_one_hot_embeddings=use_one_hot_embeddings) (masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output( bert_config, model.get_sequence_output(), model.get_embedding_table(),model.get_embedding_table_2(), masked_lm_positions, masked_lm_ids, masked_lm_weights) (next_sentence_loss, next_sentence_example_loss, next_sentence_log_probs) = get_next_sentence_output( bert_config, model.get_pooled_output(), next_sentence_labels) total_loss = masked_lm_loss + next_sentence_loss tvars = tf.trainable_variables() initialized_variable_names = {} print("init_checkpoint:",init_checkpoint) scaffold_fn = None if init_checkpoint: (assignment_map, initialized_variable_names ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) if use_tpu: def tpu_scaffold(): tf.train.init_from_checkpoint(init_checkpoint, assignment_map) return tf.train.Scaffold() scaffold_fn = tpu_scaffold else: tf.train.init_from_checkpoint(init_checkpoint, assignment_map) tf.logging.info("**** 
Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, init_string) output_spec = None if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn) elif mode == tf.estimator.ModeKeys.EVAL: def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, masked_lm_weights, next_sentence_example_loss, next_sentence_log_probs, next_sentence_labels): """Computes the loss and accuracy of the model.""" masked_lm_log_probs = tf.reshape(masked_lm_log_probs,[-1, masked_lm_log_probs.shape[-1]]) masked_lm_predictions = tf.argmax(masked_lm_log_probs, axis=-1, output_type=tf.int32) masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1]) masked_lm_ids = tf.reshape(masked_lm_ids, [-1]) masked_lm_weights = tf.reshape(masked_lm_weights, [-1]) masked_lm_accuracy = tf.metrics.accuracy( labels=masked_lm_ids, predictions=masked_lm_predictions, weights=masked_lm_weights) masked_lm_mean_loss = tf.metrics.mean( values=masked_lm_example_loss, weights=masked_lm_weights) next_sentence_log_probs = tf.reshape( next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]]) next_sentence_predictions = tf.argmax( next_sentence_log_probs, axis=-1, output_type=tf.int32) next_sentence_labels = tf.reshape(next_sentence_labels, [-1]) next_sentence_accuracy = tf.metrics.accuracy( labels=next_sentence_labels, predictions=next_sentence_predictions) next_sentence_mean_loss = tf.metrics.mean( values=next_sentence_example_loss) return { "masked_lm_accuracy": masked_lm_accuracy, "masked_lm_loss": masked_lm_mean_loss, "next_sentence_accuracy": next_sentence_accuracy, "next_sentence_loss": next_sentence_mean_loss, } # 
next_sentence_example_loss=0.0 TODO # next_sentence_log_probs=0.0 # TODO eval_metrics = (metric_fn, [ masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, masked_lm_weights, next_sentence_example_loss, next_sentence_log_probs, next_sentence_labels ]) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn) else: raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode)) return output_spec return model_fn def get_masked_lm_output(bert_config, input_tensor, output_weights,project_weights, positions, label_ids, label_weights): """Get loss and log probs for the masked LM.""" input_tensor = gather_indexes(input_tensor, positions) with tf.variable_scope("cls/predictions"): # We apply one more non-linear transformation before the output layer. # This matrix is not used after pre-training. with tf.variable_scope("transform"): input_tensor = tf.layers.dense( input_tensor, units=bert_config.hidden_size, activation=modeling.get_activation(bert_config.hidden_act), kernel_initializer=modeling.create_initializer( bert_config.initializer_range)) input_tensor = modeling.layer_norm(input_tensor) # The output weights are the same as the input embeddings, but there is # an output-only bias for each token. 
output_bias = tf.get_variable( "output_bias", shape=[bert_config.vocab_size], initializer=tf.zeros_initializer()) # logits = tf.matmul(input_tensor, output_weights, transpose_b=True) # input_tensor=[-1,hidden_size], project_weights=[embedding_size, hidden_size], project_weights_transpose=[hidden_size, embedding_size]--->[-1, embedding_size] input_project = tf.matmul(input_tensor, project_weights, transpose_b=True) logits = tf.matmul(input_project, output_weights, transpose_b=True) # # input_project=[-1, embedding_size], output_weights=[vocab_size, embedding_size], output_weights_transpose=[embedding_size, vocab_size] ---> [-1, vocab_size] logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) label_ids = tf.reshape(label_ids, [-1]) label_weights = tf.reshape(label_weights, [-1]) one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32) # The `positions` tensor might be zero-padded (if the sequence is too # short to have the maximum number of predictions). The `label_weights` # tensor has a value of 1.0 for every real prediction and 0.0 for the # padding predictions. per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1]) numerator = tf.reduce_sum(label_weights * per_example_loss) denominator = tf.reduce_sum(label_weights) + 1e-5 loss = numerator / denominator return (loss, per_example_loss, log_probs) def get_next_sentence_output(bert_config, input_tensor, labels): """Get loss and log probs for the next sentence prediction.""" # Simple binary classification. Note that 0 is "next sentence" and 1 is # "random sentence". This weight matrix is not used after pre-training. 
with tf.variable_scope("cls/seq_relationship"): output_weights = tf.get_variable( "output_weights", shape=[2, bert_config.hidden_size], initializer=modeling.create_initializer(bert_config.initializer_range)) output_bias = tf.get_variable( "output_bias", shape=[2], initializer=tf.zeros_initializer()) logits = tf.matmul(input_tensor, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) labels = tf.reshape(labels, [-1]) one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, log_probs) def gather_indexes(sequence_tensor, positions): """Gathers the vectors at the specific positions over a minibatch.""" sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3) batch_size = sequence_shape[0] seq_length = sequence_shape[1] width = sequence_shape[2] flat_offsets = tf.reshape( tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1]) flat_positions = tf.reshape(positions + flat_offsets, [-1]) flat_sequence_tensor = tf.reshape(sequence_tensor, [batch_size * seq_length, width]) output_tensor = tf.gather(flat_sequence_tensor, flat_positions) return output_tensor def input_fn_builder(input_files, max_seq_length, max_predictions_per_seq, is_training, num_cpu_threads=4): """Creates an `input_fn` closure to be passed to TPUEstimator.""" def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] name_to_features = { "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64), "input_mask": tf.FixedLenFeature([max_seq_length], tf.int64), "segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64), "masked_lm_positions": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_ids": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_weights": 
tf.FixedLenFeature([max_predictions_per_seq], tf.float32), "next_sentence_labels": tf.FixedLenFeature([1], tf.int64), } # For training, we want a lot of parallel reading and shuffling. # For eval, we want no shuffling and parallel reading doesn't matter. if is_training: d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files)) d = d.repeat() d = d.shuffle(buffer_size=len(input_files)) # `cycle_length` is the number of parallel files that get read. cycle_length = min(num_cpu_threads, len(input_files)) # `sloppy` mode means that the interleaving is not exact. This adds # even more randomness to the training pipeline. d = d.apply( tf.contrib.data.parallel_interleave( tf.data.TFRecordDataset, sloppy=is_training, cycle_length=cycle_length)) d = d.shuffle(buffer_size=100) else: d = tf.data.TFRecordDataset(input_files) # Since we evaluate for a fixed number of steps we don't want to encounter # out-of-range exceptions. d = d.repeat() # We must `drop_remainder` on training because the TPU requires fixed # size dimensions. For eval, we assume we are evaluating on the CPU or GPU # and we *don't* want to drop the remainder, otherwise we won't cover # every sample. d = d.apply( tf.contrib.data.map_and_batch( lambda record: _decode_record(record, name_to_features), batch_size=batch_size, num_parallel_batches=num_cpu_threads, drop_remainder=True)) return d return input_fn def _decode_record(record, name_to_features): """Decodes a record to a TensorFlow example.""" example = tf.parse_single_example(record, name_to_features) # tf.Example only supports tf.int64, but the TPU only supports tf.int32. # So cast all int64 to int32. 
for name in list(example.keys()): t = example[name] if t.dtype == tf.int64: t = tf.to_int32(t) example[name] = t return example def main(_): tf.logging.set_verbosity(tf.logging.INFO) if not FLAGS.do_train and not FLAGS.do_eval: # at least one of training or evaluation must be requested raise ValueError("At least one of `do_train` or `do_eval` must be True.") bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) # load the model configuration from the JSON file tf.gfile.MakeDirs(FLAGS.output_dir) input_files = [] # the input may be several comma-separated files, or a glob pattern such as "input_x*" for input_pattern in FLAGS.input_file.split(","): input_files.extend(tf.gfile.Glob(input_pattern)) tf.logging.info("*** Input Files ***") for input_file in input_files: tf.logging.info(" %s" % input_file) tpu_cluster_resolver = None if FLAGS.use_tpu and FLAGS.tpu_name: tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( # TODO tpu=FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) print("###tpu_cluster_resolver:",tpu_cluster_resolver,";FLAGS.use_tpu:",FLAGS.use_tpu,";FLAGS.tpu_name:",FLAGS.tpu_name,";FLAGS.tpu_zone:",FLAGS.tpu_zone) # ###tpu_cluster_resolver: ;FLAGS.use_tpu: True ;FLAGS.tpu_name: grpc://10.240.1.83:8470 is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 run_config = tf.contrib.tpu.RunConfig( keep_checkpoint_max=20, # 10 cluster=tpu_cluster_resolver, master=FLAGS.master, model_dir=FLAGS.output_dir, save_checkpoints_steps=FLAGS.save_checkpoints_steps, tpu_config=tf.contrib.tpu.TPUConfig( iterations_per_loop=FLAGS.iterations_per_loop, num_shards=FLAGS.num_tpu_cores, per_host_input_for_training=is_per_host)) model_fn = model_fn_builder( bert_config=bert_config, init_checkpoint=FLAGS.init_checkpoint, learning_rate=FLAGS.learning_rate, num_train_steps=FLAGS.num_train_steps, num_warmup_steps=FLAGS.num_warmup_steps, use_tpu=FLAGS.use_tpu, use_one_hot_embeddings=FLAGS.use_tpu) # If TPU is not available, this will fall back to normal Estimator on CPU # or GPU. 
estimator = tf.contrib.tpu.TPUEstimator( use_tpu=FLAGS.use_tpu, model_fn=model_fn, config=run_config, train_batch_size=FLAGS.train_batch_size, eval_batch_size=FLAGS.eval_batch_size) if FLAGS.do_train: tf.logging.info("***** Running training *****") tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) train_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=True) estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps) if FLAGS.do_eval: tf.logging.info("***** Running evaluation *****") tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) eval_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=False) result = estimator.evaluate(input_fn=eval_input_fn, steps=FLAGS.max_eval_steps) output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") with tf.gfile.GFile(output_eval_file, "w") as writer: tf.logging.info("***** Eval results *****") for key in sorted(result.keys()): tf.logging.info(" %s = %s", key, str(result[key])) writer.write("%s = %s\n" % (key, str(result[key]))) if __name__ == "__main__": flags.mark_flag_as_required("input_file") flags.mark_flag_as_required("bert_config_file") flags.mark_flag_as_required("output_dir") tf.app.run() ================================================ FILE: run_pretraining_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 """Run masked LM/sentence order prediction pre-training for ALBERT.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import os import time from six.moves import range import tensorflow as tf import modeling_google as modeling import optimization_google as optimization flags = tf.flags FLAGS = flags.FLAGS ## Required parameters flags.DEFINE_string( "albert_config_file", None, "The config json file corresponding to the pre-trained ALBERT model. " "This specifies the model architecture.") flags.DEFINE_string( "input_file", None, "Input TF example files (can be a glob or comma separated).") flags.DEFINE_string( "output_dir", None, "The output directory where the model checkpoints will be written.") flags.DEFINE_string( "export_dir", None, "The output directory where the saved models will be written.") ## Other parameters flags.DEFINE_string( "init_checkpoint", None, "Initial checkpoint (usually from a pre-trained ALBERT model).") flags.DEFINE_integer( "max_seq_length", 512, "The maximum total input sequence length after WordPiece tokenization. " "Sequences longer than this will be truncated, and sequences shorter " "than this will be padded. Must match data generation.") flags.DEFINE_integer( "max_predictions_per_seq", 20, "Maximum number of masked LM predictions per sequence. 
" "Must match data generation.") flags.DEFINE_bool("do_train", True, "Whether to run training.") flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") flags.DEFINE_integer("train_batch_size", 4096, "Total batch size for training.") flags.DEFINE_integer("eval_batch_size", 64, "Total batch size for eval.") flags.DEFINE_enum("optimizer", "lamb", ["adamw", "lamb"], "The optimizer for training.") flags.DEFINE_float("learning_rate", 0.00176, "The initial learning rate.") flags.DEFINE_float("poly_power", 1.0, "The power of poly decay.") flags.DEFINE_integer("num_train_steps", 125000, "Number of training steps.") flags.DEFINE_integer("num_warmup_steps", 3125, "Number of warmup steps.") flags.DEFINE_integer("start_warmup_step", 0, "The starting step of warmup.") flags.DEFINE_integer("save_checkpoints_steps", 5000, "How often to save the model checkpoint.") flags.DEFINE_integer("iterations_per_loop", 1000, "How many steps to make in each estimator call.") flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") flags.DEFINE_bool("init_from_group0", False, "Whether to initialize " "parameters of other groups from group 0") tf.flags.DEFINE_string( "tpu_name", None, "The Cloud TPU to use for training. This should be either the name " "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " "url.") tf.flags.DEFINE_string( "tpu_zone", None, "[Optional] GCE zone where the Cloud TPU is located in. If not " "specified, we will attempt to automatically detect the GCE zone from " "metadata.") tf.flags.DEFINE_string( "gcp_project", None, "[Optional] Project name for the Cloud TPU-enabled project. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") flags.DEFINE_integer( "num_tpu_cores", 8, "Only used if `use_tpu` is True. 
Total number of TPU cores to use.") flags.DEFINE_float( "masked_lm_budget", 0, "If >0, the ratio of masked ngrams to unmasked ngrams. Default 0, " "for offline masking") def model_fn_builder(albert_config, init_checkpoint, learning_rate, num_train_steps, num_warmup_steps, use_tpu, use_one_hot_embeddings, optimizer, poly_power, start_warmup_step): """Returns `model_fn` closure for TPUEstimator.""" def model_fn(features, labels, mode, params): # pylint: disable=unused-argument """The `model_fn` for TPUEstimator.""" tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] segment_ids = features["segment_ids"] masked_lm_positions = features["masked_lm_positions"] masked_lm_ids = features["masked_lm_ids"] masked_lm_weights = features["masked_lm_weights"] # Note: We keep this feature name `next_sentence_labels` to be compatible # with the original data created by lanzhzh@. However, in the ALBERT case # it does represent sentence_order_labels.
sentence_order_labels = features["next_sentence_labels"] is_training = (mode == tf.estimator.ModeKeys.TRAIN) model = modeling.AlbertModel( config=albert_config, is_training=is_training, input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids, use_one_hot_embeddings=use_one_hot_embeddings) (masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(albert_config, model.get_sequence_output(), model.get_embedding_table(), masked_lm_positions, masked_lm_ids, masked_lm_weights) (sentence_order_loss, sentence_order_example_loss, sentence_order_log_probs) = get_sentence_order_output( albert_config, model.get_pooled_output(), sentence_order_labels) total_loss = masked_lm_loss + sentence_order_loss tvars = tf.trainable_variables() initialized_variable_names = {} scaffold_fn = None if init_checkpoint: tf.logging.info("number of hidden group %d to initialize", albert_config.num_hidden_groups) num_of_initialize_group = 1 if FLAGS.init_from_group0: num_of_initialize_group = albert_config.num_hidden_groups if albert_config.net_structure_type > 0: num_of_initialize_group = albert_config.num_hidden_layers (assignment_map, initialized_variable_names ) = modeling.get_assignment_map_from_checkpoint( tvars, init_checkpoint, num_of_initialize_group) if use_tpu: def tpu_scaffold(): for gid in range(num_of_initialize_group): tf.logging.info("initialize the %dth layer", gid) tf.logging.info(assignment_map[gid]) tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid]) return tf.train.Scaffold() scaffold_fn = tpu_scaffold else: for gid in range(num_of_initialize_group): tf.logging.info("initialize the %dth layer", gid) tf.logging.info(assignment_map[gid]) tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid]) tf.logging.info("**** Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, 
var.shape, init_string) output_spec = None if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu, optimizer, poly_power, start_warmup_step) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn) elif mode == tf.estimator.ModeKeys.EVAL: def metric_fn(*args): """Computes the loss and accuracy of the model.""" (masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, masked_lm_weights, sentence_order_example_loss, sentence_order_log_probs, sentence_order_labels) = args[:7] masked_lm_log_probs = tf.reshape(masked_lm_log_probs, [-1, masked_lm_log_probs.shape[-1]]) masked_lm_predictions = tf.argmax( masked_lm_log_probs, axis=-1, output_type=tf.int32) masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1]) masked_lm_ids = tf.reshape(masked_lm_ids, [-1]) masked_lm_weights = tf.reshape(masked_lm_weights, [-1]) masked_lm_accuracy = tf.metrics.accuracy( labels=masked_lm_ids, predictions=masked_lm_predictions, weights=masked_lm_weights) masked_lm_mean_loss = tf.metrics.mean( values=masked_lm_example_loss, weights=masked_lm_weights) metrics = { "masked_lm_accuracy": masked_lm_accuracy, "masked_lm_loss": masked_lm_mean_loss, } sentence_order_log_probs = tf.reshape( sentence_order_log_probs, [-1, sentence_order_log_probs.shape[-1]]) sentence_order_predictions = tf.argmax( sentence_order_log_probs, axis=-1, output_type=tf.int32) sentence_order_labels = tf.reshape(sentence_order_labels, [-1]) sentence_order_accuracy = tf.metrics.accuracy( labels=sentence_order_labels, predictions=sentence_order_predictions) sentence_order_mean_loss = tf.metrics.mean( values=sentence_order_example_loss) metrics.update({ "sentence_order_accuracy": sentence_order_accuracy, "sentence_order_loss": sentence_order_mean_loss }) return metrics metric_values = [ masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, 
masked_lm_weights, sentence_order_example_loss, sentence_order_log_probs, sentence_order_labels ] eval_metrics = (metric_fn, metric_values) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn) else: raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode)) return output_spec return model_fn def get_masked_lm_output(albert_config, input_tensor, output_weights, positions, label_ids, label_weights): """Get loss and log probs for the masked LM.""" input_tensor = gather_indexes(input_tensor, positions) with tf.variable_scope("cls/predictions"): # We apply one more non-linear transformation before the output layer. # This matrix is not used after pre-training. with tf.variable_scope("transform"): input_tensor = tf.layers.dense( input_tensor, units=albert_config.embedding_size, activation=modeling.get_activation(albert_config.hidden_act), kernel_initializer=modeling.create_initializer( albert_config.initializer_range)) input_tensor = modeling.layer_norm(input_tensor) # The output weights are the same as the input embeddings, but there is # an output-only bias for each token. output_bias = tf.get_variable( "output_bias", shape=[albert_config.vocab_size], initializer=tf.zeros_initializer()) logits = tf.matmul(input_tensor, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) label_ids = tf.reshape(label_ids, [-1]) label_weights = tf.reshape(label_weights, [-1]) one_hot_labels = tf.one_hot( label_ids, depth=albert_config.vocab_size, dtype=tf.float32) # The `positions` tensor might be zero-padded (if the sequence is too # short to have the maximum number of predictions). The `label_weights` # tensor has a value of 1.0 for every real prediction and 0.0 for the # padding predictions. 
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1]) numerator = tf.reduce_sum(label_weights * per_example_loss) denominator = tf.reduce_sum(label_weights) + 1e-5 loss = numerator / denominator return (loss, per_example_loss, log_probs) def get_sentence_order_output(albert_config, input_tensor, labels): """Get loss and log probs for the sentence order prediction.""" # Simple binary classification. Note that 0 is "next sentence" and 1 is # "random sentence". This weight matrix is not used after pre-training. with tf.variable_scope("cls/seq_relationship"): output_weights = tf.get_variable( "output_weights", shape=[2, albert_config.hidden_size], initializer=modeling.create_initializer( albert_config.initializer_range)) output_bias = tf.get_variable( "output_bias", shape=[2], initializer=tf.zeros_initializer()) logits = tf.matmul(input_tensor, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) labels = tf.reshape(labels, [-1]) one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, log_probs) def gather_indexes(sequence_tensor, positions): """Gathers the vectors at the specific positions over a minibatch.""" sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3) batch_size = sequence_shape[0] seq_length = sequence_shape[1] width = sequence_shape[2] flat_offsets = tf.reshape( tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1]) flat_positions = tf.reshape(positions + flat_offsets, [-1]) flat_sequence_tensor = tf.reshape(sequence_tensor, [batch_size * seq_length, width]) output_tensor = tf.gather(flat_sequence_tensor, flat_positions) return output_tensor def input_fn_builder(input_files, max_seq_length, max_predictions_per_seq, is_training, num_cpu_threads=4): """Creates an `input_fn` closure to be 
passed to TPUEstimator.""" def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] name_to_features = { "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64), "input_mask": tf.FixedLenFeature([max_seq_length], tf.int64), "segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64), # Note: We keep this feature name `next_sentence_labels` to be # compatible with the original data created by lanzhzh@. However, in # the ALBERT case it does represent sentence_order_labels. "next_sentence_labels": tf.FixedLenFeature([1], tf.int64), } if FLAGS.masked_lm_budget: name_to_features.update({ "token_boundary": tf.FixedLenFeature([max_seq_length], tf.int64)}) else: name_to_features.update({ "masked_lm_positions": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_ids": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_weights": tf.FixedLenFeature([max_predictions_per_seq], tf.float32)}) # For training, we want a lot of parallel reading and shuffling. # For eval, we want no shuffling and parallel reading doesn't matter. if is_training: d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files)) d = d.repeat() d = d.shuffle(buffer_size=len(input_files)) # `cycle_length` is the number of parallel files that get read. cycle_length = min(num_cpu_threads, len(input_files)) # `sloppy` mode means that the interleaving is not exact. This adds # even more randomness to the training pipeline. d = d.apply( tf.contrib.data.parallel_interleave( tf.data.TFRecordDataset, sloppy=is_training, cycle_length=cycle_length)) d = d.shuffle(buffer_size=100) else: d = tf.data.TFRecordDataset(input_files) # Since we evaluate for a fixed number of steps we don't want to encounter # out-of-range exceptions. d = d.repeat() # We must `drop_remainder` on training because the TPU requires fixed # size dimensions. 
For eval, we assume we are evaluating on the CPU or GPU # and we *don't* want to drop the remainder, otherwise we won't cover # every sample. d = d.apply( tf.data.experimental.map_and_batch_with_legacy_function( lambda record: _decode_record(record, name_to_features), batch_size=batch_size, num_parallel_batches=num_cpu_threads, drop_remainder=True)) tf.logging.info(d) return d return input_fn def _decode_record(record, name_to_features): """Decodes a record to a TensorFlow example.""" example = tf.parse_single_example(record, name_to_features) # tf.Example only supports tf.int64, but the TPU only supports tf.int32. # So cast all int64 to int32. for name in list(example.keys()): t = example[name] if t.dtype == tf.int64: t = tf.to_int32(t) example[name] = t return example def main(_): tf.logging.set_verbosity(tf.logging.INFO) if not FLAGS.do_train and not FLAGS.do_eval: raise ValueError("At least one of `do_train` or `do_eval` must be True.") albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file) tf.gfile.MakeDirs(FLAGS.output_dir) input_files = [] for input_pattern in FLAGS.input_file.split(","): input_files.extend(tf.gfile.Glob(input_pattern)) tf.logging.info("*** Input Files ***") for input_file in input_files: tf.logging.info(" %s" % input_file) tpu_cluster_resolver = None if FLAGS.use_tpu and FLAGS.tpu_name: tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 run_config = tf.contrib.tpu.RunConfig( cluster=tpu_cluster_resolver, master=FLAGS.master, model_dir=FLAGS.output_dir, save_checkpoints_steps=FLAGS.save_checkpoints_steps, tpu_config=tf.contrib.tpu.TPUConfig( iterations_per_loop=FLAGS.iterations_per_loop, num_shards=FLAGS.num_tpu_cores, per_host_input_for_training=is_per_host)) model_fn = model_fn_builder( albert_config=albert_config, init_checkpoint=FLAGS.init_checkpoint, 
learning_rate=FLAGS.learning_rate, num_train_steps=FLAGS.num_train_steps, num_warmup_steps=FLAGS.num_warmup_steps, use_tpu=FLAGS.use_tpu, use_one_hot_embeddings=FLAGS.use_tpu, optimizer=FLAGS.optimizer, poly_power=FLAGS.poly_power, start_warmup_step=FLAGS.start_warmup_step) # If TPU is not available, this will fall back to normal Estimator on CPU # or GPU. estimator = tf.contrib.tpu.TPUEstimator( use_tpu=FLAGS.use_tpu, model_fn=model_fn, config=run_config, train_batch_size=FLAGS.train_batch_size, eval_batch_size=FLAGS.eval_batch_size) if FLAGS.do_train: tf.logging.info("***** Running training *****") tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) train_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=True) estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps) if FLAGS.do_eval: tf.logging.info("***** Running evaluation *****") tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) global_step = -1 output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") writer = tf.gfile.GFile(output_eval_file, "w") tf.gfile.MakeDirs(FLAGS.export_dir) eval_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=False) while global_step < FLAGS.num_train_steps: if estimator.latest_checkpoint() is None: tf.logging.info("No checkpoint found yet. 
Sleeping.") time.sleep(1) else: result = estimator.evaluate( input_fn=eval_input_fn, steps=FLAGS.max_eval_steps) global_step = result["global_step"] tf.logging.info("***** Eval results *****") for key in sorted(result.keys()): tf.logging.info(" %s = %s", key, str(result[key])) writer.write("%s = %s\n" % (key, str(result[key]))) if __name__ == "__main__": flags.mark_flag_as_required("input_file") flags.mark_flag_as_required("albert_config_file") flags.mark_flag_as_required("output_dir") tf.app.run() ================================================ FILE: run_pretraining_google_fast.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 """Run masked LM/next sentence masked_lm pre-training for ALBERT.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import os import time from six.moves import range import tensorflow as tf import modeling_google_fast as modeling import optimization_google as optimization flags = tf.flags FLAGS = flags.FLAGS ## Required parameters flags.DEFINE_string( "albert_config_file", None, "The config json file corresponding to the pre-trained ALBERT model. 
" "This specifies the model architecture.") flags.DEFINE_string( "input_file", None, "Input TF example files (can be a glob or comma separated).") flags.DEFINE_string( "output_dir", None, "The output directory where the model checkpoints will be written.") flags.DEFINE_string( "export_dir", None, "The output directory where the saved models will be written.") ## Other parameters flags.DEFINE_string( "init_checkpoint", None, "Initial checkpoint (usually from a pre-trained ALBERT model).") flags.DEFINE_integer( "max_seq_length", 512, "The maximum total input sequence length after WordPiece tokenization. " "Sequences longer than this will be truncated, and sequences shorter " "than this will be padded. Must match data generation.") flags.DEFINE_integer( "max_predictions_per_seq", 20, "Maximum number of masked LM predictions per sequence. " "Must match data generation.") flags.DEFINE_bool("do_train", True, "Whether to run training.") flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") flags.DEFINE_integer("train_batch_size", 4096, "Total batch size for training.") flags.DEFINE_integer("eval_batch_size", 64, "Total batch size for eval.") flags.DEFINE_enum("optimizer", "lamb", ["adamw", "lamb"], "The optimizer for training.") flags.DEFINE_float("learning_rate", 0.00176, "The initial learning rate.") flags.DEFINE_float("poly_power", 1.0, "The power of poly decay.") flags.DEFINE_integer("num_train_steps", 125000, "Number of training steps.") flags.DEFINE_integer("num_warmup_steps", 3125, "Number of warmup steps.") flags.DEFINE_integer("start_warmup_step", 0, "The starting step of warmup.") flags.DEFINE_integer("save_checkpoints_steps", 5000, "How often to save the model checkpoint.") flags.DEFINE_integer("iterations_per_loop", 1000, "How many steps to make in each estimator call.") flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.") flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 
flags.DEFINE_bool("init_from_group0", False, "Whether to initialize " "parameters of other groups from group 0") tf.flags.DEFINE_string( "tpu_name", None, "The Cloud TPU to use for training. This should be either the name " "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " "url.") tf.flags.DEFINE_string( "tpu_zone", None, "[Optional] GCE zone where the Cloud TPU is located in. If not " "specified, we will attempt to automatically detect the GCE zone from " "metadata.") tf.flags.DEFINE_string( "gcp_project", None, "[Optional] Project name for the Cloud TPU-enabled project. If not " "specified, we will attempt to automatically detect the GCE project from " "metadata.") tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") flags.DEFINE_integer( "num_tpu_cores", 8, "Only used if `use_tpu` is True. Total number of TPU cores to use.") flags.DEFINE_float( "masked_lm_budget", 0, "If >0, the ratio of masked ngrams to unmasked ngrams. Default 0, " "for offline masking") def model_fn_builder(albert_config, init_checkpoint, learning_rate, num_train_steps, num_warmup_steps, use_tpu, use_one_hot_embeddings, optimizer, poly_power, start_warmup_step): """Returns `model_fn` closure for TPUEstimator.""" def model_fn(features, labels, mode, params): # pylint: disable=unused-argument """The `model_fn` for TPUEstimator.""" tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] segment_ids = features["segment_ids"] masked_lm_positions = features["masked_lm_positions"] masked_lm_ids = features["masked_lm_ids"] masked_lm_weights = features["masked_lm_weights"] # Note: We keep this feature name `next_sentence_labels` to be compatible # with the original data created by lanzhzh@. However, in the ALBERT case # it does represent sentence_order_labels.
sentence_order_labels = features["next_sentence_labels"] is_training = (mode == tf.estimator.ModeKeys.TRAIN) model = modeling.AlbertModel( config=albert_config, is_training=is_training, input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids, use_one_hot_embeddings=use_one_hot_embeddings) (masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(albert_config, model.get_sequence_output(), model.get_embedding_table(), masked_lm_positions, masked_lm_ids, masked_lm_weights) (sentence_order_loss, sentence_order_example_loss, sentence_order_log_probs) = get_sentence_order_output( albert_config, model.get_pooled_output(), sentence_order_labels) total_loss = masked_lm_loss + sentence_order_loss tvars = tf.trainable_variables() initialized_variable_names = {} scaffold_fn = None if init_checkpoint: tf.logging.info("number of hidden group %d to initialize", albert_config.num_hidden_groups) num_of_initialize_group = 1 if FLAGS.init_from_group0: num_of_initialize_group = albert_config.num_hidden_groups if albert_config.net_structure_type > 0: num_of_initialize_group = albert_config.num_hidden_layers (assignment_map, initialized_variable_names ) = modeling.get_assignment_map_from_checkpoint( tvars, init_checkpoint, num_of_initialize_group) if use_tpu: def tpu_scaffold(): for gid in range(num_of_initialize_group): tf.logging.info("initialize the %dth layer", gid) tf.logging.info(assignment_map[gid]) tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid]) return tf.train.Scaffold() scaffold_fn = tpu_scaffold else: for gid in range(num_of_initialize_group): tf.logging.info("initialize the %dth layer", gid) tf.logging.info(assignment_map[gid]) tf.train.init_from_checkpoint(init_checkpoint, assignment_map[gid]) tf.logging.info("**** Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, 
var.shape, init_string) output_spec = None if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu, optimizer, poly_power, start_warmup_step) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn) elif mode == tf.estimator.ModeKeys.EVAL: def metric_fn(*args): """Computes the loss and accuracy of the model.""" (masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, masked_lm_weights, sentence_order_example_loss, sentence_order_log_probs, sentence_order_labels) = args[:7] masked_lm_log_probs = tf.reshape(masked_lm_log_probs, [-1, masked_lm_log_probs.shape[-1]]) masked_lm_predictions = tf.argmax( masked_lm_log_probs, axis=-1, output_type=tf.int32) masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1]) masked_lm_ids = tf.reshape(masked_lm_ids, [-1]) masked_lm_weights = tf.reshape(masked_lm_weights, [-1]) masked_lm_accuracy = tf.metrics.accuracy( labels=masked_lm_ids, predictions=masked_lm_predictions, weights=masked_lm_weights) masked_lm_mean_loss = tf.metrics.mean( values=masked_lm_example_loss, weights=masked_lm_weights) metrics = { "masked_lm_accuracy": masked_lm_accuracy, "masked_lm_loss": masked_lm_mean_loss, } sentence_order_log_probs = tf.reshape( sentence_order_log_probs, [-1, sentence_order_log_probs.shape[-1]]) sentence_order_predictions = tf.argmax( sentence_order_log_probs, axis=-1, output_type=tf.int32) sentence_order_labels = tf.reshape(sentence_order_labels, [-1]) sentence_order_accuracy = tf.metrics.accuracy( labels=sentence_order_labels, predictions=sentence_order_predictions) sentence_order_mean_loss = tf.metrics.mean( values=sentence_order_example_loss) metrics.update({ "sentence_order_accuracy": sentence_order_accuracy, "sentence_order_loss": sentence_order_mean_loss }) return metrics metric_values = [ masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, 
masked_lm_weights, sentence_order_example_loss, sentence_order_log_probs, sentence_order_labels ] eval_metrics = (metric_fn, metric_values) output_spec = tf.contrib.tpu.TPUEstimatorSpec( mode=mode, loss=total_loss, eval_metrics=eval_metrics, scaffold_fn=scaffold_fn) else: raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode)) return output_spec return model_fn def get_masked_lm_output(albert_config, input_tensor, output_weights, positions, label_ids, label_weights): """Get loss and log probs for the masked LM.""" input_tensor = gather_indexes(input_tensor, positions) with tf.variable_scope("cls/predictions"): # We apply one more non-linear transformation before the output layer. # This matrix is not used after pre-training. with tf.variable_scope("transform"): input_tensor = tf.layers.dense( input_tensor, units=albert_config.embedding_size, activation=modeling.get_activation(albert_config.hidden_act), kernel_initializer=modeling.create_initializer( albert_config.initializer_range)) input_tensor = modeling.layer_norm(input_tensor) # The output weights are the same as the input embeddings, but there is # an output-only bias for each token. output_bias = tf.get_variable( "output_bias", shape=[albert_config.vocab_size], initializer=tf.zeros_initializer()) logits = tf.matmul(input_tensor, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) label_ids = tf.reshape(label_ids, [-1]) label_weights = tf.reshape(label_weights, [-1]) one_hot_labels = tf.one_hot( label_ids, depth=albert_config.vocab_size, dtype=tf.float32) # The `positions` tensor might be zero-padded (if the sequence is too # short to have the maximum number of predictions). The `label_weights` # tensor has a value of 1.0 for every real prediction and 0.0 for the # padding predictions. 
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1]) numerator = tf.reduce_sum(label_weights * per_example_loss) denominator = tf.reduce_sum(label_weights) + 1e-5 loss = numerator / denominator return (loss, per_example_loss, log_probs) def get_sentence_order_output(albert_config, input_tensor, labels): """Get loss and log probs for the sentence order prediction.""" # Simple binary classification. Note that 0 is "next sentence" and 1 is # "random sentence". This weight matrix is not used after pre-training. with tf.variable_scope("cls/seq_relationship"): output_weights = tf.get_variable( "output_weights", shape=[2, albert_config.hidden_size], initializer=modeling.create_initializer( albert_config.initializer_range)) output_bias = tf.get_variable( "output_bias", shape=[2], initializer=tf.zeros_initializer()) logits = tf.matmul(input_tensor, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) log_probs = tf.nn.log_softmax(logits, axis=-1) labels = tf.reshape(labels, [-1]) one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, log_probs) def gather_indexes(sequence_tensor, positions): """Gathers the vectors at the specific positions over a minibatch.""" sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3) batch_size = sequence_shape[0] seq_length = sequence_shape[1] width = sequence_shape[2] flat_offsets = tf.reshape( tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1]) flat_positions = tf.reshape(positions + flat_offsets, [-1]) flat_sequence_tensor = tf.reshape(sequence_tensor, [batch_size * seq_length, width]) output_tensor = tf.gather(flat_sequence_tensor, flat_positions) return output_tensor def input_fn_builder(input_files, max_seq_length, max_predictions_per_seq, is_training, num_cpu_threads=4): """Creates an `input_fn` closure to be 
passed to TPUEstimator.""" def input_fn(params): """The actual input function.""" batch_size = params["batch_size"] name_to_features = { "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64), "input_mask": tf.FixedLenFeature([max_seq_length], tf.int64), "segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64), # Note: We keep this feature name `next_sentence_labels` to be # compatible with the original data created by lanzhzh@. However, in # the ALBERT case it does represent sentence_order_labels. "next_sentence_labels": tf.FixedLenFeature([1], tf.int64), } if FLAGS.masked_lm_budget: name_to_features.update({ "token_boundary": tf.FixedLenFeature([max_seq_length], tf.int64)}) else: name_to_features.update({ "masked_lm_positions": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_ids": tf.FixedLenFeature([max_predictions_per_seq], tf.int64), "masked_lm_weights": tf.FixedLenFeature([max_predictions_per_seq], tf.float32)}) # For training, we want a lot of parallel reading and shuffling. # For eval, we want no shuffling and parallel reading doesn't matter. if is_training: d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files)) d = d.repeat() d = d.shuffle(buffer_size=len(input_files)) # `cycle_length` is the number of parallel files that get read. cycle_length = min(num_cpu_threads, len(input_files)) # `sloppy` mode means that the interleaving is not exact. This adds # even more randomness to the training pipeline. d = d.apply( tf.contrib.data.parallel_interleave( tf.data.TFRecordDataset, sloppy=is_training, cycle_length=cycle_length)) d = d.shuffle(buffer_size=100) else: d = tf.data.TFRecordDataset(input_files) # Since we evaluate for a fixed number of steps we don't want to encounter # out-of-range exceptions. d = d.repeat() # We must `drop_remainder` on training because the TPU requires fixed # size dimensions. 
For eval, we assume we are evaluating on the CPU or GPU
    # and we *don't* want to drop the remainder, otherwise we won't cover
    # every sample. Note that the call below nevertheless passes
    # `drop_remainder=True` in both modes.
    d = d.apply(
        tf.data.experimental.map_and_batch_with_legacy_function(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            num_parallel_batches=num_cpu_threads,
            drop_remainder=True))
    tf.logging.info(d)
    return d

  return input_fn


def _decode_record(record, name_to_features):
  """Decodes a record to a TensorFlow example."""
  example = tf.parse_single_example(record, name_to_features)

  # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
  # So cast all int64 to int32.
  for name in list(example.keys()):
    t = example[name]
    if t.dtype == tf.int64:
      t = tf.to_int32(t)
    example[name] = t

  return example


def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  if not FLAGS.do_train and not FLAGS.do_eval:
    raise ValueError("At least one of `do_train` or `do_eval` must be True.")

  albert_config = modeling.AlbertConfig.from_json_file(FLAGS.albert_config_file)

  tf.gfile.MakeDirs(FLAGS.output_dir)

  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Input Files ***")
  for input_file in input_files:
    tf.logging.info(" %s" % input_file)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.output_dir,
      save_checkpoints_steps=FLAGS.save_checkpoints_steps,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_tpu_cores,
          per_host_input_for_training=is_per_host))

  model_fn = model_fn_builder(
      albert_config=albert_config,
      init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate, num_train_steps=FLAGS.num_train_steps, num_warmup_steps=FLAGS.num_warmup_steps, use_tpu=FLAGS.use_tpu, use_one_hot_embeddings=FLAGS.use_tpu, optimizer=FLAGS.optimizer, poly_power=FLAGS.poly_power, start_warmup_step=FLAGS.start_warmup_step) # If TPU is not available, this will fall back to normal Estimator on CPU # or GPU. estimator = tf.contrib.tpu.TPUEstimator( use_tpu=FLAGS.use_tpu, model_fn=model_fn, config=run_config, train_batch_size=FLAGS.train_batch_size, eval_batch_size=FLAGS.eval_batch_size) if FLAGS.do_train: tf.logging.info("***** Running training *****") tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) train_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=True) estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps) if FLAGS.do_eval: tf.logging.info("***** Running evaluation *****") tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) global_step = -1 output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") writer = tf.gfile.GFile(output_eval_file, "w") tf.gfile.MakeDirs(FLAGS.export_dir) eval_input_fn = input_fn_builder( input_files=input_files, max_seq_length=FLAGS.max_seq_length, max_predictions_per_seq=FLAGS.max_predictions_per_seq, is_training=False) while global_step < FLAGS.num_train_steps: if estimator.latest_checkpoint() is None: tf.logging.info("No checkpoint found yet. 
Sleeping.")
        time.sleep(1)
      else:
        result = estimator.evaluate(
            input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
        global_step = result["global_step"]
        tf.logging.info("***** Eval results *****")
        for key in sorted(result.keys()):
          tf.logging.info(" %s = %s", key, str(result[key]))
          writer.write("%s = %s\n" % (key, str(result[key])))


if __name__ == "__main__":
  flags.mark_flag_as_required("input_file")
  flags.mark_flag_as_required("albert_config_file")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()

================================================
FILE: similarity.py
================================================
"""
Example of text similarity prediction. It can be run directly to make predictions.
Based on the project: https://github.com/chdd/bert-utils
"""

import tensorflow as tf
import args
import tokenization
import modeling
from run_classifier import InputFeatures, InputExample, DataProcessor, create_model, convert_examples_to_features


# os.environ['CUDA_VISIBLE_DEVICES'] = '1'


class SimProcessor(DataProcessor):
    def get_sentence_examples(self, questions):
        examples = []
        for index, data in enumerate(questions):
            guid = 'test-%d' % index
            text_a = tokenization.convert_to_unicode(str(data[0]))
            text_b = tokenization.convert_to_unicode(str(data[1]))
            label = str(0)
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_labels(self):
        return ['0', '1']


"""
Model class, responsible for loading the checkpoint and initializing the model.
"""
class BertSim:
    def __init__(self, batch_size=args.batch_size):
        self.mode = None
        self.max_seq_length = args.max_seq_len
        self.tokenizer = tokenization.FullTokenizer(vocab_file=args.vocab_file, do_lower_case=True)
        self.batch_size = batch_size
        self.estimator = None
        self.processor = SimProcessor()
        tf.logging.set_verbosity(tf.logging.INFO)

    # Load the estimator and build the model
    def start_model(self):
        self.estimator = self.get_estimator()

    def model_fn_builder(self, bert_config, num_labels, init_checkpoint, learning_rate,
                         num_train_steps, num_warmup_steps,
                         use_one_hot_embeddings):
        """Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument from tensorflow.python.estimator.model_fn import EstimatorSpec tf.logging.info("*** Features ***") for name in sorted(features.keys()): tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) input_ids = features["input_ids"] input_mask = features["input_mask"] segment_ids = features["segment_ids"] label_ids = features["label_ids"] is_training = (mode == tf.estimator.ModeKeys.TRAIN) (total_loss, per_example_loss, logits, probabilities) = create_model( bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, num_labels, use_one_hot_embeddings) tvars = tf.trainable_variables() initialized_variable_names = {} if init_checkpoint: (assignment_map, initialized_variable_names) \ = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) tf.train.init_from_checkpoint(init_checkpoint, assignment_map) tf.logging.info("**** Trainable Variables ****") for var in tvars: init_string = "" if var.name in initialized_variable_names: init_string = ", *INIT_FROM_CKPT*" tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, init_string) output_spec = EstimatorSpec(mode=mode, predictions=probabilities) return output_spec return model_fn def get_estimator(self): from tensorflow.python.estimator.estimator import Estimator from tensorflow.python.estimator.run_config import RunConfig bert_config = modeling.BertConfig.from_json_file(args.config_name) label_list = self.processor.get_labels() if self.mode == tf.estimator.ModeKeys.TRAIN: init_checkpoint = args.ckpt_name else: init_checkpoint = args.output_dir model_fn = self.model_fn_builder( bert_config=bert_config, num_labels=len(label_list), init_checkpoint=init_checkpoint, learning_rate=args.learning_rate, num_train_steps=None, num_warmup_steps=None, use_one_hot_embeddings=False) config = tf.ConfigProto() config.gpu_options.allow_growth = True config.gpu_options.per_process_gpu_memory_fraction = 
args.gpu_memory_fraction
        config.log_device_placement = False

        return Estimator(model_fn=model_fn, config=RunConfig(session_config=config), model_dir=args.output_dir,
                         params={'batch_size': self.batch_size})

    def predict_sentences(self, sentences):
        results = self.estimator.predict(input_fn=input_fn_builder(self, sentences), yield_single_examples=False)
        # Print the prediction results
        for i in results:
            print(i)

    def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
        """Truncates a sequence pair in place to the maximum length."""

        # This is a simple heuristic which will always truncate the longer sequence
        # one token at a time. This makes more sense than truncating an equal percent
        # of tokens from each, since if one sequence is very short then each token
        # that's truncated likely contains more information than a longer sequence.
        while True:
            total_length = len(tokens_a) + len(tokens_b)
            if total_length <= max_length:
                break
            if len(tokens_a) > len(tokens_b):
                tokens_a.pop()
            else:
                tokens_b.pop()

    def convert_single_example(self, ex_index, example, label_list, max_seq_length, tokenizer):
        """Converts a single `InputExample` into a single `InputFeatures`."""
        label_map = {}
        for (i, label) in enumerate(label_list):
            label_map[label] = i

        tokens_a = tokenizer.tokenize(example.text_a)
        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is at most the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy .
[SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length: input_ids.append(0) input_mask.append(0) segment_ids.append(0) assert len(input_ids) == max_seq_length assert len(input_mask) == max_seq_length assert len(segment_ids) == max_seq_length label_id = label_map[example.label] if ex_index < 5: tf.logging.info("*** Example ***") tf.logging.info("guid: %s" % (example.guid)) tf.logging.info("tokens: %s" % " ".join( [tokenization.printable_text(x) for x in tokens])) tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) feature = InputFeatures( input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id) return feature def input_fn_builder(bertSim,sentences): def predict_input_fn(): return (tf.data.Dataset.from_generator( generate_from_input, output_types={ 'input_ids': tf.int32, 'input_mask': tf.int32, 'segment_ids': tf.int32, 'label_ids': tf.int32}, output_shapes={ 'input_ids': (None, bertSim.max_seq_length), 'input_mask': (None, bertSim.max_seq_length), 'segment_ids': (None, bertSim.max_seq_length), 'label_ids': (1,)}).prefetch(10)) def generate_from_input(): processor = bertSim.processor predict_examples = processor.get_sentence_examples(sentences) features = convert_examples_to_features(predict_examples, processor.get_labels(), args.max_seq_len, bertSim.tokenizer) yield { 'input_ids': [f.input_ids for f in features], 'input_mask': [f.input_mask for f in features], 'segment_ids': [f.segment_ids for f in features], 'label_ids': [f.label_id for f in features] } return predict_input_fn if __name__ == '__main__': sim = BertSim() sim.start_model() sim.predict_sentences([("我喜欢妈妈做的汤", "妈妈做的汤我很喜欢喝")]) ================================================ FILE: test_changes.py ================================================ # 
coding=utf-8
import tensorflow as tf
from modeling import embedding_lookup_factorized, transformer_model
import os

"""
Tests the main improvements of ALBERT over BERT: factorized embedding
parameterization, cross-layer parameter sharing, and inter-sentence coherence.
"""
batch_size = 2048
sequence_length = 512
vocab_size = 30000
hidden_size = 1024
num_attention_heads = int(hidden_size / 64)


def get_total_parameters():
    """
    Count the total number of trainable parameters in the default graph.
    :return: the parameter count
    """
    total_parameters = 0
    for variable in tf.trainable_variables():
        # shape is an array of tf.Dimension
        shape = variable.get_shape()
        # print(shape)
        # print(len(shape))
        variable_parameters = 1
        for dim in shape:
            # print(dim)
            variable_parameters *= dim.value
        # print(variable_parameters)
        total_parameters += variable_parameters
    return total_parameters


def test_factorized_embedding():
    """
    Test of factorized embedding parameterization.
    :return:
    """
    input_ids = tf.zeros((batch_size, sequence_length), dtype=tf.int32)
    output, embedding_table, embedding_table_2 = embedding_lookup_factorized(input_ids, vocab_size, hidden_size)
    print("output:", output)


def test_share_parameters():
    """
    Test of sharing parameters across all layers: how many parameters remain
    after sharing parameters across the layers of the transformer.
    :return:
    """
    def total_parameters_transformer(share_parameter_across_layers):
        input_tensor = tf.zeros((batch_size, sequence_length, hidden_size), dtype=tf.float32)
        print("transformer_model. 
input:",input_tensor)
        transformer_result = transformer_model(input_tensor, hidden_size=hidden_size,
                                               num_attention_heads=num_attention_heads,
                                               share_parameter_across_layers=share_parameter_across_layers)
        print("transformer_result:", transformer_result)
        total_parameters = get_total_parameters()
        # Report the sharing mode with the count, so the label is correct in both runs.
        print('total_parameters (share=%s):' % share_parameter_across_layers, total_parameters)

    share_parameter_across_layers = False
    total_parameters_transformer(share_parameter_across_layers)  # total parameters, not shared: 125,976,576 (about 126 million)

    tf.reset_default_graph()  # Clears the default graph stack and resets the global default graph

    share_parameter_across_layers = True
    total_parameters_transformer(share_parameter_across_layers)  # total parameters, shared: 10,498,048 (about 10.5 million)


def test_sentence_order_prediction():
    """
    Sentence order prediction. Checks the method create_instances_from_document_albert
    from create_pretraining_data.py.
    :return:
    """
    # Add execute permission
    os.system("chmod +x create_pretrain_data.sh")

    os.system("./create_pretrain_data.sh")


# 1. test of factorized embedding parameterization
#test_factorized_embedding()

# 2. test of sharing parameters across all layers: how many parameters remain after sharing across the transformer layers.
# before sharing: 125,976,576; after sharing: 10,498,048
#test_share_parameters()

# 3. test of sentence order prediction (SOP)
test_sentence_order_prediction()

================================================
FILE: tokenization.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # limitations under the License. """Tokenization classes.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import re import unicodedata import six import tensorflow as tf def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): """Checks whether the casing config is consistent with the checkpoint name.""" # The casing has to be passed in by the user and there is no explicit check # as to whether it matches the checkpoint. The casing information probably # should have been stored in the bert_config.json file, but it's not, so # we have to heuristically detect it to validate. if not init_checkpoint: return m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint) if m is None: return model_name = m.group(1) lower_models = [ "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" ] cased_models = [ "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", "multi_cased_L-12_H-768_A-12" ] is_bad_config = False if model_name in lower_models and not do_lower_case: is_bad_config = True actual_flag = "False" case_name = "lowercased" opposite_flag = "True" if model_name in cased_models and do_lower_case: is_bad_config = True actual_flag = "True" case_name = "cased" opposite_flag = "False" if is_bad_config: raise ValueError( "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " "However, `%s` seems to be a %s model, so you " "should pass in `--do_lower_case=%s` so that the fine-tuning matches " "how the model was pre-training. If this error is wrong, please " "just comment out this check." 
% (actual_flag, init_checkpoint, model_name, case_name, opposite_flag)) def convert_to_unicode(text): """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text.decode("utf-8", "ignore") elif isinstance(text, unicode): return text else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def printable_text(text): """Returns text encoded in a way suitable for print or `tf.logging`.""" # These functions want `str` for both Python2 and Python3, but in one case # it's a Unicode string and in the other it's a byte string. if six.PY3: if isinstance(text, str): return text elif isinstance(text, bytes): return text.decode("utf-8", "ignore") else: raise ValueError("Unsupported string type: %s" % (type(text))) elif six.PY2: if isinstance(text, str): return text elif isinstance(text, unicode): return text.encode("utf-8") else: raise ValueError("Unsupported string type: %s" % (type(text))) else: raise ValueError("Not running on Python2 or Python 3?") def load_vocab(vocab_file): """Loads a vocabulary file into a dictionary.""" vocab = collections.OrderedDict() index = 0 with tf.gfile.GFile(vocab_file, "r") as reader: while True: token = convert_to_unicode(reader.readline()) if not token: break token = token.strip() vocab[token] = index index += 1 return vocab def convert_by_vocab(vocab, items): """Converts a sequence of [tokens|ids] using the vocab.""" output = [] #print("items:",items) #['[CLS]', '日', '##期', ',', '但', '被', '##告', '金', '##东', '##福', '载', '##明', '[MASK]', 'U', '##N', '##K', ']', '保', '##证', '本', '##月', '1', '##4', '[MASK]', '到', '##位', ',', '2', '##0', '##1', '##5', '年', '6', '[MASK]', '1', '##1', '日', '[', 'U', '##N', '##K', ']', ',', 
'原', '##告', '[MASK]', '认', '##可', '于', '2', '##0', '##1', '##5', '[MASK]', '6', '月', '[MASK]', '[MASK]', '日', '##向', '被', '##告', '主', '##张', '权', '##利', '。', '而', '[MASK]', '[MASK]', '自', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '年', '6', '月', '1', '##1', '日', '[SEP]', '原', '##告', '于', '2', '##0', '##1', '##6', '[MASK]', '6', '[MASK]', '2', '##4', '日', '起', '##诉', ',', '主', '##张', '保', '##证', '责', '##任', ',', '已', '超', '##过', '保', '##证', '期', '##限', '[MASK]', '保', '##证', '人', '依', '##法', '不', '##再', '承', '##担', '保', '##证', '[MASK]', '[MASK]', '[MASK]', '[SEP]'] for i,item in enumerate(items): #print(i,"item:",item) # ##期 output.append(vocab[item]) return output def convert_tokens_to_ids(vocab, tokens): return convert_by_vocab(vocab, tokens) def convert_ids_to_tokens(inv_vocab, ids): return convert_by_vocab(inv_vocab, ids) def whitespace_tokenize(text): """Runs basic whitespace cleaning and splitting on a piece of text.""" text = text.strip() if not text: return [] tokens = text.split() return tokens class FullTokenizer(object): """Runs end-to-end tokenziation.""" def __init__(self, vocab_file, do_lower_case=True): self.vocab = load_vocab(vocab_file) self.inv_vocab = {v: k for k, v in self.vocab.items()} self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) def tokenize(self, text): split_tokens = [] for token in self.basic_tokenizer.tokenize(text): for sub_token in self.wordpiece_tokenizer.tokenize(token): split_tokens.append(sub_token) return split_tokens def convert_tokens_to_ids(self, tokens): return convert_by_vocab(self.vocab, tokens) def convert_ids_to_tokens(self, ids): return convert_by_vocab(self.inv_vocab, ids) class BasicTokenizer(object): """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" def __init__(self, do_lower_case=True): """Constructs a BasicTokenizer. Args: do_lower_case: Whether to lower case the input. 
""" self.do_lower_case = do_lower_case def tokenize(self, text): """Tokenizes a piece of text.""" text = convert_to_unicode(text) text = self._clean_text(text) # This was added on November 1st, 2018 for the multilingual and Chinese # models. This is also applied to the English models now, but it doesn't # matter since the English models were not trained on any Chinese data # and generally don't have any Chinese data in them (there are Chinese # characters in the vocabulary because Wikipedia does have some Chinese # words in the English Wikipedia.). text = self._tokenize_chinese_chars(text) orig_tokens = whitespace_tokenize(text) split_tokens = [] for token in orig_tokens: if self.do_lower_case: token = token.lower() token = self._run_strip_accents(token) split_tokens.extend(self._run_split_on_punc(token)) output_tokens = whitespace_tokenize(" ".join(split_tokens)) return output_tokens def _run_strip_accents(self, text): """Strips accents from a piece of text.""" text = unicodedata.normalize("NFD", text) output = [] for char in text: cat = unicodedata.category(char) if cat == "Mn": continue output.append(char) return "".join(output) def _run_split_on_punc(self, text): """Splits punctuation on a piece of text.""" chars = list(text) i = 0 start_new_word = True output = [] while i < len(chars): char = chars[i] if _is_punctuation(char): output.append([char]) start_new_word = True else: if start_new_word: output.append([]) start_new_word = False output[-1].append(char) i += 1 return ["".join(x) for x in output] def _tokenize_chinese_chars(self, text): """Adds whitespace around any CJK character.""" output = [] for char in text: cp = ord(char) if self._is_chinese_char(cp): output.append(" ") output.append(char) output.append(" ") else: output.append(char) return "".join(output) def _is_chinese_char(self, cp): """Checks whether CP is the codepoint of a CJK character.""" # This defines a "chinese character" as anything in the CJK Unicode block: # 
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and handled
    # like all of the other languages.
    if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
        (cp >= 0x3400 and cp <= 0x4DBF) or  #
        (cp >= 0x20000 and cp <= 0x2A6DF) or  #
        (cp >= 0x2A700 and cp <= 0x2B73F) or  #
        (cp >= 0x2B740 and cp <= 0x2B81F) or  #
        (cp >= 0x2B820 and cp <= 0x2CEAF) or
        (cp >= 0xF900 and cp <= 0xFAFF) or  #
        (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
      return True

    return False

  def _clean_text(self, text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
      cp = ord(char)
      if cp == 0 or cp == 0xfffd or _is_control(char):
        continue
      if _is_whitespace(char):
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)


class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word

  def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer`.

    Returns:
      A list of wordpiece tokens.
""" text = convert_to_unicode(text) output_tokens = [] for token in whitespace_tokenize(text): chars = list(token) if len(chars) > self.max_input_chars_per_word: output_tokens.append(self.unk_token) continue is_bad = False start = 0 sub_tokens = [] while start < len(chars): end = len(chars) cur_substr = None while start < end: substr = "".join(chars[start:end]) if start > 0: substr = "##" + substr if substr in self.vocab: cur_substr = substr break end -= 1 if cur_substr is None: is_bad = True break sub_tokens.append(cur_substr) start = end if is_bad: output_tokens.append(self.unk_token) else: output_tokens.extend(sub_tokens) return output_tokens def _is_whitespace(char): """Checks whether `chars` is a whitespace character.""" # \t, \n, and \r are technically contorl characters but we treat them # as whitespace since they are generally considered as such. if char == " " or char == "\t" or char == "\n" or char == "\r": return True cat = unicodedata.category(char) if cat == "Zs": return True return False def _is_control(char): """Checks whether `chars` is a control character.""" # These are technically control characters but we count them as whitespace # characters. if char == "\t" or char == "\n" or char == "\r": return False cat = unicodedata.category(char) if cat in ("Cc", "Cf"): return True return False def _is_punctuation(char): """Checks whether `chars` is a punctuation character.""" cp = ord(char) # We treat all non-letter/number ASCII as punctuation. # Characters such as "^", "$", and "`" are not in the Unicode # Punctuation class but we treat them as punctuation anyways, for # consistency. 
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): return True cat = unicodedata.category(char) if cat.startswith("P"): return True return False ================================================ FILE: tokenization_google.py ================================================ # coding=utf-8 # Copyright 2019 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # Lint as: python2, python3 # coding=utf-8 """Tokenization classes.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import re import unicodedata import six from six.moves import range import tensorflow as tf import sentencepiece as spm SPIECE_UNDERLINE = u"▁".encode("utf-8") def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): """Checks whether the casing config is consistent with the checkpoint name.""" # The casing has to be passed in by the user and there is no explicit check # as to whether it matches the checkpoint. The casing information probably # should have been stored in the bert_config.json file, but it's not, so # we have to heuristically detect it to validate. 
if not init_checkpoint: return m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", six.ensure_str(init_checkpoint)) if m is None: return model_name = m.group(1) lower_models = [ "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" ] cased_models = [ "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", "multi_cased_L-12_H-768_A-12" ] is_bad_config = False if model_name in lower_models and not do_lower_case: is_bad_config = True actual_flag = "False" case_name = "lowercased" opposite_flag = "True" if model_name in cased_models and do_lower_case: is_bad_config = True actual_flag = "True" case_name = "cased" opposite_flag = "False" if is_bad_config: raise ValueError( "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " "However, `%s` seems to be a %s model, so you " "should pass in `--do_lower_case=%s` so that the fine-tuning matches " "how the model was pre-training. If this error is wrong, please " "just comment out this check." 
        % (actual_flag, init_checkpoint, model_name, case_name, opposite_flag))


def preprocess_text(inputs, remove_space=True, lower=False):
  """Preprocesses text by removing extra spaces and normalizing it."""
  outputs = inputs
  if remove_space:
    outputs = " ".join(inputs.strip().split())

  if six.PY2 and isinstance(outputs, str):
    try:
      outputs = six.ensure_text(outputs, "utf-8")
    except UnicodeDecodeError:
      outputs = six.ensure_text(outputs, "latin-1")

  # Decompose characters (NFKD) and drop the combining marks (accents).
  outputs = unicodedata.normalize("NFKD", outputs)
  outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
  if lower:
    outputs = outputs.lower()

  return outputs


def encode_pieces(sp_model, text, return_unicode=True, sample=False):
  """Turns sentences into word pieces."""

  if six.PY2 and isinstance(text, six.text_type):
    text = six.ensure_binary(text, "utf-8")

  if not sample:
    pieces = sp_model.EncodeAsPieces(text)
  else:
    pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
  new_pieces = []
  for piece in pieces:
    piece = printable_text(piece)
    # Re-split pieces such as "2," so the trailing comma is its own piece.
    if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
      cur_pieces = sp_model.EncodeAsPieces(
          six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
      if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
        if len(cur_pieces[0]) == 1:
          cur_pieces = cur_pieces[1:]
        else:
          cur_pieces[0] = cur_pieces[0][1:]
      cur_pieces.append(piece[-1])
      new_pieces.extend(cur_pieces)
    else:
      new_pieces.append(piece)

  # note(zhiliny): convert back to unicode for py2
  if six.PY2 and return_unicode:
    ret_pieces = []
    for piece in new_pieces:
      if isinstance(piece, str):
        piece = six.ensure_text(piece, "utf-8")
      ret_pieces.append(piece)
    new_pieces = ret_pieces

  return new_pieces


def encode_ids(sp_model, text, sample=False):
  """Encodes `text` into SentencePiece ids."""
  pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
  ids = [sp_model.PieceToId(piece) for piece in pieces]
  return ids


def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return six.ensure_text(text, "utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return six.ensure_text(text, "utf-8", "ignore")
    elif isinstance(text, six.text_type):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")


def printable_text(text):
  """Returns text encoded in a way suitable for print or `tf.logging`."""

  # These functions want `str` for both Python2 and Python3, but in one case
  # it's a Unicode string and in the other it's a byte string.
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return six.ensure_text(text, "utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text
    elif isinstance(text, six.text_type):
      return six.ensure_binary(text, "utf-8")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")


def load_vocab(vocab_file):
  """Loads a vocabulary file into a dictionary."""
  vocab = collections.OrderedDict()
  with tf.gfile.GFile(vocab_file, "r") as reader:
    while True:
      token = convert_to_unicode(reader.readline())
      if not token:
        break
      token = token.strip()  # previous: token.strip().split()[0]
      if token not in vocab:
        vocab[token] = len(vocab)
  return vocab


def convert_by_vocab(vocab, items):
  """Converts a sequence of [tokens|ids] using the vocab."""
  output = []
  for item in items:
    output.append(vocab[item])
  return output


def convert_tokens_to_ids(vocab, tokens):
  return convert_by_vocab(vocab, tokens)


def convert_ids_to_tokens(inv_vocab, ids):
  return convert_by_vocab(inv_vocab, ids)


def whitespace_tokenize(text):
  """Runs basic whitespace cleaning and splitting on a piece of text."""
  text = text.strip()
  if not text:
    return []
  tokens = text.split()
  return tokens

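
# Illustrative sketch, not part of the original module: how a vocabulary built
# the way `load_vocab` builds one (ids assigned in order of first appearance)
# round-trips through `convert_by_vocab`-style lookups. The helper names
# `_sketch_build_vocab` and `_sketch_lookup` and the sample tokens below are
# hypothetical and exist only for this example. It assumes the lines come from
# an in-memory list rather than a file, and relies on plain dicts preserving
# insertion order (Python 3.7+).
def _sketch_build_vocab(lines):
  """Assigns ids to distinct stripped lines in order of first appearance."""
  vocab = {}
  for line in lines:
    token = line.strip()
    if token not in vocab:
      vocab[token] = len(vocab)
  return vocab


def _sketch_lookup(mapping, items):
  """Looks up every item; raises KeyError if an item is not in `mapping`."""
  return [mapping[item] for item in items]


# Possible usage (hypothetical tokens):
#   vocab = _sketch_build_vocab(["[PAD]\n", "[UNK]\n", "我\n", "们\n"])
#   ids = _sketch_lookup(vocab, ["我", "们"])        # [2, 3]
#   inv_vocab = {v: k for k, v in vocab.items()}
#   _sketch_lookup(inv_vocab, ids)                   # ["我", "们"]
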

class FullTokenizer(object):
  """Runs end-to-end tokenization."""

  def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
    self.vocab = None
    self.sp_model = None
    print("spm_model_file:", spm_model_file, ";vocab_file:", vocab_file)
    if spm_model_file:
      print("#Use spm_model_file")
      self.sp_model = spm.SentencePieceProcessor()
      tf.logging.info("loading sentence piece model")
      self.sp_model.Load(spm_model_file)
      # Note(mingdachen): For the purpose of consistent API, we are
      # generating a vocabulary for the sentence piece tokenizer.
      self.vocab = {self.sp_model.IdToPiece(i): i for i
                    in range(self.sp_model.GetPieceSize())}
    else:
      print("#Use vocab_file")
      self.vocab = load_vocab(vocab_file)
      self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
      self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}

  def tokenize(self, text):
    if self.sp_model:
      split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
    else:
      split_tokens = []
      for token in self.basic_tokenizer.tokenize(text):
        for sub_token in self.wordpiece_tokenizer.tokenize(token):
          split_tokens.append(sub_token)

    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    if self.sp_model:
      tf.logging.info("using sentence piece tokenizer.")
      return [self.sp_model.PieceToId(
          printable_text(token)) for token in tokens]
    else:
      return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    if self.sp_model:
      tf.logging.info("using sentence piece tokenizer.")
      return [self.sp_model.IdToPiece(id_) for id_ in ids]
    else:
      return convert_by_vocab(self.inv_vocab, ids)


class BasicTokenizer(object):
  """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

  def __init__(self, do_lower_case=True):
    """Constructs a BasicTokenizer.

    Args:
      do_lower_case: Whether to lower case the input.
    """
    self.do_lower_case = do_lower_case

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)

    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    text = self._tokenize_chinese_chars(text)

    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

  def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
      cat = unicodedata.category(char)
      if cat == "Mn":
        continue
      output.append(char)
    return "".join(output)

  def _run_split_on_punc(self, text):
    """Splits punctuation on a piece of text."""
    chars = list(text)
    i = 0
    start_new_word = True
    output = []
    while i < len(chars):
      char = chars[i]
      if _is_punctuation(char):
        output.append([char])
        start_new_word = True
      else:
        if start_new_word:
          output.append([])
        start_new_word = False
        output[-1].append(char)
      i += 1

    return ["".join(x) for x in output]

  def _tokenize_chinese_chars(self, text):
    """Adds whitespace around any CJK character."""
    output = []
    for char in text:
      cp = ord(char)
      if self._is_chinese_char(cp):
        output.append(" ")
        output.append(char)
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)

  def _is_chinese_char(self, cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and are handled
    # like all of the other languages.
    if ((cp >= 0x4E00 and cp <= 0x9FFF) or
        (cp >= 0x3400 and cp <= 0x4DBF) or
        (cp >= 0x20000 and cp <= 0x2A6DF) or
        (cp >= 0x2A700 and cp <= 0x2B73F) or
        (cp >= 0x2B740 and cp <= 0x2B81F) or
        (cp >= 0x2B820 and cp <= 0x2CEAF) or
        (cp >= 0xF900 and cp <= 0xFAFF) or
        (cp >= 0x2F800 and cp <= 0x2FA1F)):
      return True

    return False

  def _clean_text(self, text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
      cp = ord(char)
      if cp == 0 or cp == 0xfffd or _is_control(char):
        continue
      if _is_whitespace(char):
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)


class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word

  def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer`.

    Returns:
      A list of wordpiece tokens.
    """

    text = convert_to_unicode(text)

    output_tokens = []
    for token in whitespace_tokenize(text):
      chars = list(token)
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        # Try the longest remaining substring first, then shrink from the right.
        end = len(chars)
        cur_substr = None
        while start < end:
          substr = "".join(chars[start:end])
          if start > 0:
            substr = "##" + six.ensure_str(substr)
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens


def _is_whitespace(char):
  """Checks whether `char` is a whitespace character."""
  # \t, \n, and \r are technically control characters but we treat them
  # as whitespace since they are generally considered as such.
  if char == " " or char == "\t" or char == "\n" or char == "\r":
    return True
  cat = unicodedata.category(char)
  if cat == "Zs":
    return True
  return False


def _is_control(char):
  """Checks whether `char` is a control character."""
  # These are technically control characters but we count them as whitespace
  # characters.
  if char == "\t" or char == "\n" or char == "\r":
    return False
  cat = unicodedata.category(char)
  if cat in ("Cc", "Cf"):
    return True
  return False


def _is_punctuation(char):
  """Checks whether `char` is a punctuation character."""
  cp = ord(char)
  # We treat all non-letter/number ASCII as punctuation.
  # Characters such as "^", "$", and "`" are not in the Unicode
  # Punctuation class but we treat them as punctuation anyways, for
  # consistency.
  if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
      (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
    return True
  cat = unicodedata.category(char)
  if cat.startswith("P"):
    return True
  return False