The goal is to infer the LaTeX source that can be compiled to such an image:
```
d s _ { 1 1 } ^ { 2 } = d x ^ { + } d x ^ { - } + l _ { p } ^ { 9 } \frac { p _ { - } } { r ^ { 7 } } \delta ( x ^ { - } ) d x ^ { - } d x ^ { - } + d x _ { 1 } ^ { 2 } + \; \cdots \; + d x _ { 9 } ^ { 2 }
```
The paper [[What You Get Is What You See: A Visual Markup Decompiler]](https://arxiv.org/pdf/1609.04938.pdf) provides more technical details of this model.
### Dependencies
* `torchvision`: `conda install torchvision`
* `Pillow`: `pip install Pillow`
### Quick Start
To get started, we provide a toy Math-to-LaTeX example. We assume that the working directory is `OpenNMT-py` throughout this document.
Im2Text consists of four commands:
0) Download the data.
```bash
wget -O data/im2text.tgz http://lstm.seas.harvard.edu/latex/im2text_small.tgz; tar zxf data/im2text.tgz -C data/
```
1) Preprocess the data.
```bash
onmt_preprocess -data_type img \
-src_dir data/im2text/images/ \
-train_src data/im2text/src-train.txt \
-train_tgt data/im2text/tgt-train.txt \
-valid_src data/im2text/src-val.txt \
-valid_tgt data/im2text/tgt-val.txt \
-save_data data/im2text/demo \
-tgt_seq_length 150 \
-tgt_words_min_frequency 2 \
-shard_size 500 \
-image_channel_size 1
```
2) Train the model.
```bash
onmt_train -model_type img \
-data data/im2text/demo \
-save_model demo-model \
-gpu_ranks 0 \
-batch_size 20 \
-max_grad_norm 20 \
-learning_rate 0.1 \
-word_vec_size 80 \
-encoder_type brnn \
-image_channel_size 1
```
3) Translate the images.
```bash
onmt_translate -data_type img \
-model demo-model_acc_x_ppl_x_e13.pt \
-src_dir data/im2text/images \
-src data/im2text/src-test.txt \
-output pred.txt \
-max_length 150 \
-beam_size 5 \
-gpu 0 \
-verbose
```
The above dataset is sampled from the [im2latex-100k-dataset](http://lstm.seas.harvard.edu/latex/im2text.tgz). We provide a trained model [[link]](http://lstm.seas.harvard.edu/latex/py-model.pt) on this dataset.
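
The two filtering flags in the preprocessing step, `-tgt_seq_length 150` and `-tgt_words_min_frequency 2`, can be illustrated with a minimal sketch. It assumes that the first drops examples whose label has more than 150 tokens and that the second keeps only tokens seen at least twice in the vocabulary. The function names are hypothetical, and the toolkit's exact boundary handling and special tokens are not modelled here.

```python
from collections import Counter


def filter_by_length(pairs, max_tgt_len=150):
    """Keep (image, label) pairs whose label has at most max_tgt_len tokens."""
    return [(src, tgt) for src, tgt in pairs if len(tgt.split()) <= max_tgt_len]


def build_vocab(labels, min_frequency=2):
    """Return the sorted tokens that occur at least min_frequency times."""
    counts = Counter(tok for label in labels for tok in label.split())
    return sorted(tok for tok, count in counts.items() if count >= min_frequency)


# Hypothetical toy data: image file names paired with tokenized LaTeX labels.
pairs = [
    ("a.png", "x ^ { 2 }"),
    ("b.png", "x _ { 1 }"),
    ("c.png", " ".join(["x"] * 200)),  # 200 tokens, longer than the limit
]
kept = filter_by_length(pairs)
vocab = build_vocab(label for _, label in kept)
```

Tokens that fall below the frequency threshold (here `^`, `_`, `1` and `2`) would typically be mapped to an unknown-word symbol at training time.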
### Options
* `-src_dir`: The directory containing the images.
* `-train_tgt`: The file storing the tokenized labels, one label per line. It should look like:
```
... ... ...
...
```
* `-train_src`: The file storing the paths of the images (relative to `src_dir`).
```
...
```
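
How the three options fit together can be sketched as follows, assuming that line *i* of the `-train_src` file and line *i* of the `-train_tgt` file describe the same example. The helper `load_im2text_pairs` and the sample file name are hypothetical and only illustrate the layout; they are not part of the toolkit.

```python
import os


def load_im2text_pairs(src_dir, src_lines, tgt_lines):
    """Pair each image path (relative to src_dir) with its tokenized label."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    pairs = []
    for rel_path, label in zip(src_lines, tgt_lines):
        image_path = os.path.join(src_dir, rel_path.strip())
        tokens = label.split()  # labels are whitespace-tokenized
        pairs.append((image_path, tokens))
    return pairs


# In-memory stand-ins for the contents of the two text files.
src_lines = ["example.png\n"]
tgt_lines = ["x ^ { 2 }\n"]
pairs = load_im2text_pairs("data/im2text/images", src_lines, tgt_lines)
```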
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/index.md
================================================
.. toctree::
:maxdepth: 2
index.md
quickstart.md
extended.md
This portal provides detailed documentation of the OpenNMT toolkit. It describes how to use the PyTorch project and how it works.
## Installation
1\. [Install PyTorch](http://pytorch.org/)
2\. Clone the OpenNMT-py repository:
```bash
git clone https://github.com/OpenNMT/OpenNMT-py
cd OpenNMT-py
```
3\. Install required libraries
```bash
pip install -r requirements.txt
```
And you are ready to go! Take a look at the [quickstart](quickstart.md) to familiarize yourself with the main training workflow.
Alternatively you can use Docker to install with `nvidia-docker`. The main Dockerfile is included
in the root directory.
## Citation
When using OpenNMT for research please cite our
[OpenNMT technical report](https://doi.org/10.18653/v1/P17-4012)
```
@inproceedings{opennmt,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
booktitle = {Proc. ACL},
year = {2017},
url = {https://doi.org/10.18653/v1/P17-4012},
doi = {10.18653/v1/P17-4012}
}
```
## Additional resources
You can find additional help or tutorials in the following resources:
* [Gitter channel](https://gitter.im/OpenNMT/openmt-py)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/index.rst
================================================
Contents
--------
.. toctree::
:caption: Getting Started
:maxdepth: 2
main.md
quickstart.md
FAQ.md
CONTRIBUTING.md
ref.rst
.. toctree::
:caption: Examples
:maxdepth: 2
Library.md
extended.md
Summarization.md
im2text.md
speech2text.md
vid2text.rst
.. toctree::
:caption: Scripts
:maxdepth: 2
options/preprocess.rst
options/train.rst
options/translate.rst
options/server.rst
.. toctree::
:caption: API
:maxdepth: 2
onmt.rst
onmt.modules.rst
onmt.translation.rst
onmt.translate.translation_server.rst
onmt.inputters.rst
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/main.md
================================================
# Overview
This portal provides detailed documentation of the OpenNMT toolkit. It describes how to use the PyTorch project and how it works.
## Installation
Install `OpenNMT-py` from `pip`:
```bash
pip install OpenNMT-py
```
or from the sources:
```bash
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python setup.py install
```
*(Optional)* Some advanced features (e.g. working with audio, images, or pretrained models) require extra packages. You can install them with:
```bash
pip install -r requirements.opt.txt
```
And you are ready to go! Take a look at the [quickstart](quickstart) to familiarize yourself with the main training workflow.
Alternatively you can use Docker to install with `nvidia-docker`. The main Dockerfile is included
in the root directory.
## Citation
When using OpenNMT for research please cite our
[OpenNMT technical report](https://doi.org/10.18653/v1/P17-4012)
```
@inproceedings{opennmt,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
booktitle = {Proc. ACL},
year = {2017},
url = {https://doi.org/10.18653/v1/P17-4012},
doi = {10.18653/v1/P17-4012}
}
```
## Additional resources
You can find additional help or tutorials in the following resources:
* [Gitter channel](https://gitter.im/OpenNMT/openmt-py)
* [Forum](http://forum.opennmt.net/)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/modules.rst
================================================
onmt
====
.. toctree::
:maxdepth: 4
onmt
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/onmt.inputters.rst
================================================
Data Loaders
=================
Data Readers
-------------
.. autoexception:: onmt.inputters.datareader_base.MissingDependencyException
.. autoclass:: onmt.inputters.DataReaderBase
:members:
.. autoclass:: onmt.inputters.TextDataReader
:members:
.. autoclass:: onmt.inputters.ImageDataReader
:members:
.. autoclass:: onmt.inputters.AudioDataReader
:members:
Dataset
--------
.. autoclass:: onmt.inputters.Dataset
:members:
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/onmt.modules.rst
================================================
Modules
=============
Core Modules
------------
.. autoclass:: onmt.modules.Embeddings
:members:
Encoders
---------
.. autoclass:: onmt.encoders.EncoderBase
:members:
.. autoclass:: onmt.encoders.MeanEncoder
:members:
.. autoclass:: onmt.encoders.RNNEncoder
:members:
Decoders
---------
.. autoclass:: onmt.decoders.DecoderBase
:members:
.. autoclass:: onmt.decoders.decoder.RNNDecoderBase
:members:
.. autoclass:: onmt.decoders.StdRNNDecoder
:members:
.. autoclass:: onmt.decoders.InputFeedRNNDecoder
:members:
Attention
----------
.. autoclass:: onmt.modules.AverageAttention
:members:
.. autoclass:: onmt.modules.GlobalAttention
:members:
Architecture: Transformer
----------------------------
.. autoclass:: onmt.modules.PositionalEncoding
:members:
.. autoclass:: onmt.modules.position_ffn.PositionwiseFeedForward
:members:
.. autoclass:: onmt.encoders.TransformerEncoder
:members:
.. autoclass:: onmt.decoders.TransformerDecoder
:members:
.. autoclass:: onmt.modules.MultiHeadedAttention
:members:
:undoc-members:
Architecture: Conv2Conv
----------------------------
(These methods are from a user contribution
and have not been thoroughly tested.)
.. autoclass:: onmt.encoders.CNNEncoder
:members:
.. autoclass:: onmt.decoders.CNNDecoder
:members:
.. autoclass:: onmt.modules.ConvMultiStepAttention
:members:
.. autoclass:: onmt.modules.WeightNormConv2d
:members:
Architecture: SRU
----------------------------
.. autoclass:: onmt.models.sru.SRU
:members:
Alternative Encoders
--------------------
onmt\.modules\.AudioEncoder
.. autoclass:: onmt.encoders.AudioEncoder
:members:
onmt\.modules\.ImageEncoder
.. autoclass:: onmt.encoders.ImageEncoder
:members:
Copy Attention
--------------
.. autoclass:: onmt.modules.CopyGenerator
:members:
Structured Attention
-------------------------------------------
.. autoclass:: onmt.modules.structured_attention.MatrixTree
:members:
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/onmt.rst
================================================
Framework
=================
Model
-----
.. autoclass:: onmt.models.NMTModel
:members:
Trainer
-------
.. autoclass:: onmt.Trainer
:members:
.. autoclass:: onmt.utils.Statistics
:members:
Loss
----
.. autoclass:: onmt.utils.loss.LossComputeBase
:members:
Optimizer
---------
.. autoclass:: onmt.utils.Optimizer
:members:
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/onmt.translate.translation_server.rst
================================================
Server
======
Models
-------------
.. autoclass:: onmt.translate.translation_server.ServerModel
:members:
Core Server
------------
.. autoexception:: onmt.translate.translation_server.ServerModelError
.. autoclass:: onmt.translate.translation_server.Timer
:members:
.. autoclass:: onmt.translate.translation_server.TranslationServer
:members:
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/onmt.translation.rst
================================================
Translation
==================
Translations
-------------
.. autoclass:: onmt.translate.Translation
:members:
Translator Class
-----------------
.. autoclass:: onmt.translate.Translator
:members:
.. autoclass:: onmt.translate.TranslationBuilder
:members:
Decoding Strategies
--------------------
.. autoclass:: onmt.translate.DecodeStrategy
:members:
.. autoclass:: onmt.translate.BeamSearch
:members:
.. autofunction:: onmt.translate.greedy_search.sample_with_temperature
.. autoclass:: onmt.translate.GreedySearch
:members:
Scoring
--------
.. autoclass:: onmt.translate.penalties.PenaltyBuilder
:members:
.. autoclass:: onmt.translate.GNMTGlobalScorer
:members:
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/options/preprocess.rst
================================================
Preprocess
==========
.. argparse::
:filename: ../onmt/bin/preprocess.py
:func: _get_parser
:prog: preprocess.py
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/options/server.rst
================================================
Server
=========
.. argparse::
:filename: ../onmt/bin/server.py
:func: _get_parser
:prog: server.py
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/options/train.rst
================================================
Train
=====
.. argparse::
:filename: ../onmt/bin/train.py
:func: _get_parser
:prog: train.py
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/options/translate.rst
================================================
Translate
=========
.. argparse::
:filename: ../onmt/bin/translate.py
:func: _get_parser
:prog: translate.py
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/quickstart.md
================================================
# Quickstart
### Step 1: Preprocess the data
```bash
onmt_preprocess -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
```
We will be working with some example data in the `data/` folder.
The data consists of parallel source (`src`) and target (`tgt`) data containing one sentence per line with tokens separated by a space:
* `src-train.txt`
* `tgt-train.txt`
* `src-val.txt`
* `tgt-val.txt`
Validation files are required and are used to evaluate the convergence of training. They usually contain no more than 5000 sentences.
```text
$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .
```
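
If you prepare your own corpus, the train/validation split described above can be sketched like this. The helper name, the 10% hold-out share and the fixed seed are assumptions made for illustration. Only the cap of 5000 validation sentences comes from the guideline above.

```python
import random


def split_train_valid(src_lines, tgt_lines, max_valid=5000, valid_percent=10, seed=0):
    """Hold out a small validation set while keeping sentence pairs aligned."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # shuffle pairs, never the two sides separately
    n_valid = min(max_valid, len(pairs) * valid_percent // 100)
    return pairs[n_valid:], pairs[:n_valid]


# Toy parallel corpus: one tokenized sentence per line on each side.
src = [f"source sentence {i} ." for i in range(100)]
tgt = [f"target sentence {i} ." for i in range(100)]
train, valid = split_train_valid(src, tgt)
```

Shuffling the zipped pairs, rather than each file on its own, is what keeps line *i* of the source aligned with line *i* of the target.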
### Step 2: Train the model
```bash
onmt_train -data data/demo -save_model demo-model
```
The main train command is quite simple. Minimally it takes a data file
and a save file. This will run the default model, which consists of a
2-layer LSTM with 500 hidden units on both the encoder and the decoder.
To train on GPU, set the visible devices and the GPU ranks. For example, set the environment variable `CUDA_VISIBLE_DEVICES=1,3` and pass
`-world_size 2 -gpu_ranks 0 1` to use (say) GPUs 1 and 3 on this node only.
To know more about distributed training on single or multi nodes, read the FAQ section.
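
The relationship between `CUDA_VISIBLE_DEVICES` and `-gpu_ranks` can be sketched as below. This assumes a single node on which rank *i* uses the *i*-th visible device, relying on the fact that CUDA renumbers visible devices from 0. The helper name is hypothetical.

```python
def rank_to_physical_gpu(cuda_visible_devices, gpu_ranks):
    """Map each GPU rank to the physical device it would use on a single node."""
    visible = [int(dev) for dev in cuda_visible_devices.split(",") if dev.strip()]
    mapping = {}
    for rank in gpu_ranks:
        if rank < 0 or rank >= len(visible):
            raise ValueError(f"rank {rank} has no visible device")
        mapping[rank] = visible[rank]
    return mapping


# CUDA_VISIBLE_DEVICES=1,3 together with -gpu_ranks 0 1
mapping = rank_to_physical_gpu("1,3", [0, 1])
```

Here rank 0 runs on physical GPU 1 and rank 1 on physical GPU 3.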
### Step 3: Translate
```bash
onmt_translate -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
```
Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into `pred.txt`.
Note:
The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for [translation](http://www.statmt.org/wmt16/translation-task.html) or [summarization](https://github.com/harvardnlp/sent-summary).
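
To get a rough sense of output quality, you can compare the predictions with reference translations line by line. The sketch below computes an exact-match rate over whitespace tokens. It assumes you have a reference file aligned with `pred.txt`. The function name is hypothetical, and a proper evaluation would normally use a metric such as BLEU instead.

```python
def exact_match_rate(predictions, references):
    """Fraction of lines whose tokens match the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same number of lines")
    if not predictions:
        return 0.0
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return matches / len(predictions)


# In-memory stand-ins for the lines of pred.txt and of a reference file.
preds = ["das ist ein Test .", "hallo Welt"]
refs = ["das ist ein Test .", "hallo , Welt"]
rate = exact_match_rate(preds, refs)
```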
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/ref.rst
================================================
==========
References
==========
References
.. bibliography:: refs.bib
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/refs.bib
================================================
@article{DBLP:journals/corr/LiuL17d,
author = {Yang Liu and
Mirella Lapata},
title = {Learning Structured Text Representations},
journal = {CoRR},
volume = {abs/1705.09207},
year = {2017},
url = {http://arxiv.org/abs/1705.09207},
archivePrefix = {arXiv},
eprint = {1705.09207},
timestamp = {Wed, 07 Jun 2017 14:41:46 +0200},
biburl = {http://dblp.org/rec/bib/journals/corr/LiuL17d},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{sennrich2016linguistic,
title={Linguistic Input Features Improve Neural Machine Translation},
author={Sennrich, Rico and Haddow, Barry},
journal={arXiv preprint arXiv:1606.02892},
year={2016}
}
@inproceedings{Bahdanau2015,
archivePrefix = {arXiv},
arxivId = {1409.0473},
author = {Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua},
booktitle = {ICLR},
eprint = {1409.0473},
pages = {1--15},
title = {{Neural Machine Translation By Jointly Learning To Align and Translate}},
url = {http://arxiv.org/abs/1409.0473},
year = {2014}
}
@inproceedings{sutskever14sequence,
abstract = {Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.},
archivePrefix = {arXiv},
arxivId = {1409.3215},
author = {Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V.},
booktitle = {NIPS},
eprint = {1409.3215},
pages = {9},
title = {{Sequence to Sequence Learning with Neural Networks}},
url = {http://arxiv.org/abs/1409.3215},
year = {2014}
}
@article{Xu2015,
abstract = {Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.},
archivePrefix = {arXiv},
arxivId = {1502.03044},
author = {Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhutdinov, Ruslan and Zemel, Richard and Bengio, Yoshua},
eprint = {1502.03044},
journal = {ICML},
month = {feb},
title = {{Show, Attend and Tell: Neural Image Caption Generation with Visual Attention}},
url = {http://arxiv.org/abs/1502.03044},
year = {2015}
}
@article{systran,
title={SYSTRAN's Pure Neural Machine Translation System},
author={Josep Crego and Jungi Kim and Jean Senellart},
journal={arXiv preprint arXiv:1610.05540},
year={2016}
}
@InProceedings{Cho2014,
title = {{L}earning {P}hrase {R}epresentations using {RNN} {E}ncoder-{D}ecoder for {S}tatistical {M}achine {T}ranslation},
author = {Kyunghyun Cho and Bart van Merrienboer and Caglar Gulcehre and Dzmitry Bahdanau and Fethi Bougares and Holger Schwenk and Yoshua Bengio},
booktitle = {Proc of EMNLP},
year = {2014}
}
@InProceedings{Luong2015,
title = {{E}ffective {A}pproaches to {A}ttention-based {N}eural {M}achine {T}ranslation},
author = {Minh-Thang Luong and Hieu Pham and Christopher D. Manning},
booktitle = {Proc of EMNLP},
year = {2015}
}
@InProceedings{Luong2015b,
title = {{A}ddressing the {R}are {W}ord {P}roblem in {N}eural {M}achine {T}ranslation},
author = {Minh-Thang Luong and Ilya Sutskever and Quoc Le and Oriol Vinyals and Wojciech Zaremba},
booktitle = {Proc of ACL},
year = {2015}
}
@article{wu2016google,
title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
author={Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Wolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others},
journal={arXiv preprint arXiv:1609.08144},
year={2016}
}
@inproceedings{dean2012large,
title={Large scale distributed deep networks},
author={Dean, Jeffrey and Corrado, Greg and Monga, Rajat and Chen, Kai and Devin, Matthieu and Mao, Mark and Senior, Andrew and Tucker, Paul and Yang, Ke and Le, Quoc V and others},
booktitle={Advances in neural information processing systems},
pages={1223--1231},
year={2012}
}
@inproceedings{koehn2007moses,
title={Moses: Open source toolkit for statistical machine translation},
author={Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and others},
booktitle={Proc ACL},
pages={177--180},
year={2007},
organization={Association for Computational Linguistics}
}
@inproceedings{dyer2010cdec,
title={cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models},
author={Dyer, Chris and Weese, Jonathan and Setiawan, Hendra and Lopez, Adam and Ture, Ferhan and Eidelman, Vladimir and Ganitkevitch, Juri and Blunsom, Phil and Resnik, Philip},
booktitle={Proc ACL},
pages={7--12},
year={2010},
organization={Association for Computational Linguistics}
}
@article{hochreiter1997long,
title={Long short-term memory},
author={Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
journal={Neural computation},
volume={9},
number={8},
pages={1735--1780},
year={1997},
publisher={MIT Press}
}
@article{chung2014empirical,
title={Empirical evaluation of gated recurrent neural networks on sequence modeling},
author={Chung, Junyoung and Gulcehre, Caglar and Cho, KyungHyun and Bengio, Yoshua},
journal={arXiv preprint arXiv:1412.3555},
year={2014}
}
@inproceedings{yang2016hierarchical,
title={Hierarchical attention networks for document classification},
author={Yang, Zichao and Yang, Diyi and Dyer, Chris and He, Xiaodong and Smola, Alex and Hovy, Eduard},
booktitle={Proc ACL},
year={2016}
}
@article{martins2016softmax,
title={From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification},
author={Martins, Andr{\'e} FT and Astudillo, Ram{\'o}n Fernandez},
journal={arXiv preprint arXiv:1602.02068},
year={2016}
}
@article{DBLP:journals/corr/LeonardWW15,
author = {Nicholas L{\'{e}}onard and
Sagar Waghmare and
Yang Wang and
Jin{-}Hwa Kim},
title = {rnn : Recurrent Library for Torch},
journal = {CoRR},
volume = {abs/1511.07889},
year = {2015},
url = {http://arxiv.org/abs/1511.07889},
timestamp = {Wed, 23 Dec 2015 08:46:28 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/LeonardWW15},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@inproceedings{DBLP:conf/conll/BowmanVVDJB16,
author = {Samuel R. Bowman and
Luke Vilnis and
Oriol Vinyals and
Andrew M. Dai and
Rafal J{\'{o}}zefowicz and
Samy Bengio},
title = {Generating Sentences from a Continuous Space},
booktitle = {Proceedings of the 20th {SIGNLL} Conference on Computational Natural
Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016},
pages = {10--21},
year = {2016},
crossref = {DBLP:conf/conll/2016},
url = {http://aclweb.org/anthology/K/K16/K16-1002.pdf},
timestamp = {Sun, 04 Sep 2016 10:01:12 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/conf/conll/BowmanVVDJB16},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@inproceedings{DBLP:conf/nips/VinyalsBLKW16,
author = {Oriol Vinyals and
Charles Blundell and
Tim Lillicrap and
Koray Kavukcuoglu and
Daan Wierstra},
title = {Matching Networks for One Shot Learning},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain},
pages = {3630--3638},
year = {2016},
crossref = {DBLP:conf/nips/2016},
url = {http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning},
timestamp = {Fri, 16 Dec 2016 19:45:58 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/conf/nips/VinyalsBLKW16},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{DBLP:journals/corr/WestonCB14,
author = {Jason Weston and
Sumit Chopra and
Antoine Bordes},
title = {Memory Networks},
journal = {CoRR},
volume = {abs/1410.3916},
year = {2014},
url = {http://arxiv.org/abs/1410.3916},
timestamp = {Sun, 02 Nov 2014 11:25:59 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/WestonCB14},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{DBLP:journals/corr/XuBKCCSZB15,
author = {Kelvin Xu and
Jimmy Ba and
Ryan Kiros and
Kyunghyun Cho and
Aaron C. Courville and
Ruslan Salakhutdinov and
Richard S. Zemel and
Yoshua Bengio},
title = {Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention},
journal = {CoRR},
volume = {abs/1502.03044},
year = {2015},
url = {http://arxiv.org/abs/1502.03044},
timestamp = {Mon, 02 Mar 2015 14:17:34 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/XuBKCCSZB15},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{DBLP:journals/corr/DengKR16,
author = {Yuntian Deng and
Anssi Kanervisto and
Alexander M. Rush},
title = {What You Get Is What You See: {A} Visual Markup Decompiler},
journal = {CoRR},
volume = {abs/1609.04938},
year = {2016},
url = {http://arxiv.org/abs/1609.04938},
timestamp = {Mon, 03 Oct 2016 17:51:10 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/DengKR16},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{DBLP:journals/corr/ChanJLV15,
author = {William Chan and
Navdeep Jaitly and
Quoc V. Le and
Oriol Vinyals},
title = {Listen, Attend and Spell},
journal = {CoRR},
volume = {abs/1508.01211},
year = {2015},
url = {http://arxiv.org/abs/1508.01211},
timestamp = {Tue, 01 Sep 2015 14:42:40 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/ChanJLV15},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{DBLP:journals/corr/SennrichHB15,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Neural Machine Translation of Rare Words with Subword Units},
journal = {CoRR},
volume = {abs/1508.07909},
year = {2015},
url = {http://arxiv.org/abs/1508.07909},
timestamp = {Tue, 01 Sep 2015 14:42:40 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/SennrichHB15},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@article{chopra2016abstractive,
title={Abstractive sentence summarization with attentive recurrent neural networks},
author={Chopra, Sumit and Auli, Michael and Rush, Alexander M and Harvard, SEAS},
journal={Proceedings of NAACL-HLT16},
pages={93--98},
year={2016}
}
@article{vinyals2015neural,
title={A neural conversational model},
author={Vinyals, Oriol and Le, Quoc},
journal={arXiv preprint arXiv:1506.05869},
year={2015}
}
@inproceedings{neubig13travatar,
title = {Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers},
author = {Graham Neubig},
booktitle = {Proc ACL },
address = {Sofia, Bulgaria},
month = {August},
year = {2013}
}
@ARTICLE{2017arXiv170301619N,
author = {{Neubig}, G.},
title = "{Neural Machine Translation and Sequence-to-sequence Models: A Tutorial}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1703.01619},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Learning, Statistics - Machine Learning},
year = 2017,
month = mar,
adsurl = {http://adsabs.harvard.edu/abs/2017arXiv170301619N},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
@article{DBLP:journals/corr/VaswaniSPUJGKP17,
author = {Ashish Vaswani and
Noam Shazeer and
Niki Parmar and
Jakob Uszkoreit and
Llion Jones and
Aidan N. Gomez and
Lukasz Kaiser and
Illia Polosukhin},
title = {Attention Is All You Need},
journal = {CoRR},
volume = {abs/1706.03762},
year = {2017},
url = {http://arxiv.org/abs/1706.03762},
archivePrefix = {arXiv},
eprint = {1706.03762},
timestamp = {Mon, 13 Aug 2018 16:48:37 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/VaswaniSPUJGKP17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/GehringAGYD17,
author = {Jonas Gehring and
Michael Auli and
David Grangier and
Denis Yarats and
Yann N. Dauphin},
title = {Convolutional Sequence to Sequence Learning},
journal = {CoRR},
volume = {abs/1705.03122},
year = {2017},
url = {http://arxiv.org/abs/1705.03122},
archivePrefix = {arXiv},
eprint = {1705.03122},
timestamp = {Mon, 13 Aug 2018 16:48:03 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/GehringAGYD17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1709-02755,
author = {Tao Lei and
Yu Zhang and
Yoav Artzi},
title = {Training RNNs as Fast as CNNs},
journal = {CoRR},
volume = {abs/1709.02755},
year = {2017},
url = {http://arxiv.org/abs/1709.02755},
archivePrefix = {arXiv},
eprint = {1709.02755},
timestamp = {Mon, 13 Aug 2018 16:46:29 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1709-02755},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/SeeLM17,
author = {Abigail See and
Peter J. Liu and
Christopher D. Manning},
title = {Get To The Point: Summarization with Pointer-Generator Networks},
journal = {CoRR},
volume = {abs/1704.04368},
year = {2017},
url = {http://arxiv.org/abs/1704.04368},
archivePrefix = {arXiv},
eprint = {1704.04368},
timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1805-00631,
author = {Biao Zhang and
Deyi Xiong and
Jinsong Su},
title = {Accelerating Neural Transformer via an Average Attention Network},
journal = {CoRR},
volume = {abs/1805.00631},
year = {2018},
url = {http://arxiv.org/abs/1805.00631},
archivePrefix = {arXiv},
eprint = {1805.00631},
timestamp = {Mon, 13 Aug 2018 16:46:01 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1805-00631},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/MartinsA16,
author = {Andr{\'{e}} F. T. Martins and
Ram{\'{o}}n Fern{\'{a}}ndez Astudillo},
title = {From Softmax to Sparsemax: {A} Sparse Model of Attention and Multi-Label
Classification},
journal = {CoRR},
volume = {abs/1602.02068},
year = {2016},
url = {http://arxiv.org/abs/1602.02068},
archivePrefix = {arXiv},
eprint = {1602.02068},
timestamp = {Mon, 13 Aug 2018 16:49:13 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/MartinsA16},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{garg2019jointly,
title = {Jointly Learning to Align and Translate with Transformer Models},
author = {Garg, Sarthak and Peitz, Stephan and Nallasamy, Udhyakumar and Paulik, Matthias},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
address = {Hong Kong},
month = {November},
url = {https://arxiv.org/abs/1909.02074},
year = {2019},
}
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/speech2text.md
================================================
# Speech to Text
A deep learning-based approach to speech-to-text conversion, built on top of the OpenNMT system.
Given raw audio, we first apply a short-time Fourier transform (STFT), then apply convolutional neural networks to obtain the source features. Based on this source representation, we use an LSTM decoder with attention to produce the text character by character.
### Dependencies
* `torchaudio`: `sudo apt-get install -y sox libsox-dev libsox-fmt-all; pip install git+https://github.com/pytorch/audio`
* `librosa`: `pip install librosa`
### Quick Start
To get started, we provide a toy speech-to-text example. We assume that the working directory is `OpenNMT-py` throughout this document.
0) Download the data.
```bash
wget -O data/speech.tgz http://lstm.seas.harvard.edu/latex/speech.tgz; tar zxf data/speech.tgz -C data/
```
1) Preprocess the data.
```bash
onmt_preprocess -data_type audio \
-src_dir data/speech/an4_dataset \
-train_src data/speech/src-train.txt \
-train_tgt data/speech/tgt-train.txt \
-valid_src data/speech/src-val.txt \
-valid_tgt data/speech/tgt-val.txt \
-shard_size 300 \
-save_data data/speech/demo
```
2) Train the model.
```bash
onmt_train -model_type audio \
-enc_rnn_size 512 \
-dec_rnn_size 512 \
-audio_enc_pooling 1,1,2,2 \
-dropout 0 \
-enc_layers 4 \
-dec_layers 1 \
-rnn_type LSTM \
-data data/speech/demo \
-save_model demo-model \
-global_attention mlp \
-gpu_ranks 0 \
-batch_size 8 \
-optim adam \
-max_grad_norm 100 \
-learning_rate 0.0003 \
-learning_rate_decay 0.8 \
-train_steps 100000
```
3) Translate the speech.
```bash
onmt_translate -data_type audio \
-model demo-model_acc_x_ppl_x_e13.pt \
-src_dir data/speech/an4_dataset \
-src data/speech/src-val.txt \
-output pred.txt \
-gpu 0 \
-verbose
```
### Options
* `-src_dir`: The directory containing the audio files.
* `-train_tgt`: The file storing the tokenized labels, one label per line. It should look like:
```
... ... ...
...
```
* `-train_src`: The file storing the paths of the audio files (relative to `src_dir`).
```
...
```
* `-sample_rate`: Sample rate. Default: 16000.
* `-window_size`: Window size for spectrogram in seconds. Default: 0.02.
* `-window_stride`: Window stride for spectrogram in seconds. Default: 0.01.
* `-window`: Window type for spectrogram generation. Default: hamming.
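The window options above are given in seconds, while an STFT is parameterized in samples. The sketch below shows one plausible conversion. The helper names `stft_frame_params` and `num_frames` are hypothetical, and the frame count assumes centered framing (the default behavior of `librosa.stft`); neither is confirmed against the OpenNMT-py source.

```python
def stft_frame_params(sample_rate=16000, window_size=0.02, window_stride=0.01):
    """Convert window settings in seconds into sample counts (illustrative helper)."""
    n_fft = int(round(sample_rate * window_size))         # samples per window
    hop_length = int(round(sample_rate * window_stride))  # samples between windows
    return n_fft, hop_length


def num_frames(num_samples, hop_length):
    """Frame count for a centered STFT (assumption: padding on both ends)."""
    return 1 + num_samples // hop_length


n_fft, hop_length = stft_frame_params()
print(n_fft, hop_length)              # window and stride in samples
print(1 + n_fft // 2)                 # number of frequency bins per frame
print(num_frames(16000, hop_length))  # frames for one second of 16 kHz audio
```

With the defaults, a 20 ms window at 16 kHz spans 320 samples and consecutive windows start 160 samples apart.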
### Acknowledgement
Our preprocessing and CNN encoder are adapted from [deepspeech.pytorch](https://github.com/SeanNaren/deepspeech.pytorch).
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/docs/source/vid2text.rst
================================================
Video to Text
=============
Recurrent
---------
This tutorial shows how to replicate the results from
`"Describing Videos by Exploiting Temporal Structure" `_
[`code `_]
using OpenNMT-py.
Get `YouTubeClips.tar` from `here `_.
Use ``tar -xvf YouTubeClips.tar`` to decompress the archive.
Now, visit `this repo `_.
Follow the "preprocessed YouTube2Text download link."
We'll be throwing away the Googlenet features. We just need the captions.
Use ``unzip youtube2text_iccv15.zip`` to decompress the files.
Get to the following directory structure: ::

    yt2t
    |-YouTubeClips
    |-youtube2text_iccv15

Change directories to `yt2t`. We'll rename the videos to follow the "vid#.avi" format:
.. code-block:: python

    import pickle
    import os

    YT = "youtube2text_iccv15"
    YTC = "YouTubeClips"

    # load the YouTube hash -> vid### map.
    with open(os.path.join(YT, "dict_youtube_mapping.pkl"), "rb") as f:
        yt2vid = pickle.load(f, encoding="latin-1")

    for f in os.listdir(YTC):
        hashy, ext = os.path.splitext(f)
        vid = yt2vid[hashy]
        fpath_old = os.path.join(YTC, f)
        f_new = vid + ext
        fpath_new = os.path.join(YTC, f_new)
        os.rename(fpath_old, fpath_new)

Make sure all the videos have the same (low) framerate by changing to the YouTubeClips directory and running

.. code-block:: bash

    for fi in $( ls ); do ffmpeg -y -i $fi -r 2 $fi; done

Now we want to convert the frames into sequences of CNN feature vectors.
(We'll use the environment variable ``YT2T`` to refer to the `yt2t` directory, so change directories back and run)

.. code-block:: bash

    export YT2T=`pwd`

Then change directories back to the `OpenNMT-py` directory.
Use `tools/vid_feature_extractor.py`.
Set the ``--world_size`` argument to the number of GPUs you have available
(you can use the environment variable ``CUDA_VISIBLE_DEVICES`` to restrict the GPUs used).

.. code-block:: bash

    PYTHONPATH=$PWD:$PYTHONPATH python tools/vid_feature_extractor.py --root_dir $YT2T/YouTubeClips --out_dir $YT2T/r152

Ensure the count is equal to 1970.
You can use ``ls -1 $YT2T/r152 | wc -l``.
If not, rerun the script. It will only process the missing feature vectors.
(Note that a mismatch is unexpected behavior; consider opening an issue.)
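If the count is off, it can help to know which videos lack features. The following is a minimal sketch, assuming the extractor writes one ``<video name>.npy`` file per video (consistent with the ``.npy`` paths used later in this tutorial). The helper ``missing_features`` is hypothetical and not part of OpenNMT-py; the demo runs on temporary directories rather than the real data.

```python
import os
import tempfile


def missing_features(video_dir, feature_dir, feature_ext=".npy"):
    """Return the video names (without extension) that have no feature file."""
    video_stems = {os.path.splitext(name)[0] for name in os.listdir(video_dir)}
    feature_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(feature_dir)
        if name.endswith(feature_ext)
    }
    return sorted(video_stems - feature_stems)


# Self-contained demo on temporary directories (stand-ins for the real folders).
with tempfile.TemporaryDirectory() as root:
    vids = os.path.join(root, "YouTubeClips")
    feats = os.path.join(root, "r152")
    os.makedirs(vids)
    os.makedirs(feats)
    for name in ("vid1.avi", "vid2.avi", "vid3.avi"):
        open(os.path.join(vids, name), "w").close()
    open(os.path.join(feats, "vid1.npy"), "w").close()
    demo_missing = missing_features(vids, feats)
    print(demo_missing)
```

On the real data you would call it with ``$YT2T/YouTubeClips`` and ``$YT2T/r152`` and expect an empty list once all 1970 videos are processed.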
Now we turn our attention to the annotations. Each video has multiple associated captions. We want to
train the model on each video + single caption pair. We'll collect all the captions per video, then we'll
flatten them into files listing the feature vector sequence filenames (repeating for each caption) and the
annotations. We skip the test videos since they are handled separately at translation time.
Change directories back to ``YT2T``:
.. code-block:: bash

    cd $YT2T

.. code-block:: python

    import pickle
    import os
    from random import shuffle

    YT = "youtube2text_iccv15"
    SHUFFLE = True

    with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
        ann = pickle.load(f, encoding="latin-1")

    # collect all tokenized captions per video
    vid2anns = {}
    for vid_name, data in ann.items():
        for d in data:
            try:
                vid2anns[vid_name].append(d["tokenized"])
            except KeyError:
                vid2anns[vid_name] = [d["tokenized"]]

    with open(os.path.join(YT, "train.pkl"), "rb") as f:
        train = pickle.load(f, encoding="latin-1")

    with open(os.path.join(YT, "valid.pkl"), "rb") as f:
        val = pickle.load(f, encoding="latin-1")

    with open(os.path.join(YT, "test.pkl"), "rb") as f:
        test = pickle.load(f, encoding="latin-1")

    train_files = open("yt2t_train_files.txt", "w")
    val_files = open("yt2t_val_files.txt", "w")
    val_folded = open("yt2t_val_folded_files.txt", "w")
    test_files = open("yt2t_test_files.txt", "w")

    train_cap = open("yt2t_train_cap.txt", "w")
    val_cap = open("yt2t_val_cap.txt", "w")

    vid_names = vid2anns.keys()
    if SHUFFLE:
        vid_names = list(vid_names)
        shuffle(vid_names)

    for vid_name in vid_names:
        anns = vid2anns[vid_name]
        vid_path = vid_name + ".npy"
        for i, an in enumerate(anns):
            an = an.replace("\n", " ")  # some caps have newlines
            split_name = vid_name + "_" + str(i)
            if split_name in train:
                train_files.write(vid_path + "\n")
                train_cap.write(an + "\n")
            elif split_name in val:
                if i == 0:
                    val_folded.write(vid_path + "\n")
                val_files.write(vid_path + "\n")
                val_cap.write(an + "\n")
            else:
                # Don't need to save out the test captions,
                # just the files. And, don't need to repeat
                # it for each caption
                assert split_name in test
                if i == 0:
                    test_files.write(vid_path + "\n")

    # close the output files so everything is flushed to disk
    for fh in (train_files, val_files, val_folded, test_files,
               train_cap, val_cap):
        fh.close()

Return to the `OpenNMT-py` directory. Now we preprocess the data for training.
We preprocess with a small shard size of 1000. This keeps the amount of data in memory (RAM) to a
manageable 10 GB. If you have more RAM, you can increase the shard size.

Preprocess the data with

.. code-block:: bash

    onmt_preprocess -data_type vec -train_src $YT2T/yt2t_train_files.txt -src_dir $YT2T/r152/ -train_tgt $YT2T/yt2t_train_cap.txt -valid_src $YT2T/yt2t_val_files.txt -valid_tgt $YT2T/yt2t_val_cap.txt -save_data data/yt2t --shard_size 1000

Train with

.. code-block:: bash

    onmt_train -data data/yt2t -save_model yt2t-model -world_size 2 -gpu_ranks 0 1 -model_type vec -batch_size 64 -train_steps 10000 -valid_steps 500 -save_checkpoint_steps 500 -encoder_type brnn -optim adam -learning_rate .0001 -feat_vec_size 2048

Translate with

.. code-block:: bash

    onmt_translate -model yt2t-model_step_7200.pt -src $YT2T/yt2t_test_files.txt -output pred.txt -verbose -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 10

.. note::

    Generally, you want to keep the model that has the lowest validation perplexity. That turned out to be
    at step 7200, but a different validation frequency or random seed could yield a different best step.

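Picking the checkpoint with the lowest validation perplexity can be sketched as below. The perplexity values are made up for illustration, and ``best_checkpoint`` is a hypothetical helper; only the ``<prefix>_step_<N>.pt`` naming is taken from the commands above.

```python
def best_checkpoint(val_ppl_by_step, prefix="yt2t-model"):
    """Return the checkpoint filename whose validation perplexity is lowest."""
    best_step = min(val_ppl_by_step, key=val_ppl_by_step.get)
    return "{}_step_{}.pt".format(prefix, best_step)


# Illustrative numbers only; read the real ones from your training log.
val_ppl = {6800: 21.4, 7200: 19.8, 7600: 20.3}
print(best_checkpoint(val_ppl))
```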
Then you can use `coco-caption `_ to evaluate the predictions.
(Note that the fork `flauted `_ can be used for Python 3 compatibility).
Install the git repository with pip using
.. code-block:: bash
pip install git+
Then use the following Python code to evaluate:
.. code-block:: python

    import os
    import pickle
    from pprint import pprint
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.spice.spice import Spice


    if __name__ == "__main__":
        pred = open("pred.txt")

        YT = os.path.join(os.environ["YT2T"], "youtube2text_iccv15")

        with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
            ann = pickle.load(f, encoding="latin-1")

        # collect all reference captions per video
        vid2anns = {}
        for vid_name, data in ann.items():
            for d in data:
                try:
                    vid2anns[vid_name].append(d["tokenized"])
                except KeyError:
                    vid2anns[vid_name] = [d["tokenized"]]

        test_files = open(os.path.join(os.environ["YT2T"], "yt2t_test_files.txt"))

        scorers = {
            "Bleu": Bleu(4),
            "Meteor": Meteor(),
            "Rouge": Rouge(),
            "Cider": Cider(),
            "Spice": Spice()
        }

        # ground truths and predictions, keyed by video id
        gts = {}
        res = {}
        for outp, filename in zip(pred, test_files):
            filename = filename.strip("\n")
            outp = outp.strip("\n")
            vid_id = os.path.splitext(filename)[0]
            anns = vid2anns[vid_id]
            gts[vid_id] = anns
            res[vid_id] = [outp]

        scores = {}
        for name, scorer in scorers.items():
            score, all_scores = scorer.compute_score(gts, res)
            if isinstance(score, list):
                # BLEU returns one score per n-gram order
                for i, sc in enumerate(score, 1):
                    scores[name + str(i)] = sc
            else:
                scores[name] = score
        pprint(scores)

Here are our results ::

    {'Bleu1': 0.7888553878084233,
     'Bleu2': 0.6729376621109295,
     'Bleu3': 0.5778428507344473,
     'Bleu4': 0.47633625833397897,
     'Cider': 0.7122415518428051,
     'Meteor': 0.31829562714082704,
     'Rouge': 0.6811305229481235,
     'Spice': 0.044147089472463576}

So how does this stack up against the paper? These results should be compared to the "Global (Temporal Attention)"
row in Table 1. The authors report BLEU4 0.4028, METEOR 0.2900, and CIDEr 0.4801. So, our results are a significant
improvement. Our architecture follows the general encoder + attentional decoder described in the paper, but the
actual attention implementation is slightly different. The paper downsamples by choosing 26 equally spaced frames from
the first 240, while we downsample the video to 2 fps. Also, we use ResNet features instead of GoogLeNet, and we
lowercase while the paper does not, so some improvement is expected.
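To make the two downsampling schemes concrete, here is a small sketch. Both helpers are hypothetical, and the rounding used to pick "equally spaced" frames is an assumption, since the exact rule from the paper is not given here.

```python
def equally_spaced_indices(num_frames, num_samples=26, max_frames=240):
    """Pick num_samples evenly spaced frame indices from the first max_frames frames."""
    usable = min(num_frames, max_frames)
    if usable <= num_samples:
        return list(range(usable))
    step = (usable - 1) / (num_samples - 1)
    return [int(round(i * step)) for i in range(num_samples)]


def frames_at_fps(duration_seconds, fps=2):
    """Approximate frame count after re-encoding a clip at a fixed frame rate."""
    return int(duration_seconds * fps)


paper_style = equally_spaced_indices(300)
print(len(paper_style), paper_style[0], paper_style[-1])
print(frames_at_fps(10))  # a 10-second clip at 2 fps
```

The first scheme always yields a fixed-length sequence, while the fixed-rate scheme produces sequences whose length grows with the clip duration.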
Transformer
-----------
Now we will try to replicate the baseline transformer results from
`"TVT: Two-View Transformer Network for Video Captioning" `_
on the MSVD (YouTube2Text) dataset. See Table 3, Base model(R).
In Section 4.3, the authors report most of their preprocessing and hyperparameters.
Create a folder called *yt2t_2*. Copy the *youtube2text_iccv15* directory and *YouTubeClips.tar* into
the new directory and untar *YouTubeClips*. Rerun the renaming code. Subsample at 5 FPS using

.. code-block:: bash

    for fi in $( ls ); do ffmpeg -y -i $fi -r 5 $fi; done

Set the environment variable ``$YT2T`` to this new directory and change to the repo directory.
Run the feature extraction command again to extract ResNet features on the frames.
Then use this reprocessing code. Note that it shuffles the data differently, and it performs
tokenization similar to what the authors report.
.. code-block:: python

    import pickle
    import os
    import random
    import string

    seed = 2345
    random.seed(seed)


    YT = "youtube2text_iccv15"
    SHUFFLE = True

    with open(os.path.join(YT, "CAP.pkl"), "rb") as f:
        ann = pickle.load(f, encoding="latin-1")


    def clean(caption):
        caption = caption.lower()
        caption = caption.replace("\n", " ").replace("\t", " ").replace("\r", " ")
        # remove punctuation
        caption = caption.translate(str.maketrans("", "", string.punctuation))
        # multiple whitespace
        caption = " ".join(caption.split())
        return caption


    with open(os.path.join(YT, "train.pkl"), "rb") as f:
        train = pickle.load(f, encoding="latin-1")

    with open(os.path.join(YT, "valid.pkl"), "rb") as f:
        val = pickle.load(f, encoding="latin-1")

    with open(os.path.join(YT, "test.pkl"), "rb") as f:
        test = pickle.load(f, encoding="latin-1")

    train_data = []
    val_data = []
    test_data = []
    for vid_name, data in ann.items():
        vid_path = vid_name + ".npy"
        for i, d in enumerate(data):
            split_name = vid_name + "_" + str(i)
            datum = (vid_path, i, clean(d["caption"]))
            if split_name in train:
                train_data.append(datum)
            elif split_name in val:
                val_data.append(datum)
            elif split_name in test:
                test_data.append(datum)
            else:
                assert False

    if SHUFFLE:
        random.shuffle(train_data)

    train_files = open("yt2t_train_files.txt", "w")
    train_cap = open("yt2t_train_cap.txt", "w")

    for vid_path, _, an in train_data:
        train_files.write(vid_path + "\n")
        train_cap.write(an + "\n")

    train_files.close()
    train_cap.close()

    val_files = open("yt2t_val_files.txt", "w")
    val_folded = open("yt2t_val_folded_files.txt", "w")
    val_cap = open("yt2t_val_cap.txt", "w")

    for vid_path, i, an in val_data:
        if i == 0:
            val_folded.write(vid_path + "\n")
        val_files.write(vid_path + "\n")
        val_cap.write(an + "\n")

    val_files.close()
    val_folded.close()
    val_cap.close()

    test_files = open("yt2t_test_files.txt", "w")

    for vid_path, i, an in test_data:
        # Don't need to save out the test captions,
        # just the files. And, don't need to repeat
        # it for each caption
        if i == 0:
            test_files.write(vid_path + "\n")

    test_files.close()

Then preprocess the data with max-length filtering. (Note you will be prompted to remove the
old data. Do this, i.e. ``rm data/yt2t.*.pt``.)

.. code-block:: bash

    onmt_preprocess -data_type vec -train_src $YT2T/yt2t_train_files.txt -src_dir $YT2T/r152/ -train_tgt $YT2T/yt2t_train_cap.txt -valid_src $YT2T/yt2t_val_files.txt -valid_tgt $YT2T/yt2t_val_cap.txt -save_data data/yt2t --shard_size 1000 --src_seq_length 50 --tgt_seq_length 20

Delete the old checkpoints and train a transformer model on this data.

.. code-block:: bash

    rm -r yt2t-model_step_*.pt; onmt_train -data data/yt2t -save_model yt2t-model -world_size 2 -gpu_ranks 0 1 -model_type vec -batch_size 64 -train_steps 8000 -valid_steps 400 -save_checkpoint_steps 400 -optim adam -learning_rate .0001 -feat_vec_size 2048 -layers 4 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -dropout 0.3 -param_init 0 -param_init_glorot -report_every 400 --share_decoder_embedding --seed 7000

Note we use the hyperparameters described in the paper.
We estimate the length of 20 epochs with ``-train_steps``. Note that this depends on
using a world size of 2. If you use a different world size, scale the ``-train_steps`` (and
``-save_checkpoint_steps``, along with other parameters) accordingly.
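As a rough illustration of that scaling, the sketch below rescales step-based options for a different number of GPUs. It assumes each GPU processes its own batch per step, so the number of examples seen per step grows linearly with the world size; ``scale_steps`` is a hypothetical helper, not an OpenNMT-py function.

```python
def scale_steps(steps, reference_world_size, new_world_size):
    """Rescale a step count so the total number of examples seen stays the same."""
    return int(round(steps * reference_world_size / new_world_size))


# Values from the command above were chosen for a world size of 2.
for option, steps in [("-train_steps", 8000),
                      ("-valid_steps", 400),
                      ("-save_checkpoint_steps", 400)]:
    print(option, scale_steps(steps, reference_world_size=2, new_world_size=4))
```

With a single GPU the same helper doubles each value instead of halving it.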
The batch size is not specified in the paper, so we assume one checkpoint
per our estimated epoch. Sharing
the decoder embeddings is not mentioned either, although we find that it helps performance. Like the paper, we perform
"early stopping" with the COCO scores. We use beam search during early stopping,
although this too is not mentioned. You can reproduce our early stops with these scripts
(namely, running `find_val_stops.sh` and then `test_early_stops.sh`;
`process_results.py` is a dependency of `find_val_stops.sh`):
.. code-block:: python
    :caption: `process_results.py`

    import argparse
    from collections import defaultdict
    import pandas as pd


    def load_results(fname="results.txt"):
        index = []
        data = []
        with open(fname, "r") as f:
            while True:
                try:
                    filename = next(f).strip()
                except StopIteration:
                    # end of the results file
                    break
                step = int(filename.split("_")[-1].split(".")[0])
                next(f)  # blank
                next(f)  # spice junk
                next(f)  # length stats
                next(f)  # ratios
                scores = {}
                while True:
                    score_line = next(f).strip().strip("{").strip(",")
                    metric, score = score_line.split(": ")
                    metric = metric.strip("'")
                    score_num = float(score.strip("}").strip(","))
                    scores[metric] = float(score_num)
                    if score.endswith("}"):
                        break
                next(f)  # blank
                next(f)  # blank
                next(f)  # blank
                index.append(step)
                data.append(scores)
        df = pd.DataFrame(data, index=index)
        return df


    def find_absolute_stops(df):
        return df.idxmax()


    def find_early_stops(df, stop_count):
        maxes = defaultdict(lambda: 0)
        argmaxes = {}
        count_since_max = {}
        ended_metrics = set()
        for index, row in df.iterrows():
            for metric, score in row.items():
                if metric in ended_metrics:
                    continue
                if score >= maxes[metric]:
                    maxes[metric] = score
                    argmaxes[metric] = index
                    count_since_max[metric] = 0
                else:
                    count_since_max[metric] += 1
                    if count_since_max[metric] == stop_count:
                        ended_metrics.add(metric)
            if len(ended_metrics) == len(row):
                break
        return pd.Series(argmaxes)


    def find_stops(df, stop_count):
        if stop_count > 0:
            return find_early_stops(df, stop_count)
        else:
            return find_absolute_stops(df)


    if __name__ == "__main__":
        parser = argparse.ArgumentParser("Find locations of best scores")
        parser.add_argument(
            "-s", "--stop_count", type=int, default=0,
            help="Stop after this many scores worse than running max (0 to disable).")
        args = parser.parse_args()
        df = load_results()
        maxes = find_stops(df, args.stop_count)
        # Series.items() replaces iteritems(), which newer pandas versions removed
        for metric, idx in maxes.items():
            print(f"{metric} maxed @ {idx}")
            print(df.loc[idx])
            print()

.. code-block:: bash
    :caption: `find_val_stops.sh`

    rm results.txt
    touch results.txt
    for file in $( ls -1v yt2t-model_step*.pt )
    do
        echo $file
        onmt_translate -model $file -src $YT2T/yt2t_val_folded_files.txt -output pred.txt -verbose -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 16 -max_length 20 >/dev/null 2>/dev/null
        echo -e "$file\n" >> results.txt
        python coco.py -s val >> results.txt
        echo -e "\n\n" >> results.txt
    done
    python process_results.py -s 10 > val_stops.txt

.. code-block:: bash
    :caption: `test_early_stops.sh`

    rm test_results.txt
    touch test_results.txt
    while IFS='' read -r line || [[ -n "$line" ]]; do
        if [[ $line == *"maxed"* ]]; then
            metric=$(echo $line | awk '{print $1}')
            step=$(echo $line | awk '{print $NF}')
            echo $metric early stopped @ $step | tee -a test_results.txt
            onmt_translate -model "yt2t-model_step_${step}.pt" -src $YT2T/yt2t_test_files.txt -output pred.txt -data_type vec -src_dir $YT2T/r152 -gpu 0 -batch_size 16 -max_length 20 >/dev/null 2>/dev/null
            python coco.py -s 'test' >> test_results.txt
            echo -e "\n\n" >> test_results.txt
        fi
    done < val_stops.txt
    cat test_results.txt

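The early-stopping rule applied by `process_results.py` can be illustrated for a single metric without pandas. This is a simplified sketch: ``early_stop_step`` is a hypothetical helper and the validation scores are made up.

```python
def early_stop_step(scores_by_step, patience):
    """Return the step of the best score seen before `patience` worse scores in a row.

    `scores_by_step` is an iterable of (step, score) pairs in training order.
    A patience of 0 disables early stopping and returns the overall best step.
    """
    best_score = float("-inf")
    best_step = None
    worse_in_a_row = 0
    for step, score in scores_by_step:
        if score >= best_score:
            best_score, best_step = score, step
            worse_in_a_row = 0
        else:
            worse_in_a_row += 1
            if worse_in_a_row == patience:
                break
    return best_step


# Illustrative validation scores only.
val_meteor = [(400, 0.28), (800, 0.31), (1200, 0.30), (1600, 0.29), (2000, 0.33)]
print(early_stop_step(val_meteor, patience=2))   # stops before seeing step 2000
print(early_stop_step(val_meteor, patience=10))  # patience never exhausted
```

A small patience can stop before a later, better checkpoint is reached, which is why the scripts above use a stop count of 10.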
Thus we test the checkpoint at step 2000 and find the following scores::

    Meteor early stopped @ 2000
    SPICE evaluation took: 2.522 s
    {'testlen': 3410, 'reflen': 3417, 'guess': [3410, 2740, 2070, 1400], 'correct': [2664, 1562, 887, 386]}
    ratio: 0.9979514193734276
    {'Bleu1': 0.7796296150773093,
     'Bleu2': 0.6659837622637965,
     'Bleu3': 0.5745524496015597,
     'Bleu4': 0.4779574102543823,
     'Cider': 0.7541600090591118,
     'Meteor': 0.3259497476899707,
     'Rouge': 0.6800279518634998,
     'Spice': 0.046435637924854}

Note that our scores improve on the recurrent approach.
The paper reports
BLEU4 50.25, CIDEr 72.11, METEOR 33.41, and ROUGE 70.16 (on a 0-100 scale, whereas the scores above are on a 0-1 scale).
Our CIDEr score is higher than the paper's (but, considering the sensitivity of this
metric, not by much), while the other metrics are slightly lower.
This could indicate an implementation difference. Note that Table 5 reports
24M parameters for a 2-layer transformer with ResNet inputs, while we find a few million fewer. This
could be due to generator or embedding differences, or perhaps linear layers on the
residual connections. Alternatively, the difference could lie in the initial tokenization.
The paper reports 9861 tokens, while we find fewer.
Part of this could be due to using
the annotations from the other repository, where perhaps some annotations have been
stripped. We also do not know the batch size or checkpoint frequency from the original
work.
Different random initializations could account for some of the difference, although
our random seed gives good results.
Overall, however, the paper's scores are nearly reproduced,
and our results compare favorably.
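To compare vocabulary sizes yourself, you can count the distinct tokens in the caption files. The sketch below assumes whitespace tokenization of already-cleaned captions; ``count_vocab`` is a hypothetical helper and the two captions are illustrative.

```python
from collections import Counter


def count_vocab(lines):
    """Count whitespace-separated tokens over an iterable of caption lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


captions = [
    "a man is playing a guitar",
    "a woman is slicing an onion",
]
vocab = count_vocab(captions)
print(len(vocab))           # number of distinct tokens
print(sum(vocab.values()))  # total number of tokens
```

For the real data, pass an open file handle for ``yt2t_train_cap.txt`` instead of the list.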
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/floyd.yml
================================================
env: pytorch-0.4
machine: cpu
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/floyd_requirements.txt
================================================
git+https://github.com/pytorch/text
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/github_deploy_key_opennmt_opennmt_py.enc
================================================
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/__init__.py
================================================
""" Main entry point of the ONMT library """
from __future__ import division, print_function
import onmt.inputters
import onmt.encoders
import onmt.decoders
import onmt.models
import onmt.utils
import onmt.modules
from onmt.trainer import Trainer
import sys
import onmt.utils.optimizers
onmt.utils.optimizers.Optim = onmt.utils.optimizers.Optimizer
sys.modules["onmt.Optim"] = onmt.utils.optimizers
# For Flake
__all__ = [onmt.inputters, onmt.encoders, onmt.decoders, onmt.models,
onmt.utils, onmt.modules, "Trainer"]
__version__ = "1.0.0.rc2"
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/__init__.py
================================================
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/average_models.py
================================================
#!/usr/bin/env python
import argparse
import torch


def average_models(model_files, fp32=False):
    vocab = None
    opt = None
    avg_model = None
    avg_generator = None

    for i, model_file in enumerate(model_files):
        m = torch.load(model_file, map_location='cpu')
        model_weights = m['model']
        generator_weights = m['generator']

        if fp32:
            for k, v in model_weights.items():
                model_weights[k] = v.float()
            for k, v in generator_weights.items():
                generator_weights[k] = v.float()

        if i == 0:
            vocab, opt = m['vocab'], m['opt']
            avg_model = model_weights
            avg_generator = generator_weights
        else:
            for (k, v) in avg_model.items():
                avg_model[k].mul_(i).add_(model_weights[k]).div_(i + 1)

            for (k, v) in avg_generator.items():
                avg_generator[k].mul_(i).add_(generator_weights[k]).div_(i + 1)

    final = {"vocab": vocab, "opt": opt, "optim": None,
             "generator": avg_generator, "model": avg_model}
    return final


def main():
    parser = argparse.ArgumentParser(description="")
    parser.add_argument("-models", "-m", nargs="+", required=True,
                        help="List of models")
    parser.add_argument("-output", "-o", required=True,
                        help="Output file")
    parser.add_argument("-fp32", "-f", action="store_true",
                        help="Cast params to float32")
    opt = parser.parse_args()

    final = average_models(opt.models, opt.fp32)
    torch.save(final, opt.output)


if __name__ == "__main__":
    main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/preprocess.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Pre-process Data / features files and build vocabulary
"""
import codecs
import glob
import gc
import torch
from collections import Counter, defaultdict
from onmt.utils.logging import init_logger, logger
from onmt.utils.misc import split_corpus
import onmt.inputters as inputters
import onmt.opts as opts
from onmt.utils.parse import ArgumentParser
from onmt.inputters.inputter import _build_fields_vocab,\
_load_vocab
from functools import partial
from multiprocessing import Pool
def check_existing_pt_files(opt, corpus_type, ids, existing_fields):
    """ Check if there are existing .pt files to avoid overwriting them """
    existing_shards = []
    for maybe_id in ids:
        if maybe_id:
            shard_base = corpus_type + "_" + maybe_id
        else:
            shard_base = corpus_type
        pattern = opt.save_data + '.{}.*.pt'.format(shard_base)
        if glob.glob(pattern):
            if opt.overwrite:
                maybe_overwrite = ("will be overwritten because "
                                   "`-overwrite` option is set.")
            else:
                maybe_overwrite = ("won't be overwritten, pass the "
                                   "`-overwrite` option if you want to.")
            logger.warning("Shards for corpus {} already exist, {}"
                           .format(shard_base, maybe_overwrite))
            existing_shards += [maybe_id]
    return existing_shards


def process_one_shard(corpus_params, params):
    corpus_type, fields, src_reader, tgt_reader, align_reader, opt,\
        existing_fields, src_vocab, tgt_vocab = corpus_params
    i, (src_shard, tgt_shard, align_shard, maybe_id, filter_pred) = params
    # create one counter per shard
    sub_sub_counter = defaultdict(Counter)
    assert len(src_shard) == len(tgt_shard)
    logger.info("Building shard %d." % i)

    src_data = {"reader": src_reader, "data": src_shard, "dir": opt.src_dir}
    tgt_data = {"reader": tgt_reader, "data": tgt_shard, "dir": None}
    align_data = {"reader": align_reader, "data": align_shard, "dir": None}
    _readers, _data, _dir = inputters.Dataset.config(
        [('src', src_data), ('tgt', tgt_data), ('align', align_data)])

    dataset = inputters.Dataset(
        fields, readers=_readers, data=_data, dirs=_dir,
        sort_key=inputters.str2sortkey[opt.data_type],
        filter_pred=filter_pred
    )
    if corpus_type == "train" and existing_fields is None:
        for ex in dataset.examples:
            for name, field in fields.items():
                if ((opt.data_type == "audio") and (name == "src")):
                    continue
                try:
                    f_iter = iter(field)
                except TypeError:
                    f_iter = [(name, field)]
                    all_data = [getattr(ex, name, None)]
                else:
                    all_data = getattr(ex, name)
                for (sub_n, sub_f), fd in zip(
                        f_iter, all_data):
                    has_vocab = (sub_n == 'src' and
                                 src_vocab is not None) or \
                                (sub_n == 'tgt' and
                                 tgt_vocab is not None)
                    if (hasattr(sub_f, 'sequential')
                            and sub_f.sequential and not has_vocab):
                        val = fd
                        sub_sub_counter[sub_n].update(val)
    if maybe_id:
        shard_base = corpus_type + "_" + maybe_id
    else:
        shard_base = corpus_type
    data_path = "{:s}.{:s}.{:d}.pt".\
        format(opt.save_data, shard_base, i)

    logger.info(" * saving %sth %s data shard to %s."
                % (i, shard_base, data_path))

    dataset.save(data_path)

    del dataset.examples
    gc.collect()
    del dataset
    gc.collect()

    return sub_sub_counter


def maybe_load_vocab(corpus_type, counters, opt):
    src_vocab = None
    tgt_vocab = None
    existing_fields = None
    if corpus_type == "train":
        if opt.src_vocab != "":
            try:
                logger.info("Using existing vocabulary...")
                existing_fields = torch.load(opt.src_vocab)
            except torch.serialization.pickle.UnpicklingError:
                logger.info("Building vocab from text file...")
                src_vocab, src_vocab_size = _load_vocab(
                    opt.src_vocab, "src", counters,
                    opt.src_words_min_frequency)
        if opt.tgt_vocab != "":
            tgt_vocab, tgt_vocab_size = _load_vocab(
                opt.tgt_vocab, "tgt", counters,
                opt.tgt_words_min_frequency)
    return src_vocab, tgt_vocab, existing_fields


def build_save_dataset(corpus_type, fields, src_reader, tgt_reader,
                       align_reader, opt):
    assert corpus_type in ['train', 'valid']

    if corpus_type == 'train':
        counters = defaultdict(Counter)
        srcs = opt.train_src
        tgts = opt.train_tgt
        ids = opt.train_ids
        aligns = opt.train_align
    elif corpus_type == 'valid':
        counters = None
        srcs = [opt.valid_src]
        tgts = [opt.valid_tgt]
        ids = [None]
        aligns = [opt.valid_align]

    src_vocab, tgt_vocab, existing_fields = maybe_load_vocab(
        corpus_type, counters, opt)

    existing_shards = check_existing_pt_files(
        opt, corpus_type, ids, existing_fields)

    # every corpus has shards, no new one
    if existing_shards == ids and not opt.overwrite:
        return

    def shard_iterator(srcs, tgts, ids, aligns, existing_shards,
                       existing_fields, corpus_type, opt):
        """
        Builds a single iterator yielding every shard of every corpus.
        """
        for src, tgt, maybe_id, maybe_align in zip(srcs, tgts, ids, aligns):
            if maybe_id in existing_shards:
                if opt.overwrite:
                    logger.warning("Overwrite shards for corpus {}"
                                   .format(maybe_id))
                else:
                    if corpus_type == "train":
                        assert existing_fields is not None,\
                            ("A 'vocab.pt' file should be passed to "
                             "`-src_vocab` when adding a corpus to "
                             "a set of already existing shards.")
                    logger.warning("Ignore corpus {} because "
                                   "shards already exist"
                                   .format(maybe_id))
                    continue
            if ((corpus_type == "train" or opt.filter_valid)
                    and tgt is not None):
                filter_pred = partial(
                    inputters.filter_example,
                    use_src_len=opt.data_type == "text",
                    max_src_len=opt.src_seq_length,
                    max_tgt_len=opt.tgt_seq_length)
            else:
                filter_pred = None
            src_shards = split_corpus(src, opt.shard_size)
            tgt_shards = split_corpus(tgt, opt.shard_size)
            align_shards = split_corpus(maybe_align, opt.shard_size)
            for i, (ss, ts, a_s) in enumerate(
                    zip(src_shards, tgt_shards, align_shards)):
                yield (i, (ss, ts, a_s, maybe_id, filter_pred))

    shard_iter = shard_iterator(srcs, tgts, ids, aligns, existing_shards,
                                existing_fields, corpus_type, opt)

    with Pool(opt.num_threads) as p:
        dataset_params = (corpus_type, fields, src_reader, tgt_reader,
                          align_reader, opt, existing_fields,
                          src_vocab, tgt_vocab)
        func = partial(process_one_shard, dataset_params)
        for sub_counter in p.imap(func, shard_iter):
            if sub_counter is not None:
                for key, value in sub_counter.items():
                    counters[key].update(value)

    if corpus_type == "train":
        vocab_path = opt.save_data + '.vocab.pt'
        if existing_fields is None:
            fields = _build_fields_vocab(
                fields, counters, opt.data_type,
                opt.share_vocab, opt.vocab_size_multiple,
                opt.src_vocab_size, opt.src_words_min_frequency,
                opt.tgt_vocab_size, opt.tgt_words_min_frequency)
        else:
            fields = existing_fields
        torch.save(fields, vocab_path)


def build_save_vocab(train_dataset, fields, opt):
    fields = inputters.build_vocab(
        train_dataset, fields, opt.data_type, opt.share_vocab,
        opt.src_vocab, opt.src_vocab_size, opt.src_words_min_frequency,
        opt.tgt_vocab, opt.tgt_vocab_size, opt.tgt_words_min_frequency,
        vocab_size_multiple=opt.vocab_size_multiple
    )
    vocab_path = opt.save_data + '.vocab.pt'
    torch.save(fields, vocab_path)


def count_features(path):
    """
    path: location of a corpus file with whitespace-delimited tokens and
          │-delimited features within the token
    returns: the number of features in the dataset
    """
    with codecs.open(path, "r", "utf-8") as f:
        # Only the first token of the first line is inspected.
        first_tok = f.readline().split(None, 1)[0]
        return len(first_tok.split(u"│")) - 1


def preprocess(opt):
    ArgumentParser.validate_preprocess_args(opt)
    torch.manual_seed(opt.seed)

    init_logger(opt.log_file)

    logger.info("Extracting features...")

    src_nfeats = 0
    tgt_nfeats = 0
    for src, tgt in zip(opt.train_src, opt.train_tgt):
        src_nfeats += count_features(src) if opt.data_type == 'text' \
            else 0
        tgt_nfeats += count_features(tgt)  # tgt always text so far
    logger.info(" * number of source features: %d." % src_nfeats)
    logger.info(" * number of target features: %d." % tgt_nfeats)

    logger.info("Building `Fields` object...")
    fields = inputters.get_fields(
        opt.data_type,
        src_nfeats,
        tgt_nfeats,
        dynamic_dict=opt.dynamic_dict,
        with_align=opt.train_align[0] is not None,
        src_truncate=opt.src_seq_length_trunc,
        tgt_truncate=opt.tgt_seq_length_trunc)

    src_reader = inputters.str2reader[opt.data_type].from_opt(opt)
    tgt_reader = inputters.str2reader["text"].from_opt(opt)
    align_reader = inputters.str2reader["text"].from_opt(opt)

    logger.info("Building & saving training data...")
    build_save_dataset(
        'train', fields, src_reader, tgt_reader, align_reader, opt)

    if opt.valid_src and opt.valid_tgt:
        logger.info("Building & saving validation data...")
        build_save_dataset(
            'valid', fields, src_reader, tgt_reader, align_reader, opt)


def _get_parser():
    parser = ArgumentParser(description='preprocess.py')

    opts.config_opts(parser)
    opts.preprocess_opts(parser)
    return parser


def main():
    parser = _get_parser()

    opt = parser.parse_args()
    preprocess(opt)


if __name__ == "__main__":
    main()
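# Illustrative sketch, not part of preprocess.py: how `count_features` derives
# the feature count from the first token of a corpus. The helper name
# `count_features_in_line` and the sample line are assumptions made for this
# example; the delimiter is copied from `count_features` above. Unlike the
# original, the sketch returns 0 for an empty line instead of raising.

```python
import io

# Assumed delimiter: the same character that count_features splits on.
FEATURE_DELIMITER = u"│"


def count_features_in_line(line, delimiter=FEATURE_DELIMITER):
    """Hypothetical helper: count features attached to the first token."""
    tokens = line.split(None, 1)
    if not tokens:
        return 0
    # "word│f1│f2" splits into 3 parts, i.e. 2 features.
    return len(tokens[0].split(delimiter)) - 1


# In-memory stand-in for a corpus file with two features per token.
sample = io.StringIO(u"hello│NOUN│sg world│NOUN│sg\n")
n_feats = count_features_in_line(sample.readline())
```

# Only the first token is inspected, so a corpus whose later tokens carry a
# different number of features would not be detected by this check.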
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/server.py
================================================
#!/usr/bin/env python
import configargparse

from flask import Flask, jsonify, request
from onmt.translate import TranslationServer, ServerModelError

STATUS_OK = "ok"
STATUS_ERROR = "error"


def start(config_file,
          url_root="./translator",
          host="0.0.0.0",
          port=5000,
          debug=True):
    def prefix_route(route_function, prefix='', mask='{0}{1}'):
        # Wrap `app.route` so that every rule is registered under `prefix`.
        def newroute(route, *args, **kwargs):
            return route_function(mask.format(prefix, route), *args, **kwargs)
        return newroute

    app = Flask(__name__)
    app.route = prefix_route(app.route, url_root)
    translation_server = TranslationServer()
    translation_server.start(config_file)

    @app.route('/models', methods=['GET'])
    def get_models():
        out = translation_server.list_models()
        return jsonify(out)

    @app.route('/health', methods=['GET'])
    def health():
        out = {}
        out['status'] = STATUS_OK
        return jsonify(out)

    # The view takes `model_id`, so the rule must declare it as a URL variable.
    @app.route('/clone_model/<int:model_id>', methods=['POST'])
    def clone_model(model_id):
        out = {}
        data = request.get_json(force=True)
        timeout = -1
        if 'timeout' in data:
            timeout = data['timeout']
            del data['timeout']

        opt = data.get('opt', None)
        try:
            model_id, load_time = translation_server.clone_model(
                model_id, opt, timeout)
        except ServerModelError as e:
            out['status'] = STATUS_ERROR
            out['error'] = str(e)
        else:
            out['status'] = STATUS_OK
            out['model_id'] = model_id
            out['load_time'] = load_time

        return jsonify(out)

    @app.route('/unload_model/<int:model_id>', methods=['GET'])
    def unload_model(model_id):
        out = {"model_id": model_id}

        try:
            translation_server.unload_model(model_id)
            out['status'] = STATUS_OK
        except Exception as e:
            out['status'] = STATUS_ERROR
            out['error'] = str(e)

        return jsonify(out)

    @app.route('/translate', methods=['POST'])
    def translate():
        inputs = request.get_json(force=True)
        out = {}
        try:
            trans, scores, n_best, _, aligns = translation_server.run(inputs)
            assert len(trans) == len(inputs) * n_best
            assert len(scores) == len(inputs) * n_best
            assert len(aligns) == len(inputs) * n_best

            # Hypotheses are flattened input-major: entry i belongs to
            # input i // n_best and has rank i % n_best.
            out = [[] for _ in range(n_best)]
            for i in range(len(trans)):
                response = {"src": inputs[i // n_best]['src'], "tgt": trans[i],
                            "n_best": n_best, "pred_score": scores[i]}
                if aligns[i] is not None:
                    response["align"] = aligns[i]
                out[i % n_best].append(response)
        except ServerModelError as e:
            out['error'] = str(e)
            out['status'] = STATUS_ERROR

        return jsonify(out)

    @app.route('/to_cpu/<int:model_id>', methods=['GET'])
    def to_cpu(model_id):
        out = {'model_id': model_id}
        translation_server.models[model_id].to_cpu()

        out['status'] = STATUS_OK
        return jsonify(out)

    @app.route('/to_gpu/<int:model_id>', methods=['GET'])
    def to_gpu(model_id):
        out = {'model_id': model_id}
        translation_server.models[model_id].to_gpu()

        out['status'] = STATUS_OK
        return jsonify(out)

    app.run(debug=debug, host=host, port=port, use_reloader=False,
            threaded=True)


def _get_parser():
    parser = configargparse.ArgumentParser(
        config_file_parser_class=configargparse.YAMLConfigFileParser,
        description="OpenNMT-py REST Server")
    parser.add_argument("--ip", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=int, default=5000)
    parser.add_argument("--url_root", type=str, default="/translator")
    parser.add_argument("--debug", "-d", action="store_true")
    parser.add_argument("--config", "-c", type=str,
                        default="./available_models/conf.json")
    return parser


def main():
    parser = _get_parser()
    args = parser.parse_args()
    start(args.config, url_root=args.url_root, host=args.ip, port=args.port,
          debug=args.debug)


if __name__ == "__main__":
    main()
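# Illustrative sketch, not part of server.py: what `prefix_route` does to the
# route decorator. To stay runnable without Flask, the example uses `TinyApp`,
# a hypothetical stand-in that only records registered rules; it is an
# assumption of this example and not a Flask API.

```python
def prefix_route(route_function, prefix='', mask='{0}{1}'):
    """Same wrapper as in start(): prepend `prefix` to every rule."""
    def newroute(route, *args, **kwargs):
        return route_function(mask.format(prefix, route), *args, **kwargs)
    return newroute


class TinyApp(object):
    """Hypothetical stand-in for an app object; records rule -> view."""

    def __init__(self):
        self.routes = {}

    def route(self, rule, **options):
        def decorator(view):
            self.routes[rule] = view
            return view
        return decorator


app = TinyApp()
# Shadow the bound method with the prefixing wrapper, as start() does.
app.route = prefix_route(app.route, "/translator")


@app.route("/health", methods=["GET"])
def health():
    return {"status": "ok"}


registered = sorted(app.routes)
```

# After the wrapper is installed, the view declared as "/health" is stored
# under the prefixed rule, which is why clients of the real server call
# paths below the configured `url_root`.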
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/train.py
================================================
#!/usr/bin/env python
"""Train models."""
import os
import signal
import torch

import onmt.opts as opts
import onmt.utils.distributed

from onmt.utils.misc import set_random_seed
from onmt.utils.logging import init_logger, logger
from onmt.train_single import main as single_main
from onmt.utils.parse import ArgumentParser
from onmt.inputters.inputter import build_dataset_iter, \
    load_old_vocab, old_style_vocab, build_dataset_iter_multiple

from itertools import cycle


def train(opt):
    ArgumentParser.validate_train_opts(opt)
    ArgumentParser.update_model_opts(opt)
    ArgumentParser.validate_model_opts(opt)

    set_random_seed(opt.seed, False)

    # Load checkpoint if we resume from a previous training.
    if opt.train_from:
        logger.info('Loading checkpoint from %s' % opt.train_from)
        checkpoint = torch.load(opt.train_from,
                                map_location=lambda storage, loc: storage)
        logger.info('Loading vocab from checkpoint at %s.' % opt.train_from)
        vocab = checkpoint['vocab']
    else:
        vocab = torch.load(opt.data + '.vocab.pt')

    # check for code where vocab is saved instead of fields
    # (in the future this will be done in a smarter way)
    if old_style_vocab(vocab):
        fields = load_old_vocab(
            vocab, opt.model_type, dynamic_dict=opt.copy_attn)
    else:
        fields = vocab

    if len(opt.data_ids) > 1:
        # Several corpora: build one iterator over all of their shards.
        train_shards = []
        for train_id in opt.data_ids:
            shard_base = "train_" + train_id
            train_shards.append(shard_base)
        train_iter = build_dataset_iter_multiple(train_shards, fields, opt)
    else:
        if opt.data_ids[0] is not None:
            shard_base = "train_" + opt.data_ids[0]
        else:
            shard_base = "train"
        train_iter = build_dataset_iter(shard_base, fields, opt)

    nb_gpu = len(opt.gpu_ranks)

    if opt.world_size > 1:
        queues = []
        mp = torch.multiprocessing.get_context('spawn')
        semaphore = mp.Semaphore(opt.world_size * opt.queue_size)
        # Create a thread to listen for errors in the child processes.
        error_queue = mp.SimpleQueue()
        error_handler = ErrorHandler(error_queue)
        # Train with multiprocessing.
        procs = []
        for device_id in range(nb_gpu):
            q = mp.Queue(opt.queue_size)
            queues += [q]
            procs.append(mp.Process(target=run, args=(
                opt, device_id, error_queue, q, semaphore), daemon=True))
            procs[device_id].start()
            logger.info(" Starting process pid: %d " % procs[device_id].pid)
            error_handler.add_child(procs[device_id].pid)
        producer = mp.Process(target=batch_producer,
                              args=(train_iter, queues, semaphore, opt,),
                              daemon=True)
        producer.start()
        error_handler.add_child(producer.pid)

        for p in procs:
            p.join()
        producer.terminate()

    elif nb_gpu == 1:  # case 1 GPU only
        single_main(opt, 0)
    else:  # case only CPU
        single_main(opt, -1)


def batch_producer(generator_to_serve, queues, semaphore, opt):
    init_logger(opt.log_file)
    set_random_seed(opt.seed, False)
    # generator_to_serve = iter(generator_to_serve)

    def pred(x):
        """
        Filters batches that belong only
        to gpu_ranks of current node
        """
        for rank in opt.gpu_ranks:
            if x[0] % opt.world_size == rank:
                return True

    generator_to_serve = filter(
        pred, enumerate(generator_to_serve))

    def next_batch(device_id):
        new_batch = next(generator_to_serve)
        semaphore.acquire()
        return new_batch[1]

    b = next_batch(0)

    for device_id, q in cycle(enumerate(queues)):
        b.dataset = None
        if isinstance(b.src, tuple):
            b.src = tuple([_.to(torch.device(device_id))
                           for _ in b.src])
        else:
            b.src = b.src.to(torch.device(device_id))
        b.tgt = b.tgt.to(torch.device(device_id))
        b.indices = b.indices.to(torch.device(device_id))
        b.alignment = b.alignment.to(torch.device(device_id)) \
            if hasattr(b, 'alignment') else None
        b.src_map = b.src_map.to(torch.device(device_id)) \
            if hasattr(b, 'src_map') else None
        b.align = b.align.to(torch.device(device_id)) \
            if hasattr(b, 'align') else None

        # hack to dodge unpicklable `dict_keys`
        b.fields = list(b.fields)
        q.put(b)
        b = next_batch(device_id)


def run(opt, device_id, error_queue, batch_queue, semaphore):
    """ run process """
    try:
        gpu_rank = onmt.utils.distributed.multi_init(opt, device_id)
        if gpu_rank != opt.gpu_ranks[device_id]:
            raise AssertionError("An error occurred in "
                                 "Distributed initialization")
        single_main(opt, device_id, batch_queue, semaphore)
    except KeyboardInterrupt:
        pass  # killed by parent, do nothing
    except Exception:
        # propagate exception to parent process, keeping original traceback
        import traceback
        error_queue.put((opt.gpu_ranks[device_id], traceback.format_exc()))


class ErrorHandler(object):
    """A class that listens for exceptions in children processes and propagates
    the tracebacks to the parent process."""

    def __init__(self, error_queue):
        """ init error handler """
        import signal
        import threading
        self.error_queue = error_queue
        self.children_pids = []
        self.error_thread = threading.Thread(
            target=self.error_listener, daemon=True)
        self.error_thread.start()
        signal.signal(signal.SIGUSR1, self.signal_handler)

    def add_child(self, pid):
        """ register a child process to interrupt on error """
        self.children_pids.append(pid)

    def error_listener(self):
        """ error listener """
        (rank, original_trace) = self.error_queue.get()
        # Put the trace back so that signal_handler can read it.
        self.error_queue.put((rank, original_trace))
        os.kill(os.getpid(), signal.SIGUSR1)

    def signal_handler(self, signalnum, stackframe):
        """ signal handler """
        for pid in self.children_pids:
            os.kill(pid, signal.SIGINT)  # kill children processes
        (rank, original_trace) = self.error_queue.get()
        msg = """\n\n-- Tracebacks above this line can probably
                 be ignored --\n\n"""
        msg += original_trace
        raise Exception(msg)


def _get_parser():
    parser = ArgumentParser(description='train.py')

    opts.config_opts(parser)
    opts.model_opts(parser)
    opts.train_opts(parser)
    return parser


def main():
    parser = _get_parser()

    opt = parser.parse_args()
    train(opt)


if __name__ == "__main__":
    main()
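# Illustrative sketch, not part of train.py: how the `pred` filter inside
# `batch_producer` decides which batches a node keeps. The factory name
# `make_rank_filter` and the two-node layout are assumptions made for this
# example; the modulo test is the one used by `pred`.

```python
def make_rank_filter(gpu_ranks, world_size):
    """Hypothetical standalone version of `pred` in batch_producer."""
    def pred(indexed_batch):
        index, _ = indexed_batch
        # Keep batch `index` if its slot in the global round-robin
        # belongs to one of this node's ranks.
        return any(index % world_size == rank for rank in gpu_ranks)
    return pred


# Assumed setup: two nodes with two GPUs each, so world_size is 4 and
# this node owns global ranks 2 and 3.
batches = ["b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7"]
pred = make_rank_filter(gpu_ranks=[2, 3], world_size=4)
kept = [batch for _, batch in filter(pred, enumerate(batches))]
```

# Every node iterates over the same stream of batches and discards the ones
# owned by other nodes, so no coordination between producers is needed.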
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/bin/translate.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from onmt.utils.logging import init_logger
from onmt.utils.misc import split_corpus
from onmt.translate.translator import build_translator

import onmt.opts as opts
from onmt.utils.parse import ArgumentParser


def translate(opt):
    ArgumentParser.validate_translate_opts(opt)
    logger = init_logger(opt.log_file)

    translator = build_translator(opt, report_score=True)
    src_shards = split_corpus(opt.src, opt.shard_size)
    tgt_shards = split_corpus(opt.tgt, opt.shard_size)
    shard_pairs = zip(src_shards, tgt_shards)

    for i, (src_shard, tgt_shard) in enumerate(shard_pairs):
        logger.info("Translating shard %d." % i)
        translator.translate(
            src=src_shard,
            tgt=tgt_shard,
            src_dir=opt.src_dir,
            batch_size=opt.batch_size,
            batch_type=opt.batch_type,
            attn_debug=opt.attn_debug,
            align_debug=opt.align_debug
        )


def _get_parser():
    parser = ArgumentParser(description='translate.py')

    opts.config_opts(parser)
    opts.translate_opts(parser)
    return parser


def main():
    parser = _get_parser()

    opt = parser.parse_args()
    translate(opt)


if __name__ == "__main__":
    main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/decoders/__init__.py
================================================
"""Module defining decoders."""
from onmt.decoders.decoder import DecoderBase, InputFeedRNNDecoder, \
    StdRNNDecoder
from onmt.decoders.transformer import TransformerDecoder
from onmt.decoders.cnn_decoder import CNNDecoder


str2dec = {"rnn": StdRNNDecoder, "ifrnn": InputFeedRNNDecoder,
           "cnn": CNNDecoder, "transformer": TransformerDecoder}

__all__ = ["DecoderBase", "TransformerDecoder", "StdRNNDecoder", "CNNDecoder",
           "InputFeedRNNDecoder", "str2dec"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/decoders/cnn_decoder.py
================================================
"""Implementation of the CNN Decoder part of
"Convolutional Sequence to Sequence Learning"
"""
import torch
import torch.nn as nn

from onmt.modules import ConvMultiStepAttention, GlobalAttention
from onmt.utils.cnn_factory import shape_transform, GatedConv
from onmt.decoders.decoder import DecoderBase

SCALE_WEIGHT = 0.5 ** 0.5


class CNNDecoder(DecoderBase):
    """Decoder based on "Convolutional Sequence to Sequence Learning"
    :cite:`DBLP:journals/corr/GehringAGYD17`.

    Consists of residual convolutional layers, with ConvMultiStepAttention.
    """

    def __init__(self, num_layers, hidden_size, attn_type,
                 copy_attn, cnn_kernel_width, dropout, embeddings,
                 copy_attn_type):
        super(CNNDecoder, self).__init__()

        self.cnn_kernel_width = cnn_kernel_width
        self.embeddings = embeddings

        # Decoder State
        self.state = {}

        input_size = self.embeddings.embedding_size
        self.linear = nn.Linear(input_size, hidden_size)
        self.conv_layers = nn.ModuleList(
            [GatedConv(hidden_size, cnn_kernel_width, dropout, True)
             for i in range(num_layers)]
        )
        self.attn_layers = nn.ModuleList(
            [ConvMultiStepAttention(hidden_size) for i in range(num_layers)]
        )

        # CNNDecoder has its own attention mechanism.
        # Set up a separate copy attention layer if needed.
        assert not copy_attn, "Copy mechanism not yet tested in conv2conv"
        if copy_attn:
            self.copy_attn = GlobalAttention(
                hidden_size, attn_type=copy_attn_type)
        else:
            self.copy_attn = None

    @classmethod
    def from_opt(cls, opt, embeddings):
        """Alternate constructor."""
        return cls(
            opt.dec_layers,
            opt.dec_rnn_size,
            opt.global_attention,
            opt.copy_attn,
            opt.cnn_kernel_width,
            opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
            embeddings,
            opt.copy_attn_type)

    def init_state(self, _, memory_bank, enc_hidden):
        """Init decoder state."""
        self.state["src"] = (memory_bank + enc_hidden) * SCALE_WEIGHT
        self.state["previous_input"] = None

    def map_state(self, fn):
        self.state["src"] = fn(self.state["src"], 1)
        if self.state["previous_input"] is not None:
            self.state["previous_input"] = fn(self.state["previous_input"], 1)

    def detach_state(self):
        self.state["previous_input"] = self.state["previous_input"].detach()

    def forward(self, tgt, memory_bank, step=None, **kwargs):
        """ See :obj:`onmt.modules.RNNDecoderBase.forward()`"""

        if self.state["previous_input"] is not None:
            tgt = torch.cat([self.state["previous_input"], tgt], 0)

        dec_outs = []
        attns = {"std": []}
        if self.copy_attn is not None:
            attns["copy"] = []

        emb = self.embeddings(tgt)
        assert emb.dim() == 3  # len x batch x embedding_dim

        tgt_emb = emb.transpose(0, 1).contiguous()
        # The output of CNNEncoder.
        src_memory_bank_t = memory_bank.transpose(0, 1).contiguous()
        # The combination of output of CNNEncoder and source embeddings.
        src_memory_bank_c = self.state["src"].transpose(0, 1).contiguous()

        emb_reshape = tgt_emb.contiguous().view(
            tgt_emb.size(0) * tgt_emb.size(1), -1)
        linear_out = self.linear(emb_reshape)
        x = linear_out.view(tgt_emb.size(0), tgt_emb.size(1), -1)
        x = shape_transform(x)

        # Left-pad by (kernel width - 1) so each position only sees the past.
        pad = torch.zeros(x.size(0), x.size(1), self.cnn_kernel_width - 1, 1)

        pad = pad.type_as(x)
        base_target_emb = x

        for conv, attention in zip(self.conv_layers, self.attn_layers):
            new_target_input = torch.cat([pad, x], 2)
            out = conv(new_target_input)
            c, attn = attention(base_target_emb, out,
                                src_memory_bank_t, src_memory_bank_c)
            x = (x + (c + out) * SCALE_WEIGHT) * SCALE_WEIGHT
        output = x.squeeze(3).transpose(1, 2)

        # Process the result and update the attentions.
        dec_outs = output.transpose(0, 1).contiguous()
        if self.state["previous_input"] is not None:
            dec_outs = dec_outs[self.state["previous_input"].size(0):]
            attn = attn[:, self.state["previous_input"].size(0):].squeeze()
            attn = torch.stack([attn])
        attns["std"] = attn
        if self.copy_attn is not None:
            attns["copy"] = attn

        # Update the state.
        self.state["previous_input"] = tgt
        # TODO change the way attns is returned dict => list or tuple (onnx)
        return dec_outs, attns

    def update_dropout(self, dropout):
        for layer in self.conv_layers:
            layer.dropout.p = dropout
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/decoders/decoder.py
================================================
import torch
import torch.nn as nn

from onmt.models.stacked_rnn import StackedLSTM, StackedGRU
from onmt.modules import context_gate_factory, GlobalAttention
from onmt.utils.rnn_factory import rnn_factory

from onmt.utils.misc import aeq


class DecoderBase(nn.Module):
    """Abstract class for decoders.

    Args:
        attentional (bool): The decoder returns non-empty attention.
    """

    def __init__(self, attentional=True):
        super(DecoderBase, self).__init__()
        self.attentional = attentional

    @classmethod
    def from_opt(cls, opt, embeddings):
        """Alternate constructor.

        Subclasses should override this method.
        """

        raise NotImplementedError


class RNNDecoderBase(DecoderBase):
    """Base recurrent attention-based decoder class.

    Specifies the interface used by different decoder types
    and required by :class:`~onmt.models.NMTModel`.


    .. mermaid::

       graph BT
          A[Input]
          subgraph RNN
             C[Pos 1]
             D[Pos 2]
             E[Pos N]
          end
          G[Decoder State]
          H[Decoder State]
          I[Outputs]
          F[memory_bank]
          A--emb-->C
          A--emb-->D
          A--emb-->E
          H-->C
          C-- attn --- F
          D-- attn --- F
          E-- attn --- F
          C-->I
          D-->I
          E-->I
          E-->G
          F---I

    Args:
        rnn_type (str):
            style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
        bidirectional_encoder (bool) : use with a bidirectional encoder
        num_layers (int) : number of stacked layers
        hidden_size (int) : hidden size of each layer
        attn_type (str) : see :class:`~onmt.modules.GlobalAttention`
        attn_func (str) : see :class:`~onmt.modules.GlobalAttention`
        coverage_attn (str): see :class:`~onmt.modules.GlobalAttention`
        context_gate (str): see :class:`~onmt.modules.ContextGate`
        copy_attn (bool): setup a separate copy attention mechanism
        dropout (float) : dropout value for :class:`torch.nn.Dropout`
        embeddings (onmt.modules.Embeddings): embedding module to use
        reuse_copy_attn (bool): reuse the attention for copying
        copy_attn_type (str): The copy attention style. See
            :class:`~onmt.modules.GlobalAttention`.
    """

    def __init__(self, rnn_type, bidirectional_encoder, num_layers,
                 hidden_size, attn_type="general", attn_func="softmax",
                 coverage_attn=False, context_gate=None,
                 copy_attn=False, dropout=0.0, embeddings=None,
                 reuse_copy_attn=False, copy_attn_type="general"):
        super(RNNDecoderBase, self).__init__(
            attentional=attn_type != "none" and attn_type is not None)

        self.bidirectional_encoder = bidirectional_encoder
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.embeddings = embeddings
        self.dropout = nn.Dropout(dropout)

        # Decoder state
        self.state = {}

        # Build the RNN.
        self.rnn = self._build_rnn(rnn_type,
                                   input_size=self._input_size,
                                   hidden_size=hidden_size,
                                   num_layers=num_layers,
                                   dropout=dropout)

        # Set up the context gate.
        self.context_gate = None
        if context_gate is not None:
            self.context_gate = context_gate_factory(
                context_gate, self._input_size,
                hidden_size, hidden_size, hidden_size
            )

        # Set up the standard attention.
        self._coverage = coverage_attn
        if not self.attentional:
            if self._coverage:
                raise ValueError("Cannot use coverage term with no attention.")
            self.attn = None
        else:
            self.attn = GlobalAttention(
                hidden_size, coverage=coverage_attn,
                attn_type=attn_type, attn_func=attn_func
            )

        if copy_attn and not reuse_copy_attn:
            if copy_attn_type == "none" or copy_attn_type is None:
                raise ValueError(
                    "Cannot use copy_attn with copy_attn_type none")
            self.copy_attn = GlobalAttention(
                hidden_size, attn_type=copy_attn_type, attn_func=attn_func
            )
        else:
            self.copy_attn = None

        self._reuse_copy_attn = reuse_copy_attn and copy_attn
        if self._reuse_copy_attn and not self.attentional:
            raise ValueError("Cannot reuse copy attention with no attention.")

    @classmethod
    def from_opt(cls, opt, embeddings):
        """Alternate constructor."""
        return cls(
            opt.rnn_type,
            opt.brnn,
            opt.dec_layers,
            opt.dec_rnn_size,
            opt.global_attention,
            opt.global_attention_function,
            opt.coverage_attn,
            opt.context_gate,
            opt.copy_attn,
            opt.dropout[0] if type(opt.dropout) is list
            else opt.dropout,
            embeddings,
            opt.reuse_copy_attn,
            opt.copy_attn_type)

    def init_state(self, src, memory_bank, encoder_final):
        """Initialize decoder state with last state of the encoder."""
        def _fix_enc_hidden(hidden):
            # The encoder hidden is (layers*directions) x batch x dim.
            # We need to convert it to layers x batch x (directions*dim).
            if self.bidirectional_encoder:
                hidden = torch.cat([hidden[0:hidden.size(0):2],
                                    hidden[1:hidden.size(0):2]], 2)
            return hidden

        if isinstance(encoder_final, tuple):  # LSTM
            self.state["hidden"] = tuple(_fix_enc_hidden(enc_hid)
                                         for enc_hid in encoder_final)
        else:  # GRU
            self.state["hidden"] = (_fix_enc_hidden(encoder_final), )

        # Init the input feed.
        batch_size = self.state["hidden"][0].size(1)
        h_size = (batch_size, self.hidden_size)
        self.state["input_feed"] = \
            self.state["hidden"][0].data.new(*h_size).zero_().unsqueeze(0)
        self.state["coverage"] = None

    def map_state(self, fn):
        self.state["hidden"] = tuple(fn(h, 1) for h in self.state["hidden"])
        self.state["input_feed"] = fn(self.state["input_feed"], 1)
        if self._coverage and self.state["coverage"] is not None:
            self.state["coverage"] = fn(self.state["coverage"], 1)

    def detach_state(self):
        self.state["hidden"] = tuple(h.detach() for h in self.state["hidden"])
        self.state["input_feed"] = self.state["input_feed"].detach()

    def forward(self, tgt, memory_bank, memory_lengths=None, step=None,
                **kwargs):
        """
        Args:
            tgt (LongTensor): sequences of padded tokens
                 ``(tgt_len, batch, nfeats)``.
            memory_bank (FloatTensor): vectors from the encoder
                 ``(src_len, batch, hidden)``.
            memory_lengths (LongTensor): the padded source lengths
                ``(batch,)``.

        Returns:
            (FloatTensor, dict[str, FloatTensor]):

            * dec_outs: output from the decoder (after attn)
              ``(tgt_len, batch, hidden)``.
            * attns: distribution over src at each tgt
              ``(tgt_len, batch, src_len)``.
        """

        dec_state, dec_outs, attns = self._run_forward_pass(
            tgt, memory_bank, memory_lengths=memory_lengths)

        # Update the state with the result.
        if not isinstance(dec_state, tuple):
            dec_state = (dec_state,)
        self.state["hidden"] = dec_state
        self.state["input_feed"] = dec_outs[-1].unsqueeze(0)
        self.state["coverage"] = None
        if "coverage" in attns:
            self.state["coverage"] = attns["coverage"][-1].unsqueeze(0)

        # Concatenates sequence of tensors along a new dimension.
        # NOTE: v0.3 to 0.4: dec_outs / attns[*] may not be list
        #       (in particular in case of SRU) it was not raising error in 0.3
        #       since stack(Variable) was allowed.
        #       In 0.4, SRU returns a tensor that shouldn't be stacked.
        if type(dec_outs) == list:
            dec_outs = torch.stack(dec_outs)

            for k in attns:
                if type(attns[k]) == list:
                    attns[k] = torch.stack(attns[k])
        return dec_outs, attns

    def update_dropout(self, dropout):
        self.dropout.p = dropout
        self.embeddings.update_dropout(dropout)


class StdRNNDecoder(RNNDecoderBase):
    """Standard fully batched RNN decoder with attention.

    Faster implementation, uses CuDNN for implementation.
    See :class:`~onmt.decoders.decoder.RNNDecoderBase` for options.


    Based around the approach from
    "Neural Machine Translation By Jointly Learning To Align and Translate"
    :cite:`Bahdanau2015`

    Implemented without input_feeding and currently with no `coverage_attn`
    or `copy_attn` support.
    """

    def _run_forward_pass(self, tgt, memory_bank, memory_lengths=None):
        """
        Private helper for running the specific RNN forward pass.
        Must be overridden by all subclasses.

        Args:
            tgt (LongTensor): a sequence of input tokens tensors
                ``(len, batch, nfeats)``.
            memory_bank (FloatTensor): output(tensor sequence) from the
                encoder RNN of size ``(src_len, batch, hidden_size)``.
            memory_lengths (LongTensor): the source memory_bank lengths.

        Returns:
            (Tensor, List[FloatTensor], Dict[str, List[FloatTensor]]):

            * dec_state: final hidden state from the decoder.
            * dec_outs: an array of output of every time
              step from the decoder.
            * attns: a dictionary of different
              type of attention Tensor array of every time
              step from the decoder.
        """

        assert self.copy_attn is None  # TODO, no support yet.
        assert not self._coverage  # TODO, no support yet.

        attns = {}
        emb = self.embeddings(tgt)

        if isinstance(self.rnn, nn.GRU):
            rnn_output, dec_state = self.rnn(emb, self.state["hidden"][0])
        else:
            rnn_output, dec_state = self.rnn(emb, self.state["hidden"])

        # Check
        tgt_len, tgt_batch, _ = tgt.size()
        output_len, output_batch, _ = rnn_output.size()
        aeq(tgt_len, output_len)
        aeq(tgt_batch, output_batch)

        # Calculate the attention.
        if not self.attentional:
            dec_outs = rnn_output
        else:
            dec_outs, p_attn = self.attn(
                rnn_output.transpose(0, 1).contiguous(),
                memory_bank.transpose(0, 1),
                memory_lengths=memory_lengths
            )
            attns["std"] = p_attn

        # Calculate the context gate.
        if self.context_gate is not None:
            dec_outs = self.context_gate(
                emb.view(-1, emb.size(2)),
                rnn_output.view(-1, rnn_output.size(2)),
                dec_outs.view(-1, dec_outs.size(2))
            )
            dec_outs = dec_outs.view(tgt_len, tgt_batch, self.hidden_size)

        dec_outs = self.dropout(dec_outs)
        return dec_state, dec_outs, attns

    def _build_rnn(self, rnn_type, **kwargs):
        rnn, _ = rnn_factory(rnn_type, **kwargs)
        return rnn

    @property
    def _input_size(self):
        return self.embeddings.embedding_size


class InputFeedRNNDecoder(RNNDecoderBase):
    """Input feeding based decoder.

    See :class:`~onmt.decoders.decoder.RNNDecoderBase` for options.

    Based around the input feeding approach from
    "Effective Approaches to Attention-based Neural Machine Translation"
    :cite:`Luong2015`


    .. mermaid::

       graph BT
          A[Input n-1]
          AB[Input n]
          subgraph RNN
            E[Pos n-1]
            F[Pos n]
            E --> F
          end
          G[Encoder]
          H[memory_bank n-1]
          A --> E
          AB --> F
          E --> H
          G --> H
    """

    def _run_forward_pass(self, tgt, memory_bank, memory_lengths=None):
        """
        See StdRNNDecoder._run_forward_pass() for description
        of arguments and return values.
        """
        # Additional args check.
        input_feed = self.state["input_feed"].squeeze(0)
        input_feed_batch, _ = input_feed.size()
        _, tgt_batch, _ = tgt.size()
        aeq(tgt_batch, input_feed_batch)
        # END Additional args check.

        dec_outs = []
        attns = {}
        if self.attn is not None:
            attns["std"] = []
        if self.copy_attn is not None or self._reuse_copy_attn:
            attns["copy"] = []
        if self._coverage:
            attns["coverage"] = []

        emb = self.embeddings(tgt)
        assert emb.dim() == 3  # len x batch x embedding_dim

        dec_state = self.state["hidden"]
        coverage = self.state["coverage"].squeeze(0) \
            if self.state["coverage"] is not None else None

        # Input feed concatenates hidden state with
        # input at every time step.
        for emb_t in emb.split(1):
            decoder_input = torch.cat([emb_t.squeeze(0), input_feed], 1)
            rnn_output, dec_state = self.rnn(decoder_input, dec_state)
            if self.attentional:
                decoder_output, p_attn = self.attn(
                    rnn_output,
                    memory_bank.transpose(0, 1),
                    memory_lengths=memory_lengths)
                attns["std"].append(p_attn)
            else:
                decoder_output = rnn_output
            if self.context_gate is not None:
                # TODO: context gate should be employed
                # instead of second RNN transform.
                decoder_output = self.context_gate(
                    decoder_input, rnn_output, decoder_output
                )
            decoder_output = self.dropout(decoder_output)
            input_feed = decoder_output

            dec_outs += [decoder_output]

            # Update the coverage attention.
            if self._coverage:
                coverage = p_attn if coverage is None else p_attn + coverage
                attns["coverage"] += [coverage]

            if self.copy_attn is not None:
                _, copy_attn = self.copy_attn(
                    decoder_output, memory_bank.transpose(0, 1))
                attns["copy"] += [copy_attn]
            elif self._reuse_copy_attn:
                attns["copy"] = attns["std"]

        return dec_state, dec_outs, attns

    def _build_rnn(self, rnn_type, input_size,
                   hidden_size, num_layers, dropout):
        assert rnn_type != "SRU", "SRU doesn't support input feed! " \
            "Please set -input_feed 0!"
        stacked_cell = StackedLSTM if rnn_type == "LSTM" else StackedGRU
        return stacked_cell(num_layers, input_size, hidden_size, dropout)

    @property
    def _input_size(self):
        """Using input feed by concatenating input with attention vectors."""
        return self.embeddings.embedding_size + self.hidden_size

    def update_dropout(self, dropout):
        self.dropout.p = dropout
        self.rnn.dropout.p = dropout
        self.embeddings.update_dropout(dropout)
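# Illustrative sketch, not part of decoder.py: the reshaping done by
# `_fix_enc_hidden` in `RNNDecoderBase.init_state` for a bidirectional
# encoder. To stay dependency-free, nested lists stand in for tensors; the
# function name `fix_enc_hidden` and the sample values are assumptions made
# for this example. It assumes the usual PyTorch layout in which slot
# `2 * layer` holds the forward state and slot `2 * layer + 1` the backward
# state of each layer.

```python
def fix_enc_hidden(hidden):
    """List-based stand-in for `_fix_enc_hidden` (bidirectional case).

    `hidden` is indexed as [layer * 2 + direction][batch][dim] and the
    result as [layer][batch][2 * dim].
    """
    forward = hidden[0::2]   # even slots: forward direction of each layer
    backward = hidden[1::2]  # odd slots: backward direction of each layer
    return [
        # List addition plays the role of torch.cat along the last dim.
        [f_vec + b_vec for f_vec, b_vec in zip(f_layer, b_layer)]
        for f_layer, b_layer in zip(forward, backward)
    ]


# 2 layers x 2 directions, batch of 1, dim 2.
hidden = [
    [[1, 1]],  # layer 0, forward
    [[2, 2]],  # layer 0, backward
    [[3, 3]],  # layer 1, forward
    [[4, 4]],  # layer 1, backward
]
merged = fix_enc_hidden(hidden)
```

# The four (layer, direction) slots collapse into two layers whose feature
# size doubles, which is the shape a unidirectional decoder RNN expects for
# its initial hidden state.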
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/decoders/ensemble.py
================================================
"""Ensemble decoding.
Decodes using multiple models simultaneously,
combining their prediction distributions by averaging.
All models in the ensemble must share a target vocabulary.
"""
import torch
import torch.nn as nn

from onmt.encoders.encoder import EncoderBase
from onmt.decoders.decoder import DecoderBase
from onmt.models import NMTModel
import onmt.model_builder


class EnsembleDecoderOutput(object):
    """Wrapper around multiple decoder final hidden states."""

    def __init__(self, model_dec_outs):
        self.model_dec_outs = tuple(model_dec_outs)

    def squeeze(self, dim=None):
        """Delegate squeeze to avoid modifying
        :func:`onmt.translate.translator.Translator.translate_batch()`
        """
        return EnsembleDecoderOutput([
            x.squeeze(dim) for x in self.model_dec_outs])

    def __getitem__(self, index):
        return self.model_dec_outs[index]


class EnsembleEncoder(EncoderBase):
    """Dummy Encoder that delegates to individual real Encoders."""

    def __init__(self, model_encoders):
        super(EnsembleEncoder, self).__init__()
        self.model_encoders = nn.ModuleList(model_encoders)

    def forward(self, src, lengths=None):
        enc_hidden, memory_bank, _ = zip(*[
            model_encoder(src, lengths)
            for model_encoder in self.model_encoders])
        return enc_hidden, memory_bank, lengths


class EnsembleDecoder(DecoderBase):
    """Dummy Decoder that delegates to individual real Decoders."""

    def __init__(self, model_decoders):
        model_decoders = nn.ModuleList(model_decoders)
        attentional = any([dec.attentional for dec in model_decoders])
        super(EnsembleDecoder, self).__init__(attentional)
        self.model_decoders = model_decoders

    def forward(self, tgt, memory_bank, memory_lengths=None, step=None,
                **kwargs):
        """See :func:`onmt.decoders.decoder.DecoderBase.forward()`."""
        # Memory_lengths is a single tensor shared between all models.
        # This assumption will not hold if Translator is modified
        # to calculate memory_lengths as something other than the length
        # of the input.
        dec_outs, attns = zip(*[
            model_decoder(
                tgt, memory_bank[i],
                memory_lengths=memory_lengths, step=step)
            for i, model_decoder in enumerate(self.model_decoders)])
        mean_attns = self.combine_attns(attns)
        return EnsembleDecoderOutput(dec_outs), mean_attns

    def combine_attns(self, attns):
        # Average each attention type over the models that produced it.
        result = {}
        for key in attns[0].keys():
            result[key] = torch.stack(
                [attn[key] for attn in attns if attn[key] is not None]).mean(0)
        return result

def init_state(self, src, memory_bank, enc_hidden):
""" See :obj:`RNNDecoderBase.init_state()` """
for i, model_decoder in enumerate(self.model_decoders):
model_decoder.init_state(src, memory_bank[i], enc_hidden[i])
def map_state(self, fn):
for model_decoder in self.model_decoders:
model_decoder.map_state(fn)
class EnsembleGenerator(nn.Module):
"""
Dummy Generator that delegates to individual real Generators,
and then averages the resulting target distributions.
"""
def __init__(self, model_generators, raw_probs=False):
super(EnsembleGenerator, self).__init__()
self.model_generators = nn.ModuleList(model_generators)
self._raw_probs = raw_probs
def forward(self, hidden, attn=None, src_map=None):
"""
Compute a distribution over the target dictionary
by averaging distributions from models in the ensemble.
All models in the ensemble must share a target vocabulary.
"""
distributions = torch.stack(
[mg(h) if attn is None else mg(h, attn, src_map)
for h, mg in zip(hidden, self.model_generators)]
)
if self._raw_probs:
return torch.log(torch.exp(distributions).mean(0))
else:
return distributions.mean(0)
class EnsembleModel(NMTModel):
"""Dummy NMTModel wrapping individual real NMTModels."""
def __init__(self, models, raw_probs=False):
encoder = EnsembleEncoder(model.encoder for model in models)
decoder = EnsembleDecoder(model.decoder for model in models)
super(EnsembleModel, self).__init__(encoder, decoder)
self.generator = EnsembleGenerator(
[model.generator for model in models], raw_probs)
self.models = nn.ModuleList(models)
def load_test_model(opt):
"""Read in multiple models for ensemble."""
shared_fields = None
shared_model_opt = None
models = []
for model_path in opt.models:
fields, model, model_opt = \
onmt.model_builder.load_test_model(opt, model_path=model_path)
if shared_fields is None:
shared_fields = fields
else:
for key, field in fields.items():
try:
f_iter = iter(field)
except TypeError:
f_iter = [(key, field)]
for sn, sf in f_iter:
if sf is not None and 'vocab' in sf.__dict__:
sh_field = shared_fields[key]
try:
sh_f_iter = iter(sh_field)
except TypeError:
sh_f_iter = [(key, sh_field)]
sh_f_dict = dict(sh_f_iter)
assert sf.vocab.stoi == sh_f_dict[sn].vocab.stoi, \
"Ensemble models must use the same " \
"preprocessed data"
models.append(model)
if shared_model_opt is None:
shared_model_opt = model_opt
ensemble_model = EnsembleModel(models, opt.avg_raw_probs)
return shared_fields, ensemble_model, shared_model_opt
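The two branches of `EnsembleGenerator.forward` differ only in the space where the mean is taken. A minimal pure-Python sketch of that difference, assuming each generator emits log-probabilities; the helper names are hypothetical and not part of OpenNMT-py:

```python
import math


def average_log_probs(dists):
    """Mean of the log-probabilities per class (the default branch)."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def average_raw_probs(dists):
    """Log of the mean probability per class (the raw_probs branch)."""
    n = len(dists)
    return [math.log(sum(math.exp(x) for x in col) / n)
            for col in zip(*dists)]


# Two toy models scoring a two-word target vocabulary.
model_a = [math.log(0.9), math.log(0.1)]
model_b = [math.log(0.5), math.log(0.5)]

log_avg = average_log_probs([model_a, model_b])
raw_avg = average_raw_probs([model_a, model_b])

# Averaging raw probabilities keeps a normalized distribution;
# averaging log-probabilities generally does not sum to one.
raw_total = sum(math.exp(x) for x in raw_avg)
log_total = sum(math.exp(x) for x in log_avg)
print(raw_total, log_total)
```

Beam search only compares scores, so the unnormalized result of the default branch can still be used for ranking.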
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/decoders/transformer.py
================================================
"""
Implementation of "Attention is All You Need"
"""
import torch
import torch.nn as nn
from onmt.decoders.decoder import DecoderBase
from onmt.modules import MultiHeadedAttention, AverageAttention
from onmt.modules.position_ffn import PositionwiseFeedForward
from onmt.utils.misc import sequence_mask
class TransformerDecoderLayer(nn.Module):
"""
Args:
d_model (int): the dimension of keys/values/queries in
:class:`MultiHeadedAttention`, also the input size of
the first-layer of the :class:`PositionwiseFeedForward`.
heads (int): the number of heads for MultiHeadedAttention.
d_ff (int): the inner (hidden) size of the :class:`PositionwiseFeedForward`.
dropout (float): dropout probability.
attention_dropout (float): dropout probability applied to attention weights.
self_attn_type (string): type of self-attention, "scaled-dot" or "average".
"""
def __init__(self, d_model, heads, d_ff, dropout, attention_dropout,
self_attn_type="scaled-dot", max_relative_positions=0,
aan_useffn=False, full_context_alignment=False,
alignment_heads=None):
super(TransformerDecoderLayer, self).__init__()
if self_attn_type == "scaled-dot":
self.self_attn = MultiHeadedAttention(
heads, d_model, dropout=attention_dropout,
max_relative_positions=max_relative_positions)
elif self_attn_type == "average":
self.self_attn = AverageAttention(d_model,
dropout=attention_dropout,
aan_useffn=aan_useffn)
self.context_attn = MultiHeadedAttention(
heads, d_model, dropout=attention_dropout)
self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
self.layer_norm_1 = nn.LayerNorm(d_model, eps=1e-6)
self.layer_norm_2 = nn.LayerNorm(d_model, eps=1e-6)
self.drop = nn.Dropout(dropout)
self.full_context_alignment = full_context_alignment
self.alignment_heads = alignment_heads
def forward(self, *args, **kwargs):
""" Extend _forward for (possibly) multiple decoder passes:
1. Always a default (future-masked) decoder forward pass,
2. Possibly a second future-aware decoder pass for jointly learning
full-context alignment.
Args:
* All arguments of _forward.
with_align (bool): whether return alignment attention.
Returns:
(FloatTensor, FloatTensor, FloatTensor or None):
* output ``(batch_size, 1, model_dim)``
* top_attn ``(batch_size, 1, src_len)``
* attn_align ``(batch_size, 1, src_len)`` or None
"""
with_align = kwargs.pop('with_align', False)
output, attns = self._forward(*args, **kwargs)
top_attn = attns[:, 0, :, :].contiguous()
attn_align = None
if with_align:
if self.full_context_alignment:
# return _, (B, Q_len, K_len)
_, attns = self._forward(*args, **kwargs, future=True)
if self.alignment_heads is not None:
attns = attns[:, :self.alignment_heads, :, :].contiguous()
# layer average attention across heads, get ``(B, Q, K)``
# Case 1: no full_context, no align heads -> layer avg baseline
# Case 2: no full_context, 1 align heads -> guided align
# Case 3: full_context, 1 align heads -> full context guided align
attn_align = attns.mean(dim=1)
return output, top_attn, attn_align
def _forward(self, inputs, memory_bank, src_pad_mask, tgt_pad_mask,
layer_cache=None, step=None, future=False):
""" A naive forward pass for transformer decoder.
# TODO: change 1 to T as T could be 1 or tgt_len
Args:
inputs (FloatTensor): ``(batch_size, 1, model_dim)``
memory_bank (FloatTensor): ``(batch_size, src_len, model_dim)``
src_pad_mask (LongTensor): ``(batch_size, 1, src_len)``
tgt_pad_mask (LongTensor): ``(batch_size, 1, 1)``
Returns:
(FloatTensor, FloatTensor):
* output ``(batch_size, 1, model_dim)``
* attns ``(batch_size, head, 1, src_len)``
"""
dec_mask = None
if step is None:
tgt_len = tgt_pad_mask.size(-1)
if not future: # apply future_mask, result mask in (B, T, T)
future_mask = torch.ones(
[tgt_len, tgt_len],
device=tgt_pad_mask.device,
dtype=torch.uint8)
future_mask = future_mask.triu_(1).view(1, tgt_len, tgt_len)
# BoolTensor was introduced in pytorch 1.2
try:
future_mask = future_mask.bool()
except AttributeError:
pass
dec_mask = torch.gt(tgt_pad_mask + future_mask, 0)
else: # only mask padding, result mask in (B, 1, T)
dec_mask = tgt_pad_mask
input_norm = self.layer_norm_1(inputs)
if isinstance(self.self_attn, MultiHeadedAttention):
query, _ = self.self_attn(input_norm, input_norm, input_norm,
mask=dec_mask,
layer_cache=layer_cache,
attn_type="self")
elif isinstance(self.self_attn, AverageAttention):
query, _ = self.self_attn(input_norm, mask=dec_mask,
layer_cache=layer_cache, step=step)
query = self.drop(query) + inputs
query_norm = self.layer_norm_2(query)
mid, attns = self.context_attn(memory_bank, memory_bank, query_norm,
mask=src_pad_mask,
layer_cache=layer_cache,
attn_type="context")
output = self.feed_forward(self.drop(mid) + query)
return output, attns
def update_dropout(self, dropout, attention_dropout):
self.self_attn.update_dropout(attention_dropout)
self.context_attn.update_dropout(attention_dropout)
self.feed_forward.update_dropout(dropout)
self.drop.p = dropout
class TransformerDecoder(DecoderBase):
"""The Transformer decoder from "Attention is All You Need".
:cite:`DBLP:journals/corr/VaswaniSPUJGKP17`
.. mermaid::
graph BT
A[input]
B[multi-head self-attn]
BB[multi-head src-attn]
C[feed forward]
O[output]
A --> B
B --> BB
BB --> C
C --> O
Args:
num_layers (int): number of decoder layers.
d_model (int): size of the model
heads (int): number of heads
d_ff (int): size of the inner FF layer
copy_attn (bool): if using a separate copy attention
self_attn_type (str): type of self-attention, "scaled-dot" or "average"
dropout (float): dropout probability
embeddings (onmt.modules.Embeddings):
embeddings to use, should have positional encodings
"""
def __init__(self, num_layers, d_model, heads, d_ff,
copy_attn, self_attn_type, dropout, attention_dropout,
embeddings, max_relative_positions, aan_useffn,
full_context_alignment, alignment_layer,
alignment_heads=None):
super(TransformerDecoder, self).__init__()
self.embeddings = embeddings
# Decoder State
self.state = {}
self.transformer_layers = nn.ModuleList(
[TransformerDecoderLayer(d_model, heads, d_ff, dropout,
attention_dropout, self_attn_type=self_attn_type,
max_relative_positions=max_relative_positions,
aan_useffn=aan_useffn,
full_context_alignment=full_context_alignment,
alignment_heads=alignment_heads)
for i in range(num_layers)])
# previously, there was a GlobalAttention module here for copy
# attention. But it was never actually used -- the "copy" attention
# just reuses the context attention.
self._copy = copy_attn
self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
self.alignment_layer = alignment_layer
@classmethod
def from_opt(cls, opt, embeddings):
"""Alternate constructor."""
return cls(
opt.dec_layers,
opt.dec_rnn_size,
opt.heads,
opt.transformer_ff,
opt.copy_attn,
opt.self_attn_type,
opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
opt.attention_dropout[0] if type(opt.attention_dropout)
is list else opt.attention_dropout,
embeddings,
opt.max_relative_positions,
opt.aan_useffn,
opt.full_context_alignment,
opt.alignment_layer,
alignment_heads=opt.alignment_heads)
def init_state(self, src, memory_bank, enc_hidden):
"""Initialize decoder state."""
self.state["src"] = src
self.state["cache"] = None
def map_state(self, fn):
def _recursive_map(struct, batch_dim=0):
for k, v in struct.items():
if v is not None:
if isinstance(v, dict):
_recursive_map(v)
else:
struct[k] = fn(v, batch_dim)
self.state["src"] = fn(self.state["src"], 1)
if self.state["cache"] is not None:
_recursive_map(self.state["cache"])
def detach_state(self):
self.state["src"] = self.state["src"].detach()
def forward(self, tgt, memory_bank, step=None, **kwargs):
"""Decode, possibly stepwise."""
if step == 0:
self._init_cache(memory_bank)
tgt_words = tgt[:, :, 0].transpose(0, 1)
emb = self.embeddings(tgt, step=step)
assert emb.dim() == 3 # len x batch x embedding_dim
output = emb.transpose(0, 1).contiguous()
src_memory_bank = memory_bank.transpose(0, 1).contiguous()
pad_idx = self.embeddings.word_padding_idx
src_lens = kwargs["memory_lengths"]
src_max_len = self.state["src"].shape[0]
src_pad_mask = ~sequence_mask(src_lens, src_max_len).unsqueeze(1)
tgt_pad_mask = tgt_words.data.eq(pad_idx).unsqueeze(1) # [B, 1, T_tgt]
with_align = kwargs.pop('with_align', False)
attn_aligns = []
for i, layer in enumerate(self.transformer_layers):
layer_cache = self.state["cache"]["layer_{}".format(i)] \
if step is not None else None
output, attn, attn_align = layer(
output,
src_memory_bank,
src_pad_mask,
tgt_pad_mask,
layer_cache=layer_cache,
step=step,
with_align=with_align)
if attn_align is not None:
attn_aligns.append(attn_align)
output = self.layer_norm(output)
dec_outs = output.transpose(0, 1).contiguous()
attn = attn.transpose(0, 1).contiguous()
attns = {"std": attn}
if self._copy:
attns["copy"] = attn
if with_align:
attns["align"] = attn_aligns[self.alignment_layer] # `(B, Q, K)`
# attns["align"] = torch.stack(attn_aligns, 0).mean(0) # All avg
# TODO change the way attns is returned dict => list or tuple (onnx)
return dec_outs, attns
def _init_cache(self, memory_bank):
self.state["cache"] = {}
batch_size = memory_bank.size(1)
depth = memory_bank.size(-1)
for i, layer in enumerate(self.transformer_layers):
layer_cache = {"memory_keys": None, "memory_values": None}
if isinstance(layer.self_attn, AverageAttention):
layer_cache["prev_g"] = torch.zeros((batch_size, 1, depth),
device=memory_bank.device)
else:
layer_cache["self_keys"] = None
layer_cache["self_values"] = None
self.state["cache"]["layer_{}".format(i)] = layer_cache
def update_dropout(self, dropout, attention_dropout):
self.embeddings.update_dropout(dropout)
for layer in self.transformer_layers:
layer.update_dropout(dropout, attention_dropout)
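The mask built in `TransformerDecoderLayer._forward` combines the target padding mask with a strictly upper-triangular future mask. A minimal sketch for a single sequence using plain lists instead of tensors; `decoder_mask` is a hypothetical helper, and `True` means the position is masked out:

```python
def decoder_mask(tgt_pad, future=False):
    """Build the self-attention mask for one target sequence.

    tgt_pad: list of bools, True where the target token is padding.
    future:  if True, only padding is masked (result is 1 x T);
             otherwise key k is also hidden from query q when k > q
             (result is T x T).
    """
    t = len(tgt_pad)
    if future:
        return [list(tgt_pad)]
    return [[tgt_pad[k] or k > q for k in range(t)] for q in range(t)]


# Three target positions, the last one is padding.
mask = decoder_mask([False, False, True])
for row in mask:
    print(row)

full_context = decoder_mask([False, False, True], future=True)
```

With `future=True` the single row is shared by every query position, which matches the `(B, 1, T)` shape noted in the code comment.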
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/__init__.py
================================================
"""Module defining encoders."""
from onmt.encoders.encoder import EncoderBase
from onmt.encoders.transformer import TransformerEncoder
from onmt.encoders.rnn_encoder import RNNEncoder
from onmt.encoders.cnn_encoder import CNNEncoder
from onmt.encoders.mean_encoder import MeanEncoder
from onmt.encoders.audio_encoder import AudioEncoder
from onmt.encoders.image_encoder import ImageEncoder
str2enc = {"rnn": RNNEncoder, "brnn": RNNEncoder, "cnn": CNNEncoder,
"transformer": TransformerEncoder, "img": ImageEncoder,
"audio": AudioEncoder, "mean": MeanEncoder}
__all__ = ["EncoderBase", "TransformerEncoder", "RNNEncoder", "CNNEncoder",
"MeanEncoder", "AudioEncoder", "ImageEncoder", "str2enc"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/audio_encoder.py
================================================
"""Audio encoder"""
import math
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack
from onmt.utils.rnn_factory import rnn_factory
from onmt.encoders.encoder import EncoderBase
class AudioEncoder(EncoderBase):
"""A simple encoder CNN -> RNN for audio input.
Args:
rnn_type (str): Type of RNN (e.g. GRU, LSTM, etc).
enc_layers (int): Number of encoder layers.
dec_layers (int): Number of decoder layers.
brnn (bool): Bidirectional encoder.
enc_rnn_size (int): Size of hidden states of the rnn.
dec_rnn_size (int): Size of the decoder hidden states.
enc_pooling (str): A comma separated list either of length 1
or of length ``enc_layers`` specifying the pooling amount.
dropout (float): dropout probability.
sample_rate (float): sample rate of the input audio.
window_size (float): spectrogram window size in seconds.
"""
def __init__(self, rnn_type, enc_layers, dec_layers, brnn,
enc_rnn_size, dec_rnn_size, enc_pooling, dropout,
sample_rate, window_size):
super(AudioEncoder, self).__init__()
self.enc_layers = enc_layers
self.rnn_type = rnn_type
self.dec_layers = dec_layers
num_directions = 2 if brnn else 1
self.num_directions = num_directions
assert enc_rnn_size % num_directions == 0
enc_rnn_size_real = enc_rnn_size // num_directions
assert dec_rnn_size % num_directions == 0
self.dec_rnn_size = dec_rnn_size
dec_rnn_size_real = dec_rnn_size // num_directions
self.dec_rnn_size_real = dec_rnn_size_real
input_size = int(math.floor((sample_rate * window_size) / 2) + 1)
enc_pooling = enc_pooling.split(',')
assert len(enc_pooling) == enc_layers or len(enc_pooling) == 1
if len(enc_pooling) == 1:
enc_pooling = enc_pooling * enc_layers
enc_pooling = [int(p) for p in enc_pooling]
self.enc_pooling = enc_pooling
if type(dropout) is not list:
dropout = [dropout]
if max(dropout) > 0:
self.dropout = nn.Dropout(dropout[0])
else:
self.dropout = None
self.W = nn.Linear(enc_rnn_size, dec_rnn_size, bias=False)
self.batchnorm_0 = nn.BatchNorm1d(enc_rnn_size, affine=True)
self.rnn_0, self.no_pack_padded_seq = \
rnn_factory(rnn_type,
input_size=input_size,
hidden_size=enc_rnn_size_real,
num_layers=1,
dropout=dropout[0],
bidirectional=brnn)
self.pool_0 = nn.MaxPool1d(enc_pooling[0])
for l in range(enc_layers - 1):
batchnorm = nn.BatchNorm1d(enc_rnn_size, affine=True)
rnn, _ = \
rnn_factory(rnn_type,
input_size=enc_rnn_size,
hidden_size=enc_rnn_size_real,
num_layers=1,
dropout=dropout[0],
bidirectional=brnn)
setattr(self, 'rnn_%d' % (l + 1), rnn)
setattr(self, 'pool_%d' % (l + 1),
nn.MaxPool1d(enc_pooling[l + 1]))
setattr(self, 'batchnorm_%d' % (l + 1), batchnorm)
@classmethod
def from_opt(cls, opt, embeddings=None):
"""Alternate constructor."""
if embeddings is not None:
raise ValueError("Cannot use embeddings with AudioEncoder.")
return cls(
opt.rnn_type,
opt.enc_layers,
opt.dec_layers,
opt.brnn,
opt.enc_rnn_size,
opt.dec_rnn_size,
opt.audio_enc_pooling,
opt.dropout,
opt.sample_rate,
opt.window_size)
def forward(self, src, lengths=None):
"""See :func:`onmt.encoders.encoder.EncoderBase.forward()`"""
batch_size, _, nfft, t = src.size()
src = src.transpose(0, 1).transpose(0, 3).contiguous() \
.view(t, batch_size, nfft)
orig_lengths = lengths
lengths = lengths.view(-1).tolist()
for l in range(self.enc_layers):
rnn = getattr(self, 'rnn_%d' % l)
pool = getattr(self, 'pool_%d' % l)
batchnorm = getattr(self, 'batchnorm_%d' % l)
stride = self.enc_pooling[l]
packed_emb = pack(src, lengths)
memory_bank, tmp = rnn(packed_emb)
memory_bank = unpack(memory_bank)[0]
t, _, _ = memory_bank.size()
memory_bank = memory_bank.transpose(0, 2)
memory_bank = pool(memory_bank)
lengths = [int(math.floor((length - stride) / stride + 1))
for length in lengths]
memory_bank = memory_bank.transpose(0, 2)
src = memory_bank
t, _, num_feat = src.size()
src = batchnorm(src.contiguous().view(-1, num_feat))
src = src.view(t, -1, num_feat)
if self.dropout and l + 1 != self.enc_layers:
src = self.dropout(src)
memory_bank = memory_bank.contiguous().view(-1, memory_bank.size(2))
memory_bank = self.W(memory_bank).view(-1, batch_size,
self.dec_rnn_size)
state = memory_bank.new_full((self.dec_layers * self.num_directions,
batch_size, self.dec_rnn_size_real), 0)
if self.rnn_type == 'LSTM':
# The encoder hidden is (layers*directions) x batch x dim.
encoder_final = (state, state)
else:
encoder_final = state
return encoder_final, memory_bank, orig_lengths.new_tensor(lengths)
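def update_dropout(self, dropout):
# self.dropout is None when the encoder was built with dropout == 0.
if self.dropout is not None:
self.dropout.p = dropout
# Update every RNN layer, including the last one (rnn_0 .. rnn_{n-1}).
for i in range(self.enc_layers):
getattr(self, 'rnn_%d' % i).dropout = dropout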
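`AudioEncoder.forward` shrinks `lengths` after every max-pooling layer with `floor((length - stride) / stride + 1)`. A small sketch of that bookkeeping on its own; `pooled_lengths` is a hypothetical helper, and the pooling factors are illustrative values:

```python
import math


def pooled_lengths(lengths, enc_pooling):
    """Apply the per-layer length update used after each MaxPool1d."""
    for stride in enc_pooling:
        lengths = [int(math.floor((n - stride) / stride + 1))
                   for n in lengths]
    return lengths


# Two utterances of 100 and 37 frames, two layers pooling by 2 each.
result = pooled_lengths([100, 37], [2, 2])
print(result)
```

For any length of at least `stride`, the update equals integer division by `stride`, so trailing frames that do not fill a pooling window are dropped.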
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/cnn_encoder.py
================================================
"""
Implementation of "Convolutional Sequence to Sequence Learning"
"""
import torch.nn as nn
from onmt.encoders.encoder import EncoderBase
from onmt.utils.cnn_factory import shape_transform, StackedCNN
SCALE_WEIGHT = 0.5 ** 0.5
class CNNEncoder(EncoderBase):
"""Encoder based on "Convolutional Sequence to Sequence Learning"
:cite:`DBLP:journals/corr/GehringAGYD17`.
"""
def __init__(self, num_layers, hidden_size,
cnn_kernel_width, dropout, embeddings):
super(CNNEncoder, self).__init__()
self.embeddings = embeddings
input_size = embeddings.embedding_size
self.linear = nn.Linear(input_size, hidden_size)
self.cnn = StackedCNN(num_layers, hidden_size,
cnn_kernel_width, dropout)
@classmethod
def from_opt(cls, opt, embeddings):
"""Alternate constructor."""
return cls(
opt.enc_layers,
opt.enc_rnn_size,
opt.cnn_kernel_width,
opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
embeddings)
def forward(self, input, lengths=None, hidden=None):
"""See :func:`onmt.encoders.encoder.EncoderBase.forward()`"""
self._check_args(input, lengths, hidden)
emb = self.embeddings(input)
# s_len, batch, emb_dim = emb.size()
emb = emb.transpose(0, 1).contiguous()
emb_reshape = emb.view(emb.size(0) * emb.size(1), -1)
emb_remap = self.linear(emb_reshape)
emb_remap = emb_remap.view(emb.size(0), emb.size(1), -1)
emb_remap = shape_transform(emb_remap)
out = self.cnn(emb_remap)
return emb_remap.squeeze(3).transpose(0, 1).contiguous(), \
out.squeeze(3).transpose(0, 1).contiguous(), lengths
def update_dropout(self, dropout):
self.cnn.dropout.p = dropout
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/encoder.py
================================================
"""Base class for encoders and generic multi encoders."""
import torch.nn as nn
from onmt.utils.misc import aeq
class EncoderBase(nn.Module):
"""
Base encoder class. Specifies the interface used by different encoder types
and required by :class:`onmt.models.NMTModel`.
.. mermaid::
graph BT
A[Input]
subgraph RNN
C[Pos 1]
D[Pos 2]
E[Pos N]
end
F[Memory_Bank]
G[Final]
A-->C
A-->D
A-->E
C-->F
D-->F
E-->F
E-->G
"""
@classmethod
def from_opt(cls, opt, embeddings=None):
raise NotImplementedError
def _check_args(self, src, lengths=None, hidden=None):
n_batch = src.size(1)
if lengths is not None:
n_batch_, = lengths.size()
aeq(n_batch, n_batch_)
def forward(self, src, lengths=None):
"""
Args:
src (LongTensor):
padded sequences of sparse indices ``(src_len, batch, nfeat)``
lengths (LongTensor): length of each sequence ``(batch,)``
Returns:
(FloatTensor, FloatTensor, LongTensor):
* final encoder state, used to initialize decoder
* memory bank for attention, ``(src_len, batch, hidden)``
* lengths of each sequence, ``(batch,)``
"""
raise NotImplementedError
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/image_encoder.py
================================================
"""Image Encoder."""
import torch.nn as nn
import torch.nn.functional as F
import torch
from onmt.encoders.encoder import EncoderBase
class ImageEncoder(EncoderBase):
"""A simple encoder CNN -> RNN for image src.
Args:
num_layers (int): number of encoder layers.
bidirectional (bool): bidirectional encoder.
rnn_size (int): size of hidden states of the rnn.
dropout (float): dropout probability.
"""
def __init__(self, num_layers, bidirectional, rnn_size, dropout,
image_chanel_size=3):
super(ImageEncoder, self).__init__()
self.num_layers = num_layers
self.num_directions = 2 if bidirectional else 1
self.hidden_size = rnn_size
self.layer1 = nn.Conv2d(image_chanel_size, 64, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.layer2 = nn.Conv2d(64, 128, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.layer3 = nn.Conv2d(128, 256, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.layer4 = nn.Conv2d(256, 256, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.layer5 = nn.Conv2d(256, 512, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.layer6 = nn.Conv2d(512, 512, kernel_size=(3, 3),
padding=(1, 1), stride=(1, 1))
self.batch_norm1 = nn.BatchNorm2d(256)
self.batch_norm2 = nn.BatchNorm2d(512)
self.batch_norm3 = nn.BatchNorm2d(512)
src_size = 512
dropout = dropout[0] if type(dropout) is list else dropout
self.rnn = nn.LSTM(src_size, int(rnn_size / self.num_directions),
num_layers=num_layers,
dropout=dropout,
bidirectional=bidirectional)
self.pos_lut = nn.Embedding(1000, src_size)
@classmethod
def from_opt(cls, opt, embeddings=None):
"""Alternate constructor."""
if embeddings is not None:
raise ValueError("Cannot use embeddings with ImageEncoder.")
# why is the model_opt.__dict__ check necessary?
if "image_channel_size" not in opt.__dict__:
image_channel_size = 3
else:
image_channel_size = opt.image_channel_size
return cls(
opt.enc_layers,
opt.brnn,
opt.enc_rnn_size,
opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
image_channel_size
)
def load_pretrained_vectors(self, opt):
"""Pass in needed options only when modifying the function definition."""
pass
def forward(self, src, lengths=None):
"""See :func:`onmt.encoders.encoder.EncoderBase.forward()`"""
batch_size = src.size(0)
# (batch_size, 64, imgH, imgW)
# layer 1
src = F.relu(self.layer1(src[:, :, :, :] - 0.5), True)
# (batch_size, 64, imgH/2, imgW/2)
src = F.max_pool2d(src, kernel_size=(2, 2), stride=(2, 2))
# (batch_size, 128, imgH/2, imgW/2)
# layer 2
src = F.relu(self.layer2(src), True)
# (batch_size, 128, imgH/2/2, imgW/2/2)
src = F.max_pool2d(src, kernel_size=(2, 2), stride=(2, 2))
# (batch_size, 256, imgH/2/2, imgW/2/2)
# layer 3
# batch norm 1
src = F.relu(self.batch_norm1(self.layer3(src)), True)
# (batch_size, 256, imgH/2/2, imgW/2/2)
# layer4
src = F.relu(self.layer4(src), True)
# (batch_size, 256, imgH/2/2, imgW/2/2/2)
src = F.max_pool2d(src, kernel_size=(1, 2), stride=(1, 2))
# (batch_size, 512, imgH/2/2, imgW/2/2/2)
# layer 5
# batch norm 2
src = F.relu(self.batch_norm2(self.layer5(src)), True)
# (batch_size, 512, imgH/2/2/2, imgW/2/2/2)
src = F.max_pool2d(src, kernel_size=(2, 1), stride=(2, 1))
# (batch_size, 512, imgH/2/2/2, imgW/2/2/2)
src = F.relu(self.batch_norm3(self.layer6(src)), True)
# (batch_size, 512, H, W)
all_outputs = []
for row in range(src.size(2)):
inp = src[:, :, row, :].transpose(0, 2) \
.transpose(1, 2)
row_vec = torch.full((batch_size,), row, dtype=torch.long,
device=inp.device)
pos_emb = self.pos_lut(row_vec)
with_pos = torch.cat(
(pos_emb.view(1, pos_emb.size(0), pos_emb.size(1)), inp), 0)
outputs, hidden_t = self.rnn(with_pos)
all_outputs.append(outputs)
out = torch.cat(all_outputs, 0)
return hidden_t, out, lengths
def update_dropout(self, dropout):
self.rnn.dropout = dropout
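The shape comments in `ImageEncoder.forward` can be checked with simple arithmetic: the 3x3 convolutions use padding 1 and keep the spatial size, so only the four max-pooling steps change it. A sketch under the assumption that the image height and width are divisible by the pooling factors; `encoder_grid` is a hypothetical helper:

```python
def encoder_grid(img_h, img_w):
    """Return (rows, cols, memory_len) produced by the CNN stack."""
    # (kernel_h, kernel_w) of each max_pool2d call, in order.
    pools = [(2, 2), (2, 2), (1, 2), (2, 1)]
    h, w = img_h, img_w
    for k_h, k_w in pools:
        h, w = h // k_h, w // k_w
    # Each row is fed to the RNN with one positional embedding prepended,
    # so a row contributes w + 1 steps to the memory bank.
    memory_len = h * (w + 1)
    return h, w, memory_len


grid = encoder_grid(32, 128)
print(grid)
```

Both height and width end up reduced by a factor of 8, and the memory bank is longer than `rows * cols` because of the per-row position token.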
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/mean_encoder.py
================================================
"""Define a minimal encoder."""
from onmt.encoders.encoder import EncoderBase
from onmt.utils.misc import sequence_mask
import torch
class MeanEncoder(EncoderBase):
"""A trivial non-recurrent encoder. Simply applies mean pooling.
Args:
num_layers (int): number of replicated layers
embeddings (onmt.modules.Embeddings): embedding module to use
"""
def __init__(self, num_layers, embeddings):
super(MeanEncoder, self).__init__()
self.num_layers = num_layers
self.embeddings = embeddings
@classmethod
def from_opt(cls, opt, embeddings):
"""Alternate constructor."""
return cls(
opt.enc_layers,
embeddings)
def forward(self, src, lengths=None):
"""See :func:`EncoderBase.forward()`"""
self._check_args(src, lengths)
emb = self.embeddings(src)
_, batch, emb_dim = emb.size()
if lengths is not None:
# we avoid padding while mean pooling
mask = sequence_mask(lengths).float()
mask = mask / lengths.unsqueeze(1).float()
mean = torch.bmm(mask.unsqueeze(1), emb.transpose(0, 1)).squeeze(1)
else:
mean = emb.mean(0)
mean = mean.expand(self.num_layers, batch, emb_dim)
memory_bank = emb
encoder_final = (mean, mean)
return encoder_final, memory_bank, lengths
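The masked branch of `MeanEncoder.forward` averages only over the valid (non-padded) time steps of each sequence. A pure-Python sketch of the same computation with nested lists in place of tensors; `masked_mean` is a hypothetical helper:

```python
def masked_mean(emb, lengths):
    """Mean-pool embeddings over valid time steps only.

    emb:     nested list shaped [src_len][batch][dim]
    lengths: number of valid time steps for each batch item
    """
    dim = len(emb[0][0])
    means = []
    for b, n in enumerate(lengths):
        means.append([sum(emb[t][b][d] for t in range(n)) / n
                      for d in range(dim)])
    return means


# src_len = 2, batch = 2, dim = 2; the second sequence has one padded step.
emb = [
    [[1.0, 2.0], [4.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
]
means = masked_mean(emb, [2, 1])
print(means)
```

A plain mean over the time axis would halve the second sequence's vector, since its padded step would count toward the denominator.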
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/rnn_encoder.py
================================================
"""Define RNN-based encoders."""
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack
from onmt.encoders.encoder import EncoderBase
from onmt.utils.rnn_factory import rnn_factory
class RNNEncoder(EncoderBase):
""" A generic recurrent neural network encoder.
Args:
rnn_type (str):
style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
bidirectional (bool) : use a bidirectional RNN
num_layers (int) : number of stacked layers
hidden_size (int) : hidden size of each layer
dropout (float) : dropout value for :class:`torch.nn.Dropout`
embeddings (onmt.modules.Embeddings): embedding module to use
"""
def __init__(self, rnn_type, bidirectional, num_layers,
hidden_size, dropout=0.0, embeddings=None,
use_bridge=False):
super(RNNEncoder, self).__init__()
assert embeddings is not None
num_directions = 2 if bidirectional else 1
assert hidden_size % num_directions == 0
hidden_size = hidden_size // num_directions
self.embeddings = embeddings
self.rnn, self.no_pack_padded_seq = \
rnn_factory(rnn_type,
input_size=embeddings.embedding_size,
hidden_size=hidden_size,
num_layers=num_layers,
dropout=dropout,
bidirectional=bidirectional)
# Initialize the bridge layer
self.use_bridge = use_bridge
if self.use_bridge:
self._initialize_bridge(rnn_type,
hidden_size,
num_layers)
@classmethod
def from_opt(cls, opt, embeddings):
"""Alternate constructor."""
return cls(
opt.rnn_type,
opt.brnn,
opt.enc_layers,
opt.enc_rnn_size,
opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
embeddings,
opt.bridge)
def forward(self, src, lengths=None):
"""See :func:`EncoderBase.forward()`"""
self._check_args(src, lengths)
emb = self.embeddings(src)
# s_len, batch, emb_dim = emb.size()
packed_emb = emb
if lengths is not None and not self.no_pack_padded_seq:
# Lengths data is wrapped inside a Tensor.
lengths_list = lengths.view(-1).tolist()
packed_emb = pack(emb, lengths_list)
memory_bank, encoder_final = self.rnn(packed_emb)
if lengths is not None and not self.no_pack_padded_seq:
memory_bank = unpack(memory_bank)[0]
if self.use_bridge:
encoder_final = self._bridge(encoder_final)
return encoder_final, memory_bank, lengths
def _initialize_bridge(self, rnn_type,
hidden_size,
num_layers):
# LSTM has hidden and cell states; other RNN types have only a hidden state
number_of_states = 2 if rnn_type == "LSTM" else 1
# Total number of states
self.total_hidden_dim = hidden_size * num_layers
# Build a linear layer for each
self.bridge = nn.ModuleList([nn.Linear(self.total_hidden_dim,
self.total_hidden_dim,
bias=True)
for _ in range(number_of_states)])
def _bridge(self, hidden):
"""Forward hidden state through bridge."""
def bottle_hidden(linear, states):
"""
Transform from 3D to 2D, apply linear and return initial size
"""
size = states.size()
result = linear(states.view(-1, self.total_hidden_dim))
return F.relu(result).view(size)
if isinstance(hidden, tuple): # LSTM
outs = tuple([bottle_hidden(layer, hidden[ix])
for ix, layer in enumerate(self.bridge)])
else:
outs = bottle_hidden(self.bridge[0], hidden)
return outs
def update_dropout(self, dropout):
self.rnn.dropout = dropout
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/encoders/transformer.py
================================================
"""
Implementation of "Attention is All You Need"
"""
import torch.nn as nn
from onmt.encoders.encoder import EncoderBase
from onmt.modules import MultiHeadedAttention
from onmt.modules.position_ffn import PositionwiseFeedForward
from onmt.utils.misc import sequence_mask
class TransformerEncoderLayer(nn.Module):
"""
A single layer of the transformer encoder.
Args:
d_model (int): the dimension of keys/values/queries in
MultiHeadedAttention, also the input size of
the first-layer of the PositionwiseFeedForward.
heads (int): the number of heads for MultiHeadedAttention.
d_ff (int): the inner (hidden) size of the PositionwiseFeedForward.
dropout (float): dropout probability (0-1.0).
attention_dropout (float): dropout probability applied to attention weights.
"""
def __init__(self, d_model, heads, d_ff, dropout, attention_dropout,
max_relative_positions=0):
super(TransformerEncoderLayer, self).__init__()
self.self_attn = MultiHeadedAttention(
heads, d_model, dropout=attention_dropout,
max_relative_positions=max_relative_positions)
self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
self.dropout = nn.Dropout(dropout)
def forward(self, inputs, mask):
"""
Args:
inputs (FloatTensor): ``(batch_size, src_len, model_dim)``
mask (LongTensor): ``(batch_size, 1, src_len)``
Returns:
(FloatTensor):
* outputs ``(batch_size, src_len, model_dim)``
"""
input_norm = self.layer_norm(inputs)
context, _ = self.self_attn(input_norm, input_norm, input_norm,
mask=mask, attn_type="self")
out = self.dropout(context) + inputs
return self.feed_forward(out)
def update_dropout(self, dropout, attention_dropout):
self.self_attn.update_dropout(attention_dropout)
self.feed_forward.update_dropout(dropout)
self.dropout.p = dropout
class TransformerEncoder(EncoderBase):
"""The Transformer encoder from "Attention is All You Need"
:cite:`DBLP:journals/corr/VaswaniSPUJGKP17`
.. mermaid::
graph BT
A[input]
B[multi-head self-attn]
C[feed forward]
O[output]
A --> B
B --> C
C --> O
Args:
num_layers (int): number of encoder layers
d_model (int): size of the model
heads (int): number of heads
d_ff (int): size of the inner FF layer
dropout (float): dropout parameters
embeddings (onmt.modules.Embeddings):
embeddings to use, should have positional encodings
Returns:
(torch.FloatTensor, torch.FloatTensor):
* embeddings ``(src_len, batch_size, model_dim)``
* memory_bank ``(src_len, batch_size, model_dim)``
"""
def __init__(self, num_layers, d_model, heads, d_ff, dropout,
attention_dropout, embeddings, max_relative_positions):
super(TransformerEncoder, self).__init__()
self.embeddings = embeddings
self.transformer = nn.ModuleList(
[TransformerEncoderLayer(
d_model, heads, d_ff, dropout, attention_dropout,
max_relative_positions=max_relative_positions)
for i in range(num_layers)])
self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
@classmethod
def from_opt(cls, opt, embeddings):
"""Alternate constructor."""
return cls(
opt.enc_layers,
opt.enc_rnn_size,
opt.heads,
opt.transformer_ff,
opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
opt.attention_dropout[0] if type(opt.attention_dropout)
is list else opt.attention_dropout,
embeddings,
opt.max_relative_positions)
def forward(self, src, lengths=None):
"""See :func:`EncoderBase.forward()`"""
self._check_args(src, lengths)
emb = self.embeddings(src)
out = emb.transpose(0, 1).contiguous()
mask = ~sequence_mask(lengths).unsqueeze(1)
# Run the forward pass of every layer of the transformer.
for layer in self.transformer:
out = layer(out, mask)
out = self.layer_norm(out)
return emb, out.transpose(0, 1).contiguous(), lengths
def update_dropout(self, dropout, attention_dropout):
self.embeddings.update_dropout(dropout)
for layer in self.transformer:
layer.update_dropout(dropout, attention_dropout)
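
The padding mask built in `TransformerEncoder.forward` can be pictured with a small standalone sketch. It uses plain Python lists instead of tensors and assumes that `sequence_mask` marks valid (non-padded) positions as true, which is what the `~` negation suggests. The helper name `padding_mask` is hypothetical and not part of OpenNMT-py.

```python
def padding_mask(lengths, max_len=None):
    """Return one row per sequence; True marks a padded position."""
    if max_len is None:
        max_len = max(lengths)
    # Position `pos` is padding when it lies at or beyond the true length.
    return [[pos >= n for pos in range(max_len)] for n in lengths]


# Two sequences of lengths 3 and 1, padded to length 3.
mask = padding_mask([3, 1])
```

In the real encoder the equivalent tensor gets an extra dimension (`unsqueeze(1)`) so that it can broadcast over the query positions of each attention layer.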
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/__init__.py
================================================
"""Module defining inputters.
Inputters implement the logic of transforming raw data to vectorized inputs,
e.g., from a line of text to a sequence of embeddings.
"""
from onmt.inputters.inputter import \
load_old_vocab, get_fields, OrderedIterator, \
build_vocab, old_style_vocab, filter_example
from onmt.inputters.dataset_base import Dataset
from onmt.inputters.text_dataset import text_sort_key, TextDataReader
from onmt.inputters.image_dataset import img_sort_key, ImageDataReader
from onmt.inputters.audio_dataset import audio_sort_key, AudioDataReader
from onmt.inputters.vec_dataset import vec_sort_key, VecDataReader
from onmt.inputters.datareader_base import DataReaderBase
str2reader = {
"text": TextDataReader, "img": ImageDataReader, "audio": AudioDataReader,
"vec": VecDataReader}
str2sortkey = {
'text': text_sort_key, 'img': img_sort_key, 'audio': audio_sort_key,
'vec': vec_sort_key}
__all__ = ['Dataset', 'load_old_vocab', 'get_fields', 'DataReaderBase',
'filter_example', 'old_style_vocab',
'build_vocab', 'OrderedIterator',
'text_sort_key', 'img_sort_key', 'audio_sort_key', 'vec_sort_key',
'TextDataReader', 'ImageDataReader', 'AudioDataReader',
'VecDataReader']
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/audio_dataset.py
================================================
# -*- coding: utf-8 -*-
import os
from tqdm import tqdm
import torch
from torchtext.data import Field
from onmt.inputters.datareader_base import DataReaderBase
# imports of datatype-specific dependencies
try:
import torchaudio
import librosa
import numpy as np
except ImportError:
torchaudio, librosa, np = None, None, None
class AudioDataReader(DataReaderBase):
"""Read audio data from disk.
Args:
sample_rate (int): expected sample rate of the audio files.
window_size (float) : window size for spectrogram in seconds.
window_stride (float): window stride for spectrogram in seconds.
window (str): window type for spectrogram generation. See
:func:`librosa.stft()` ``window`` for more details.
normalize_audio (bool): subtract spectrogram by mean and divide
by std or not.
truncate (int or NoneType): maximum audio length
(0 or None for unlimited).
Raises:
onmt.inputters.datareader_base.MissingDependencyException: If
importing any of ``torchaudio``, ``librosa``, or ``numpy`` fails.
"""
def __init__(self, sample_rate=0, window_size=0, window_stride=0,
window=None, normalize_audio=True, truncate=None):
self._check_deps()
self.sample_rate = sample_rate
self.window_size = window_size
self.window_stride = window_stride
self.window = window
self.normalize_audio = normalize_audio
self.truncate = truncate
@classmethod
def from_opt(cls, opt):
return cls(sample_rate=opt.sample_rate, window_size=opt.window_size,
window_stride=opt.window_stride, window=opt.window)
@classmethod
def _check_deps(cls):
if any([torchaudio is None, librosa is None, np is None]):
cls._raise_missing_dep(
"torchaudio", "librosa", "numpy")
def extract_features(self, audio_path):
# torchaudio loading options recently changed. It's probably
# straightforward to rewrite the audio handling to make use of
# up-to-date torchaudio, but in the meantime there is a legacy
# method which uses the old defaults
sound, sample_rate_ = torchaudio.legacy.load(audio_path)
if self.truncate and self.truncate > 0:
if sound.size(0) > self.truncate:
sound = sound[:self.truncate]
assert sample_rate_ == self.sample_rate, \
'Sample rate of %s != -sample_rate (%d vs %d)' \
% (audio_path, sample_rate_, self.sample_rate)
sound = sound.numpy()
if len(sound.shape) > 1:
if sound.shape[1] == 1:
sound = sound.squeeze()
else:
sound = sound.mean(axis=1) # average multiple channels
n_fft = int(self.sample_rate * self.window_size)
win_length = n_fft
hop_length = int(self.sample_rate * self.window_stride)
# STFT
d = librosa.stft(sound, n_fft=n_fft, hop_length=hop_length,
win_length=win_length, window=self.window)
spect, _ = librosa.magphase(d)
spect = np.log1p(spect)
spect = torch.FloatTensor(spect)
if self.normalize_audio:
mean = spect.mean()
std = spect.std()
spect.add_(-mean)
spect.div_(std)
return spect
def read(self, data, side, src_dir=None):
"""Read data into dicts.
Args:
data (str or Iterable[str]): Sequence of audio paths or
path to file containing audio paths.
In either case, the filenames may be relative to ``src_dir``
(default behavior) or absolute.
side (str): Prefix used in return dict. Usually
``"src"`` or ``"tgt"``.
src_dir (str): Location of source audio files. See ``data``.
Yields:
A dictionary containing audio data for each line.
"""
assert src_dir is not None and os.path.exists(src_dir),\
"src_dir must be a valid directory if data_type is audio"
if isinstance(data, str):
data = DataReaderBase._read_file(data)
for i, line in enumerate(tqdm(data)):
line = line.decode("utf-8").strip()
audio_path = os.path.join(src_dir, line)
if not os.path.exists(audio_path):
audio_path = line
assert os.path.exists(audio_path), \
'audio path %s not found' % line
spect = self.extract_features(audio_path)
yield {side: spect, side + '_path': line, 'indices': i}
def audio_sort_key(ex):
"""Sort using duration time of the sound spectrogram."""
return ex.src.size(1)
class AudioSeqField(Field):
"""Defines an audio datatype and instructions for converting to Tensor.
See :class:`Fields` for attribute descriptions.
"""
def __init__(self, preprocessing=None, postprocessing=None,
include_lengths=False, batch_first=False, pad_index=0,
is_target=False):
super(AudioSeqField, self).__init__(
sequential=True, use_vocab=False, init_token=None,
eos_token=None, fix_length=False, dtype=torch.float,
preprocessing=preprocessing, postprocessing=postprocessing,
lower=False, tokenize=None, include_lengths=include_lengths,
batch_first=batch_first, pad_token=pad_index, unk_token=None,
pad_first=False, truncate_first=False, stop_words=None,
is_target=is_target
)
def pad(self, minibatch):
"""Pad a batch of examples to the length of the longest example.
Args:
minibatch (List[torch.FloatTensor]): A list of audio data,
each having shape ``(n_feats, len)`` where ``len`` is variable.
Returns:
torch.FloatTensor or Tuple[torch.FloatTensor, List[int]]: The
padded tensor of shape ``(batch_size, 1, n_feats, max_len)``,
plus a list of the lengths if ``self.include_lengths`` is ``True``;
otherwise just the padded tensor.
"""
assert not self.pad_first and not self.truncate_first \
and not self.fix_length and self.sequential
minibatch = list(minibatch)
lengths = [x.size(1) for x in minibatch]
max_len = max(lengths)
nfft = minibatch[0].size(0)
sounds = torch.full((len(minibatch), 1, nfft, max_len), self.pad_token)
for i, (spect, len_) in enumerate(zip(minibatch, lengths)):
sounds[i, :, :, 0:len_] = spect
if self.include_lengths:
return (sounds, lengths)
return sounds
def numericalize(self, arr, device=None):
"""Turn a batch of examples that use this field into a Tensor.
If the field has ``include_lengths=True``, a tensor of lengths will be
included in the return value.
Args:
arr (torch.FloatTensor or Tuple(torch.FloatTensor, List[int])):
List of tokenized and padded examples, or tuple of List of
tokenized and padded examples and List of lengths of each
example if self.include_lengths is True. Examples have shape
``(batch_size, 1, n_feats, max_len)`` if `self.batch_first`
else ``(max_len, batch_size, 1, n_feats)``.
device (str or torch.device): See `Field.numericalize`.
"""
assert self.use_vocab is False
if self.include_lengths and not isinstance(arr, tuple):
raise ValueError("Field has include_lengths set to True, but "
"input data is not a tuple of "
"(data batch, batch lengths).")
if isinstance(arr, tuple):
arr, lengths = arr
lengths = torch.tensor(lengths, dtype=torch.int, device=device)
if self.postprocessing is not None:
arr = self.postprocessing(arr, None)
if self.sequential and not self.batch_first:
arr = arr.permute(3, 0, 1, 2)
if self.sequential:
arr = arr.contiguous()
arr = arr.to(device)
if self.include_lengths:
return arr, lengths
return arr
def audio_fields(**kwargs):
audio = AudioSeqField(pad_index=0, batch_first=True, include_lengths=True)
return audio
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/datareader_base.py
================================================
# coding: utf-8
# several data readers need optional dependencies. There's no
# appropriate builtin exception
class MissingDependencyException(Exception):
pass
class DataReaderBase(object):
"""Read data from file system and yield as dicts.
Raises:
onmt.inputters.datareader_base.MissingDependencyException: A number
of DataReaders need specific additional packages.
If any are missing, this will be raised.
"""
@classmethod
def from_opt(cls, opt):
"""Alternative constructor.
Args:
opt (argparse.Namespace): The parsed arguments.
"""
return cls()
@classmethod
def _read_file(cls, path):
"""Line-by-line read a file as bytes."""
with open(path, "rb") as f:
for line in f:
yield line
@staticmethod
def _raise_missing_dep(*missing_deps):
"""Raise missing dep exception with standard error message."""
raise MissingDependencyException(
"Could not create reader. Be sure to install "
"the following dependencies: " + ", ".join(missing_deps))
def read(self, data, side, src_dir):
"""Read data from file system and yield as dicts."""
raise NotImplementedError()
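
As a rough illustration of the reader contract above (accept a path or an iterable of byte strings, then yield one dict per line), here is a standalone sketch that does not import `onmt`. The class `ToyLineReader` and the helper `demo` are hypothetical stand-ins written for this example only.

```python
import os
import tempfile


class ToyLineReader:
    """Hypothetical reader that yields one dict per input line."""

    def read(self, data, side, src_dir=None):
        # Accept either a path to a file or an iterable of byte strings.
        if isinstance(data, str):
            with open(data, "rb") as f:
                data = f.readlines()
        for i, line in enumerate(data):
            # Same key layout as the readers in this package:
            # the side name plus the running example index.
            yield {side: line.decode("utf-8").strip(), "indices": i}


def demo():
    """Write a two-line file and read it back through the toy reader."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "src.txt")
        with open(path, "wb") as f:
            f.write(b"hello world\nsecond line\n")
        return list(ToyLineReader().read(path, "src"))


examples = demo()
```

Yielding dicts keyed by the side name is what lets the dataset merge the output of several readers into a single example.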
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/dataset_base.py
================================================
# coding: utf-8
from itertools import chain, starmap
from collections import Counter
import torch
from torchtext.data import Dataset as TorchtextDataset
from torchtext.data import Example
from torchtext.vocab import Vocab
def _join_dicts(*args):
"""
Args:
dictionaries with disjoint keys.
Returns:
a single dictionary that has the union of these keys.
"""
return dict(chain(*[d.items() for d in args]))
def _dynamic_dict(example, src_field, tgt_field):
"""Create copy-vocab and numericalize with it.
In-place adds ``"src_map"`` to ``example``. That is the copy-vocab
numericalization of the tokenized ``example["src"]``. If ``example``
has a ``"tgt"`` key, adds ``"alignment"`` to example. That is the
copy-vocab numericalization of the tokenized ``example["tgt"]``. The
alignment has an initial and final UNK token to match the BOS and EOS
tokens.
Args:
example (dict): An example dictionary with a ``"src"`` key and
maybe a ``"tgt"`` key. (This argument changes in place!)
src_field (torchtext.data.Field): Field object.
tgt_field (torchtext.data.Field): Field object.
Returns:
torchtext.vocab.Vocab and ``example``, changed as described.
"""
src = src_field.tokenize(example["src"])
# make a small vocab containing just the tokens in the source sequence
unk = src_field.unk_token
pad = src_field.pad_token
src_ex_vocab = Vocab(Counter(src), specials=[unk, pad])
unk_idx = src_ex_vocab.stoi[unk]
# Map source tokens to indices in the dynamic dict.
src_map = torch.LongTensor([src_ex_vocab.stoi[w] for w in src])
example["src_map"] = src_map
example["src_ex_vocab"] = src_ex_vocab
if "tgt" in example:
tgt = tgt_field.tokenize(example["tgt"])
mask = torch.LongTensor(
[unk_idx] + [src_ex_vocab.stoi[w] for w in tgt] + [unk_idx])
example["alignment"] = mask
return src_ex_vocab, example
class Dataset(TorchtextDataset):
"""Contain data and process it.
A dataset is an object that accepts sequences of raw data (sentence pairs
in the case of machine translation) and fields which describe how this
raw data should be processed to produce tensors. When a dataset is
instantiated, it applies the fields' preprocessing pipeline (but not
the bit that numericalizes it or turns it into batch tensors) to the raw
data, producing a list of :class:`torchtext.data.Example` objects.
torchtext's iterators then know how to use these examples to make batches.
Args:
fields (dict[str, Field]): a dict with the structure
returned by :func:`onmt.inputters.get_fields()`. Usually
that means the dataset side, ``"src"`` or ``"tgt"``. Keys match
the keys of items yielded by the ``readers``, while values
are lists of (name, Field) pairs. An attribute with this
name will be created for each :class:`torchtext.data.Example`
object and its value will be the result of applying the Field
to the data that matches the key. The advantage of having
sequences of fields for each piece of raw input is that it allows
the dataset to store multiple "views" of each input, which allows
for easy implementation of token-level features, mixed word-
and character-level models, and so on. (See also
:class:`onmt.inputters.TextMultiField`.)
readers (Iterable[onmt.inputters.DataReaderBase]): Reader objects
for disk-to-dict. The yielded dicts are then processed
according to ``fields``.
data (Iterable[Tuple[str, Any]]): (name, ``data_arg``) pairs
where ``data_arg`` is passed to the ``read()`` method of the
reader in ``readers`` at that position. (See the reader object for
details on the ``Any`` type.)
dirs (Iterable[str or NoneType]): A list of directories where
data is contained. See the reader object for more details.
sort_key (Callable[[torchtext.data.Example], Any]): A function
for determining the value on which data is sorted (i.e. length).
filter_pred (Callable[[torchtext.data.Example], bool]): A function
that accepts Example objects and returns a boolean value
indicating whether to include that example in the dataset.
Attributes:
src_vocabs (List[torchtext.vocab.Vocab]): Used with dynamic dict/copy
attention. There is a very short vocab for each src example.
It contains just the source words, e.g. so that the generator can
predict to copy them.
"""
def __init__(self, fields, readers, data, dirs, sort_key,
filter_pred=None):
self.sort_key = sort_key
can_copy = 'src_map' in fields and 'alignment' in fields
read_iters = [r.read(dat[1], dat[0], dir_) for r, dat, dir_
in zip(readers, data, dirs)]
# self.src_vocabs is used in collapse_copy_scores and Translator.py
self.src_vocabs = []
examples = []
for ex_dict in starmap(_join_dicts, zip(*read_iters)):
if can_copy:
src_field = fields['src']
tgt_field = fields['tgt']
# this assumes src_field and tgt_field are both text
src_ex_vocab, ex_dict = _dynamic_dict(
ex_dict, src_field.base_field, tgt_field.base_field)
self.src_vocabs.append(src_ex_vocab)
ex_fields = {k: [(k, v)] for k, v in fields.items() if
k in ex_dict}
ex = Example.fromdict(ex_dict, ex_fields)
examples.append(ex)
# fields needs to have only keys that examples have as attrs
fields = []
for _, nf_list in ex_fields.items():
assert len(nf_list) == 1
fields.append(nf_list[0])
super(Dataset, self).__init__(examples, fields, filter_pred)
def __getattr__(self, attr):
# avoid infinite recursion when fields isn't defined
if 'fields' not in vars(self):
raise AttributeError
if attr in self.fields:
return (getattr(x, attr) for x in self.examples)
else:
raise AttributeError
def save(self, path, remove_fields=True):
if remove_fields:
self.fields = []
torch.save(self, path)
@staticmethod
def config(fields):
readers, data, dirs = [], [], []
for name, field in fields:
if field["data"] is not None:
readers.append(field["reader"])
data.append((name, field["data"]))
dirs.append(field["dir"])
return readers, data, dirs
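
The two ideas in this file, merging per-side reader dicts and building a per-example copy vocabulary, can be sketched without torchtext. The names `join_dicts` and `toy_copy_vocab`, the special tokens, and the ordering of the toy vocabulary are assumptions made for this illustration; the real code relies on `torchtext.vocab.Vocab`.

```python
from collections import Counter
from itertools import chain, starmap


def join_dicts(*dicts):
    """Merge dictionaries into one (same idea as _join_dicts)."""
    return dict(chain(*[d.items() for d in dicts]))


def toy_copy_vocab(src_tokens, tgt_tokens, unk="<unk>", pad="<pad>"):
    """Build a per-example vocab and index src/tgt tokens against it."""
    counts = Counter(src_tokens)
    # Assumed ordering: specials first, then source tokens by
    # descending frequency with ties broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    itos = [unk, pad] + ordered
    stoi = {w: i for i, w in enumerate(itos)}
    unk_idx = stoi[unk]
    src_map = [stoi[w] for w in src_tokens]
    # Target tokens missing from the source map to unk. The extra unk
    # at each end lines up with the BOS and EOS positions.
    alignment = ([unk_idx]
                 + [stoi.get(w, unk_idx) for w in tgt_tokens]
                 + [unk_idx])
    return itos, src_map, alignment


# One dict per side and per example, as the readers would yield them.
src_side = [{"src": "the cat sat", "indices": 0}]
tgt_side = [{"tgt": "le chat"}]
merged = list(starmap(join_dicts, zip(src_side, tgt_side)))

itos, src_map, alignment = toy_copy_vocab(["a", "b", "a"], ["b", "c"])
```

Here `"c"` does not occur in the source, so it maps to the unk index in `alignment`, which signals that the token cannot be copied.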
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/image_dataset.py
================================================
# -*- coding: utf-8 -*-
import os
import torch
from torchtext.data import Field
from onmt.inputters.datareader_base import DataReaderBase
# domain specific dependencies
try:
from PIL import Image
from torchvision import transforms
import cv2
except ImportError:
Image, transforms, cv2 = None, None, None
class ImageDataReader(DataReaderBase):
"""Read image data from disk.
Args:
truncate (tuple[int] or NoneType): maximum img size. Use
``(0,0)`` or ``None`` for unlimited.
channel_size (int): Number of channels per image.
Raises:
onmt.inputters.datareader_base.MissingDependencyException: If
importing any of ``PIL``, ``torchvision``, or ``cv2`` fails.
"""
def __init__(self, truncate=None, channel_size=3):
self._check_deps()
self.truncate = truncate
self.channel_size = channel_size
@classmethod
def from_opt(cls, opt):
return cls(channel_size=opt.image_channel_size)
@classmethod
def _check_deps(cls):
if any([Image is None, transforms is None, cv2 is None]):
cls._raise_missing_dep(
"PIL", "torchvision", "cv2")
def read(self, images, side, img_dir=None):
"""Read data into dicts.
Args:
images (str or Iterable[str]): Sequence of image paths or
path to file containing image paths.
In either case, the filenames may be relative to ``img_dir``
(default behavior) or absolute.
side (str): Prefix used in return dict. Usually
``"src"`` or ``"tgt"``.
img_dir (str): Location of source image files. See ``images``.
Yields:
a dictionary containing image data, path and index for each line.
"""
if isinstance(images, str):
images = DataReaderBase._read_file(images)
for i, filename in enumerate(images):
filename = filename.decode("utf-8").strip()
img_path = os.path.join(img_dir, filename)
if not os.path.exists(img_path):
img_path = filename
assert os.path.exists(img_path), \
'img path %s not found' % filename
if self.channel_size == 1:
img = transforms.ToTensor()(
Image.fromarray(cv2.imread(img_path, 0)))
else:
img = transforms.ToTensor()(Image.open(img_path))
if self.truncate and self.truncate != (0, 0):
if not (img.size(1) <= self.truncate[0]
and img.size(2) <= self.truncate[1]):
continue
yield {side: img, side + '_path': filename, 'indices': i}
def img_sort_key(ex):
"""Sort using the size of the image: (width, height)."""
return ex.src.size(2), ex.src.size(1)
def batch_img(data, vocab):
"""Pad and batch a sequence of images."""
c = data[0].size(0)
h = max([t.size(1) for t in data])
w = max([t.size(2) for t in data])
imgs = torch.zeros(len(data), c, h, w).fill_(1)
for i, img in enumerate(data):
imgs[i, :, 0:img.size(1), 0:img.size(2)] = img
return imgs
def image_fields(**kwargs):
img = Field(
use_vocab=False, dtype=torch.float,
postprocessing=batch_img, sequential=False)
return img
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/inputter.py
================================================
# -*- coding: utf-8 -*-
import glob
import os
import codecs
import math
from collections import Counter, defaultdict
from itertools import chain, cycle
import torch
import torchtext.data
from torchtext.data import Field, RawField, LabelField
from torchtext.vocab import Vocab
from torchtext.data.utils import RandomShuffler
from onmt.inputters.text_dataset import text_fields, TextMultiField
from onmt.inputters.image_dataset import image_fields
from onmt.inputters.audio_dataset import audio_fields
from onmt.inputters.vec_dataset import vec_fields
from onmt.utils.logging import logger
# backwards compatibility
from onmt.inputters.text_dataset import _feature_tokenize # noqa: F401
from onmt.inputters.image_dataset import ( # noqa: F401
batch_img as make_img)
import gc
# monkey-patch to make torchtext Vocab objects picklable
def _getstate(self):
return dict(self.__dict__, stoi=dict(self.stoi))
def _setstate(self, state):
self.__dict__.update(state)
self.stoi = defaultdict(lambda: 0, self.stoi)
Vocab.__getstate__ = _getstate
Vocab.__setstate__ = _setstate
def make_src(data, vocab):
src_size = max([t.size(0) for t in data])
src_vocab_size = max([t.max() for t in data]) + 1
alignment = torch.zeros(src_size, len(data), src_vocab_size)
for i, sent in enumerate(data):
for j, t in enumerate(sent):
alignment[j, i, t] = 1
return alignment
def make_tgt(data, vocab):
tgt_size = max([t.size(0) for t in data])
alignment = torch.zeros(tgt_size, len(data)).long()
for i, sent in enumerate(data):
alignment[:sent.size(0), i] = sent
return alignment
class AlignField(LabelField):
"""
Parse ['<src>-<tgt>', ...] into ['<src>','<tgt>', ...]
"""
def __init__(self, **kwargs):
kwargs['use_vocab'] = False
kwargs['preprocessing'] = parse_align_idx
super(AlignField, self).__init__(**kwargs)
def process(self, batch, device=None):
""" Turn a batch of align-idx to a sparse align idx Tensor"""
sparse_idx = []
for i, example in enumerate(batch):
for src, tgt in example:
# +1 for tgt side to keep coherent after "bos" padding,
# register ['N°_in_batch', 'tgt_id+1', 'src_id']
sparse_idx.append([i, tgt + 1, src])
align_idx = torch.tensor(sparse_idx, dtype=self.dtype, device=device)
return align_idx
def parse_align_idx(align_pharaoh):
"""
Parse Pharaoh alignment into [[<src>, <tgt>], ...]
"""
align_list = align_pharaoh.strip().split(' ')
flatten_align_idx = []
for align in align_list:
try:
src_idx, tgt_idx = align.split('-')
except ValueError:
logger.warning("{} in `{}`".format(align, align_pharaoh))
logger.warning("Bad alignment line exists. Please check file!")
raise
flatten_align_idx.append([int(src_idx), int(tgt_idx)])
return flatten_align_idx
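
To make the alignment handling concrete, the sketch below parses Pharaoh-style `src-tgt` pairs and then flattens a batch into `[batch_index, tgt_index + 1, src_index]` triples, mirroring the logic of `parse_align_idx` and `AlignField.process` with plain lists instead of tensors. The function names `parse_pharaoh` and `to_sparse` are hypothetical.

```python
def parse_pharaoh(line):
    """Turn '0-0 1-2' into [[0, 0], [1, 2]] (source index first)."""
    pairs = []
    for item in line.strip().split(" "):
        src_idx, tgt_idx = item.split("-")
        pairs.append([int(src_idx), int(tgt_idx)])
    return pairs


def to_sparse(batch):
    """Flatten a batch of pair lists into [example, tgt + 1, src] rows."""
    rows = []
    for i, example in enumerate(batch):
        for src, tgt in example:
            # The target index is shifted by one to account for the
            # BOS token prepended to every target sequence.
            rows.append([i, tgt + 1, src])
    return rows


pairs = parse_pharaoh("0-0 1-2 2-1")
sparse = to_sparse([pairs, parse_pharaoh("0-1")])
```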
def get_fields(
src_data_type,
n_src_feats,
n_tgt_feats,
pad='<blank>',
bos='<s>',
eos='</s>',
dynamic_dict=False,
with_align=False,
src_truncate=None,
tgt_truncate=None
):
"""
Args:
src_data_type: type of the source input. Options are [text|img|audio|vec].
n_src_feats (int): the number of source features (not counting tokens)
to create a :class:`torchtext.data.Field` for. (If
``src_data_type=="text"``, these fields are stored together
as a ``TextMultiField``).
n_tgt_feats (int): See above.
pad (str): Special pad symbol. Used on src and tgt side.
bos (str): Special beginning of sequence symbol. Only relevant
for tgt.
eos (str): Special end of sequence symbol. Only relevant
for tgt.
dynamic_dict (bool): Whether or not to include source map and
alignment fields.
with_align (bool): Whether or not to include word align.
src_truncate: Cut off src sequences beyond this (passed to
``src_data_type``'s data reader - see there for more details).
tgt_truncate: Cut off tgt sequences beyond this (passed to
:class:`TextDataReader` - see there for more details).
Returns:
A dict mapping names to fields. These names need to match
the dataset example attributes.
"""
assert src_data_type in ['text', 'img', 'audio', 'vec'], \
"Data type not implemented"
assert not dynamic_dict or src_data_type == 'text', \
'it is not possible to use dynamic_dict with non-text input'
fields = {}
fields_getters = {"text": text_fields,
"img": image_fields,
"audio": audio_fields,
"vec": vec_fields}
src_field_kwargs = {"n_feats": n_src_feats,
"include_lengths": True,
"pad": pad, "bos": None, "eos": None,
"truncate": src_truncate,
"base_name": "src"}
fields["src"] = fields_getters[src_data_type](**src_field_kwargs)
tgt_field_kwargs = {"n_feats": n_tgt_feats,
"include_lengths": False,
"pad": pad, "bos": bos, "eos": eos,
"truncate": tgt_truncate,
"base_name": "tgt"}
fields["tgt"] = fields_getters["text"](**tgt_field_kwargs)
indices = Field(use_vocab=False, dtype=torch.long, sequential=False)
fields["indices"] = indices
if dynamic_dict:
src_map = Field(
use_vocab=False, dtype=torch.float,
postprocessing=make_src, sequential=False)
fields["src_map"] = src_map
src_ex_vocab = RawField()
fields["src_ex_vocab"] = src_ex_vocab
align = Field(
use_vocab=False, dtype=torch.long,
postprocessing=make_tgt, sequential=False)
fields["alignment"] = align
if with_align:
word_align = AlignField()
fields["align"] = word_align
return fields
def load_old_vocab(vocab, data_type="text", dynamic_dict=False):
"""Update a legacy vocab/field format.
Args:
vocab: a list of (field name, torchtext.vocab.Vocab) pairs. This is the
format formerly saved in *.vocab.pt files. Or, text data
not using a :class:`TextMultiField`.
data_type (str): text, img, audio, or vec
dynamic_dict (bool): Used for copy attention.
Returns:
a dictionary whose keys are the field names and whose values are Fields.
"""
if _old_style_vocab(vocab):
# List[Tuple[str, Vocab]] -> List[Tuple[str, Field]]
# -> dict[str, Field]
vocab = dict(vocab)
n_src_features = sum('src_feat_' in k for k in vocab)
n_tgt_features = sum('tgt_feat_' in k for k in vocab)
fields = get_fields(
data_type, n_src_features, n_tgt_features,
dynamic_dict=dynamic_dict)
for n, f in fields.items():
try:
f_iter = iter(f)
except TypeError:
f_iter = [(n, f)]
for sub_n, sub_f in f_iter:
if sub_n in vocab:
sub_f.vocab = vocab[sub_n]
return fields
if _old_style_field_list(vocab): # upgrade to multifield
# Dict[str, List[Tuple[str, Field]]]
# doesn't change structure - don't return early.
fields = vocab
for base_name, vals in fields.items():
if ((base_name == 'src' and data_type == 'text') or
base_name == 'tgt'):
assert not isinstance(vals[0][1], TextMultiField)
fields[base_name] = [(base_name, TextMultiField(
vals[0][0], vals[0][1], vals[1:]))]
if _old_style_nesting(vocab):
# Dict[str, List[Tuple[str, Field]]] -> List[Tuple[str, Field]]
# -> dict[str, Field]
fields = dict(list(chain.from_iterable(vocab.values())))
return fields
def _old_style_vocab(vocab):
"""Detect old-style vocabs (``List[Tuple[str, torchtext.vocab.Vocab]]``).
Args:
vocab: some object loaded from a *.vocab.pt file
Returns:
Whether ``vocab`` is a list of pairs where the second object
is a :class:`torchtext.vocab.Vocab` object.
This exists because previously only the vocab objects from the fields
were saved directly, not the fields themselves, and the fields needed to
be reconstructed at training and translation time.
"""
return isinstance(vocab, list) and \
any(isinstance(v[1], Vocab) for v in vocab)
def _old_style_nesting(vocab):
"""Detect old-style nesting (``dict[str, List[Tuple[str, Field]]]``)."""
return isinstance(vocab, dict) and \
any(isinstance(v, list) for v in vocab.values())
def _old_style_field_list(vocab):
"""Detect old-style text fields.
Not old style vocab, old nesting, and text-type fields not using
``TextMultiField``.
Args:
vocab: some object loaded from a *.vocab.pt file
Returns:
Whether ``vocab`` is not an :func:`_old_style_vocab` and not
a :class:`TextMultiField` (using an old-style text representation).
"""
# if tgt isn't using TextMultiField, then no text field is.
return (not _old_style_vocab(vocab)) and _old_style_nesting(vocab) and \
(not isinstance(vocab['tgt'][0][1], TextMultiField))
def old_style_vocab(vocab):
"""Return whether the vocab/fields need to be updated."""
return _old_style_vocab(vocab) or _old_style_field_list(vocab) or \
_old_style_nesting(vocab)
def filter_example(ex, use_src_len=True, use_tgt_len=True,
min_src_len=1, max_src_len=float('inf'),
min_tgt_len=1, max_tgt_len=float('inf')):
"""Return whether an example is an acceptable length.
If used with a dataset as ``filter_pred``, use :func:`partial()`
for all keyword arguments.
Args:
ex (torchtext.data.Example): An object with a ``src`` and ``tgt``
property.
use_src_len (bool): Filter based on the length of ``ex.src``.
use_tgt_len (bool): Similar to above.
min_src_len (int): A non-negative minimally acceptable length
(examples of exactly this length will be included).
min_tgt_len (int): Similar to above.
max_src_len (int or float): A non-negative (possibly infinite)
maximally acceptable length (examples of exactly this length
will be included).
max_tgt_len (int or float): Similar to above.
"""
src_len = len(ex.src[0])
tgt_len = len(ex.tgt[0])
return (not use_src_len or min_src_len <= src_len <= max_src_len) and \
(not use_tgt_len or min_tgt_len <= tgt_len <= max_tgt_len)
def _pad_vocab_to_multiple(vocab, multiple):
vocab_size = len(vocab)
if vocab_size % multiple == 0:
return vocab
target_size = int(math.ceil(vocab_size / multiple)) * multiple
padding_tokens = [
"averyunlikelytoken%d" % i for i in range(target_size - vocab_size)]
vocab.extend(Vocab(Counter(), specials=padding_tokens))
return vocab
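
The size arithmetic in `_pad_vocab_to_multiple` can be checked in isolation. The sketch below reproduces only the rounding and the filler-token naming; `padded_size` and `filler_tokens` are hypothetical helper names, and no torchtext `Vocab` is involved.

```python
import math


def padded_size(vocab_size, multiple):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return int(math.ceil(vocab_size / multiple)) * multiple


def filler_tokens(vocab_size, multiple):
    """Names of the dummy tokens needed to reach the padded size."""
    missing = padded_size(vocab_size, multiple) - vocab_size
    return ["averyunlikelytoken%d" % i for i in range(missing)]


# A vocabulary of 50004 entries padded to a multiple of 8.
target = padded_size(50004, 8)
extra = filler_tokens(50004, 8)
```

Padding the vocabulary to a multiple of 8 is a common way to keep embedding and output-projection matrices at sizes that mixed-precision kernels handle efficiently.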
def _build_field_vocab(field, counter, size_multiple=1, **kwargs):
# this is basically copy-pasted from torchtext.
all_specials = [
field.unk_token, field.pad_token, field.init_token, field.eos_token
]
specials = [tok for tok in all_specials if tok is not None]
field.vocab = field.vocab_cls(counter, specials=specials, **kwargs)
if size_multiple > 1:
_pad_vocab_to_multiple(field.vocab, size_multiple)
def _load_vocab(vocab_path, name, counters, min_freq):
# counters changes in place
vocab = _read_vocab_file(vocab_path, name)
vocab_size = len(vocab)
logger.info('Loaded %s vocab has %d tokens.' % (name, vocab_size))
for i, token in enumerate(vocab):
# keep the order of tokens specified in the vocab file by
# adding them to the counter with decreasing counting values
counters[name][token] = vocab_size - i + min_freq
return vocab, vocab_size
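
The counter trick in `_load_vocab` assigns artificial, strictly decreasing counts so that a frequency-sorted vocabulary keeps the file order and every token clears `min_freq`. A minimal sketch of that idea, using only `collections.Counter` and a hypothetical helper name `seed_counter`:

```python
from collections import Counter


def seed_counter(vocab_tokens, min_freq):
    """Give earlier tokens larger counts, all of them above min_freq."""
    counter = Counter()
    size = len(vocab_tokens)
    for i, token in enumerate(vocab_tokens):
        counter[token] = size - i + min_freq
    return counter


counter = seed_counter(["the", "cat", "sat"], min_freq=2)
# Sorting by count recovers the original file order.
ranked = [tok for tok, _ in counter.most_common()]
```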
def _build_fv_from_multifield(multifield, counters, build_fv_args,
size_multiple=1):
for name, field in multifield:
_build_field_vocab(
field,
counters[name],
size_multiple=size_multiple,
**build_fv_args[name])
logger.info(" * %s vocab size: %d." % (name, len(field.vocab)))
def _build_fields_vocab(fields, counters, data_type, share_vocab,
vocab_size_multiple,
src_vocab_size, src_words_min_frequency,
tgt_vocab_size, tgt_words_min_frequency):
build_fv_args = defaultdict(dict)
build_fv_args["src"] = dict(
max_size=src_vocab_size, min_freq=src_words_min_frequency)
build_fv_args["tgt"] = dict(
max_size=tgt_vocab_size, min_freq=tgt_words_min_frequency)
tgt_multifield = fields["tgt"]
_build_fv_from_multifield(
tgt_multifield,
counters,
build_fv_args,
size_multiple=vocab_size_multiple if not share_vocab else 1)
if data_type == 'text':
src_multifield = fields["src"]
_build_fv_from_multifield(
src_multifield,
counters,
build_fv_args,
size_multiple=vocab_size_multiple if not share_vocab else 1)
if share_vocab:
# `tgt_vocab_size` is ignored when sharing vocabularies
logger.info(" * merging src and tgt vocab...")
src_field = src_multifield.base_field
tgt_field = tgt_multifield.base_field
_merge_field_vocabs(
src_field, tgt_field, vocab_size=src_vocab_size,
min_freq=src_words_min_frequency,
vocab_size_multiple=vocab_size_multiple)
logger.info(" * merged vocab size: %d." % len(src_field.vocab))
return fields
def build_vocab(train_dataset_files, fields, data_type, share_vocab,
src_vocab_path, src_vocab_size, src_words_min_frequency,
tgt_vocab_path, tgt_vocab_size, tgt_words_min_frequency,
vocab_size_multiple=1):
"""Build the fields for all data sides.
Args:
train_dataset_files: a list of paths to the training dataset ``.pt`` files.
fields (dict[str, Field]): fields to build vocab for.
data_type (str): A supported data type string.
share_vocab (bool): share source and target vocabulary?
src_vocab_path (str): Path to src vocabulary file.
src_vocab_size (int): size of the source vocabulary.
src_words_min_frequency (int): the minimum frequency needed to
include a source word in the vocabulary.
tgt_vocab_path (str): Path to tgt vocabulary file.
tgt_vocab_size (int): size of the target vocabulary.
tgt_words_min_frequency (int): the minimum frequency needed to
include a target word in the vocabulary.
vocab_size_multiple (int): ensure that the vocabulary size is a
multiple of this value.
Returns:
Dict of Fields
"""
counters = defaultdict(Counter)
if src_vocab_path:
try:
logger.info("Using existing vocabulary...")
vocab = torch.load(src_vocab_path)
# return vocab to dump with standard name
return vocab
except torch.serialization.pickle.UnpicklingError:
logger.info("Building vocab from text file...")
# empty train_dataset_files so that vocab is only loaded from
# given paths in src_vocab_path, tgt_vocab_path
train_dataset_files = []
# Load vocabulary
if src_vocab_path:
src_vocab, src_vocab_size = _load_vocab(
src_vocab_path, "src", counters,
src_words_min_frequency)
else:
src_vocab = None
if tgt_vocab_path:
tgt_vocab, tgt_vocab_size = _load_vocab(
tgt_vocab_path, "tgt", counters,
tgt_words_min_frequency)
else:
tgt_vocab = None
for i, path in enumerate(train_dataset_files):
dataset = torch.load(path)
logger.info(" * reloading %s." % path)
for ex in dataset.examples:
for name, field in fields.items():
try:
f_iter = iter(field)
except TypeError:
f_iter = [(name, field)]
all_data = [getattr(ex, name, None)]
else:
all_data = getattr(ex, name)
for (sub_n, sub_f), fd in zip(
f_iter, all_data):
has_vocab = (sub_n == 'src' and src_vocab) or \
(sub_n == 'tgt' and tgt_vocab)
if sub_f.sequential and not has_vocab:
val = fd
counters[sub_n].update(val)
# Drop datasets that are no longer needed from memory, but keep the last one
if i < len(train_dataset_files) - 1:
dataset.examples = None
gc.collect()
del dataset.examples
gc.collect()
del dataset
gc.collect()
fields = _build_fields_vocab(
fields, counters, data_type,
share_vocab, vocab_size_multiple,
src_vocab_size, src_words_min_frequency,
tgt_vocab_size, tgt_words_min_frequency)
return fields # is the return necessary?
def _merge_field_vocabs(src_field, tgt_field, vocab_size, min_freq,
vocab_size_multiple):
# in the long run, shouldn't it be possible to do this by calling
# build_vocab with both the src and tgt data?
specials = [tgt_field.unk_token, tgt_field.pad_token,
tgt_field.init_token, tgt_field.eos_token]
merged = sum(
[src_field.vocab.freqs, tgt_field.vocab.freqs], Counter()
)
merged_vocab = Vocab(
merged, specials=specials,
max_size=vocab_size, min_freq=min_freq
)
if vocab_size_multiple > 1:
_pad_vocab_to_multiple(merged_vocab, vocab_size_multiple)
src_field.vocab = merged_vocab
tgt_field.vocab = merged_vocab
assert len(src_field.vocab) == len(tgt_field.vocab)
def _read_vocab_file(vocab_path, tag):
"""Loads a vocabulary from the given path.
Args:
vocab_path (str): Path to utf-8 text file containing vocabulary.
Each token should be on a line by itself. Tokens must not
contain whitespace (otherwise only the text before the first
whitespace is kept).
tag (str): Used for logging which vocab is being read.
"""
logger.info("Loading {} vocabulary from {}".format(tag, vocab_path))
if not os.path.exists(vocab_path):
raise RuntimeError(
"{} vocabulary not found at {}".format(tag, vocab_path))
else:
with codecs.open(vocab_path, 'r', 'utf-8') as f:
return [line.strip().split()[0] for line in f if line.strip()]
def batch_iter(data, batch_size, batch_size_fn=None, batch_size_multiple=1):
"""Yield elements from data in chunks of batch_size, where each chunk size
is a multiple of batch_size_multiple.
This is an extended version of torchtext.data.batch.
"""
if batch_size_fn is None:
def batch_size_fn(new, count, sofar):
return count
minibatch, size_so_far = [], 0
for ex in data:
minibatch.append(ex)
size_so_far = batch_size_fn(ex, len(minibatch), size_so_far)
if size_so_far >= batch_size:
overflowed = 0
if size_so_far > batch_size:
overflowed += 1
if batch_size_multiple > 1:
overflowed += (
(len(minibatch) - overflowed) % batch_size_multiple)
if overflowed == 0:
yield minibatch
minibatch, size_so_far = [], 0
else:
if overflowed == len(minibatch):
logger.warning(
"An example was ignored, more tokens"
" than allowed by tokens batch_size")
else:
yield minibatch[:-overflowed]
minibatch = minibatch[-overflowed:]
size_so_far = 0
for i, ex in enumerate(minibatch):
size_so_far = batch_size_fn(ex, i + 1, size_so_far)
if minibatch:
yield minibatch
def _pool(data, batch_size, batch_size_fn, batch_size_multiple,
sort_key, random_shuffler, pool_factor):
for p in torchtext.data.batch(
data, batch_size * pool_factor,
batch_size_fn=batch_size_fn):
p_batch = list(batch_iter(
sorted(p, key=sort_key),
batch_size,
batch_size_fn=batch_size_fn,
batch_size_multiple=batch_size_multiple))
for b in random_shuffler(p_batch):
yield b
class OrderedIterator(torchtext.data.Iterator):
def __init__(self,
dataset,
batch_size,
pool_factor=1,
batch_size_multiple=1,
yield_raw_example=False,
**kwargs):
super(OrderedIterator, self).__init__(dataset, batch_size, **kwargs)
self.batch_size_multiple = batch_size_multiple
self.yield_raw_example = yield_raw_example
self.dataset = dataset
self.pool_factor = pool_factor
def create_batches(self):
if self.train:
if self.yield_raw_example:
self.batches = batch_iter(
self.data(),
1,
batch_size_fn=None,
batch_size_multiple=1)
else:
self.batches = _pool(
self.data(),
self.batch_size,
self.batch_size_fn,
self.batch_size_multiple,
self.sort_key,
self.random_shuffler,
self.pool_factor)
else:
self.batches = []
for b in batch_iter(
self.data(),
self.batch_size,
batch_size_fn=self.batch_size_fn,
batch_size_multiple=self.batch_size_multiple):
self.batches.append(sorted(b, key=self.sort_key))
def __iter__(self):
"""
Extended version of the definition in torchtext.data.Iterator.
Added yield_raw_example behaviour to yield a torchtext.data.Example
instead of a torchtext.data.Batch object.
"""
while True:
self.init_epoch()
for idx, minibatch in enumerate(self.batches):
# fast-forward if loaded from state
if self._iterations_this_epoch > idx:
continue
self.iterations += 1
self._iterations_this_epoch += 1
if self.sort_within_batch:
# NOTE: `rnn.pack_padded_sequence` requires that a
# minibatch be sorted by decreasing order, which
# requires reversing relative to typical sort keys
if self.sort:
minibatch.reverse()
else:
minibatch.sort(key=self.sort_key, reverse=True)
if self.yield_raw_example:
yield minibatch[0]
else:
yield torchtext.data.Batch(
minibatch,
self.dataset,
self.device)
if not self.repeat:
return
class MultipleDatasetIterator(object):
"""
This takes a list of iterable objects (DatasetLazyIter) and their
respective weights, and yields a batch in the wanted proportions.
"""
def __init__(self,
train_shards,
fields,
device,
opt):
self.index = -1
self.iterables = []
for shard in train_shards:
self.iterables.append(
build_dataset_iter(shard, fields, opt, multi=True))
self.init_iterators = True
self.weights = opt.data_weights
self.batch_size = opt.batch_size
self.batch_size_fn = max_tok_len \
if opt.batch_type == "tokens" else None
self.batch_size_multiple = 8 if opt.model_dtype == "fp16" else 1
self.device = device
# Temporarily load one shard to retrieve sort_key for data_type
temp_dataset = torch.load(self.iterables[0]._paths[0])
self.sort_key = temp_dataset.sort_key
self.random_shuffler = RandomShuffler()
self.pool_factor = opt.pool_factor
del temp_dataset
def _iter_datasets(self):
if self.init_iterators:
self.iterators = [iter(iterable) for iterable in self.iterables]
self.init_iterators = False
for weight in self.weights:
self.index = (self.index + 1) % len(self.iterators)
for _ in range(weight):
yield self.iterators[self.index]
def _iter_examples(self):
for iterator in cycle(self._iter_datasets()):
yield next(iterator)
def __iter__(self):
while True:
for minibatch in _pool(
self._iter_examples(),
self.batch_size,
self.batch_size_fn,
self.batch_size_multiple,
self.sort_key,
self.random_shuffler,
self.pool_factor):
minibatch = sorted(minibatch, key=self.sort_key, reverse=True)
yield torchtext.data.Batch(minibatch,
self.iterables[0].dataset,
self.device)
class DatasetLazyIter(object):
"""Yield data from sharded dataset files.
Args:
dataset_paths: a list containing the locations of dataset files.
fields (dict[str, Field]): fields dict for the
datasets.
batch_size (int): batch size.
batch_size_fn: custom batch process function.
device: See :class:`OrderedIterator` ``device``.
is_train (bool): train or valid?
"""
def __init__(self, dataset_paths, fields, batch_size, batch_size_fn,
batch_size_multiple, device, is_train, pool_factor,
repeat=True, num_batches_multiple=1, yield_raw_example=False):
self._paths = dataset_paths
self.fields = fields
self.batch_size = batch_size
self.batch_size_fn = batch_size_fn
self.batch_size_multiple = batch_size_multiple
self.device = device
self.is_train = is_train
self.repeat = repeat
self.num_batches_multiple = num_batches_multiple
self.yield_raw_example = yield_raw_example
self.pool_factor = pool_factor
def _iter_dataset(self, path):
logger.info('Loading dataset from %s' % path)
cur_dataset = torch.load(path)
logger.info('number of examples: %d' % len(cur_dataset))
cur_dataset.fields = self.fields
cur_iter = OrderedIterator(
dataset=cur_dataset,
batch_size=self.batch_size,
pool_factor=self.pool_factor,
batch_size_multiple=self.batch_size_multiple,
batch_size_fn=self.batch_size_fn,
device=self.device,
train=self.is_train,
sort=False,
sort_within_batch=True,
repeat=False,
yield_raw_example=self.yield_raw_example
)
for batch in cur_iter:
self.dataset = cur_iter.dataset
yield batch
# NOTE: This is causing some issues for consumer/producer,
# as we may still have some of those examples in some queue
# cur_dataset.examples = None
# gc.collect()
# del cur_dataset
# gc.collect()
def __iter__(self):
num_batches = 0
paths = self._paths
if self.is_train and self.repeat:
# Cycle through the shards indefinitely.
paths = cycle(paths)
for path in paths:
for batch in self._iter_dataset(path):
yield batch
num_batches += 1
if self.is_train and not self.repeat and \
num_batches % self.num_batches_multiple != 0:
# When the dataset is not repeated, we might need to ensure that
# the number of returned batches is the multiple of a given value.
# This is important for multi GPU training to ensure that all
# workers have the same number of batches to process.
for path in paths:
for batch in self._iter_dataset(path):
yield batch
num_batches += 1
if num_batches % self.num_batches_multiple == 0:
return
def max_tok_len(new, count, sofar):
"""
In the token batching scheme, the number of sequences is limited
such that the total number of src/tgt tokens (including padding)
in a batch is at most batch_size.
"""
# Maintains the longest src and tgt length in the current batch
global max_src_in_batch, max_tgt_in_batch # this is a hack
# Reset current longest length at a new batch (count=1)
if count == 1:
max_src_in_batch = 0
max_tgt_in_batch = 0
# Src: [ w1 ... wN ]
max_src_in_batch = max(max_src_in_batch, len(new.src[0]) + 2)
# Tgt: [w1 ... wM ]
max_tgt_in_batch = max(max_tgt_in_batch, len(new.tgt[0]) + 1)
src_elements = count * max_src_in_batch
tgt_elements = count * max_tgt_in_batch
return max(src_elements, tgt_elements)
def build_dataset_iter(corpus_type, fields, opt, is_train=True, multi=False):
"""
Return the train/validation data iterator for the trainer to
iterate over. A simple ordered iterator strategy is implemented here,
but a more sophisticated strategy such as curriculum learning would also work.
"""
dataset_paths = list(sorted(
glob.glob(opt.data + '.' + corpus_type + '.[0-9]*.pt')))
if not dataset_paths:
if is_train:
raise ValueError('Training data %s not found' % opt.data)
else:
return None
if multi:
batch_size = 1
batch_fn = None
batch_size_multiple = 1
else:
batch_size = opt.batch_size if is_train else opt.valid_batch_size
batch_fn = max_tok_len \
if is_train and opt.batch_type == "tokens" else None
batch_size_multiple = 8 if opt.model_dtype == "fp16" else 1
device = "cuda" if opt.gpu_ranks else "cpu"
return DatasetLazyIter(
dataset_paths,
fields,
batch_size,
batch_fn,
batch_size_multiple,
device,
is_train,
opt.pool_factor,
repeat=not opt.single_pass,
num_batches_multiple=max(opt.accum_count) * opt.world_size,
yield_raw_example=multi)
def build_dataset_iter_multiple(train_shards, fields, opt):
return MultipleDatasetIterator(
train_shards, fields, "cuda" if opt.gpu_ranks else "cpu", opt)
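The token-based batching idea behind `max_tok_len` and `batch_iter` can be sketched without torchtext. This is a simplified illustration, not the code above: the hypothetical `token_batches` works on plain sequence lengths, ignores the extra `+2`/`+1` slots that `max_tok_len` reserves for special tokens, and keeps an oversized example as its own batch instead of dropping it with a warning.

```python
def token_batches(lengths, max_tokens):
    """Group sequence lengths so that len(batch) * max(batch) <= max_tokens.

    The padded size of a batch is (number of sequences) * (longest sequence),
    which is the quantity a token budget has to bound.
    """
    batch, longest = [], 0
    for n in lengths:
        new_longest = max(longest, n)
        # Would adding this sequence push the padded size over the budget?
        if batch and (len(batch) + 1) * new_longest > max_tokens:
            yield batch
            batch, new_longest = [], n
        batch.append(n)
        longest = new_longest
    if batch:
        yield batch


if __name__ == "__main__":
    for b in token_batches([3, 4, 2, 7, 1], max_tokens=8):
        print(b, "padded size:", len(b) * max(b))
```

Because the cost is driven by the longest sequence in the batch, sorting examples by length before batching (as `_pool` does within each pool) tends to waste fewer padded positions.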
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/text_dataset.py
================================================
# -*- coding: utf-8 -*-
from functools import partial
import six
import torch
from torchtext.data import Field, RawField
from onmt.inputters.datareader_base import DataReaderBase
class TextDataReader(DataReaderBase):
def read(self, sequences, side, _dir=None):
"""Read text data from disk.
Args:
sequences (str or Iterable[str]):
path to text file or iterable of the actual text data.
side (str): Prefix used in return dict. Usually
``"src"`` or ``"tgt"``.
_dir (NoneType): Leave as ``None``. This parameter exists to
conform with the :func:`DataReaderBase.read()` signature.
Yields:
dictionaries whose keys are the names of fields and whose
values are more or less the result of tokenizing with those
fields.
"""
assert _dir is None or _dir == "", \
"Cannot use _dir with TextDataReader."
if isinstance(sequences, str):
sequences = DataReaderBase._read_file(sequences)
for i, seq in enumerate(sequences):
if isinstance(seq, six.binary_type):
seq = seq.decode("utf-8")
yield {side: seq, "indices": i}
def text_sort_key(ex):
"""Sort using the number of tokens in the sequence."""
if hasattr(ex, "tgt"):
return len(ex.src[0]), len(ex.tgt[0])
return len(ex.src[0])
# mix this with partial
def _feature_tokenize(
string, layer=0, tok_delim=None, feat_delim=None, truncate=None):
"""Split apart word features (like POS/NER tags) from the tokens.
Args:
string (str): A string with ``tok_delim`` joining tokens and
features joined by ``feat_delim``. For example,
``"hello|NOUN|'' Earth|NOUN|PLANET"``.
layer (int): Which feature to extract. (Not used if there are no
features, indicated by ``feat_delim is None``). In the
example above, layer 2 is ``'' PLANET``.
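tok_delim (str or NoneType): Delimiter between tokens. ``None``
splits on any whitespace.
feat_delim (str or NoneType): Delimiter between a token and its
features. ``None`` means the tokens carry no features.
truncate (int or NoneType): Restrict sequences to this length of
tokens.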
Returns:
List[str] of tokens.
"""
tokens = string.split(tok_delim)
if truncate is not None:
tokens = tokens[:truncate]
if feat_delim is not None:
tokens = [t.split(feat_delim)[layer] for t in tokens]
return tokens
class TextMultiField(RawField):
"""Container for subfields.
Text data might use POS/NER/etc labels in addition to tokens.
This class associates the "base" :class:`Field` with any subfields.
It also handles padding the data and stacking it.
Args:
base_name (str): Name for the base field.
base_field (Field): The token field.
feats_fields (Iterable[Tuple[str, Field]]): A list of name-field
pairs.
Attributes:
fields (Iterable[Tuple[str, Field]]): A list of name-field pairs.
The order is defined as the base field first, then
``feats_fields`` in alphabetical order.
"""
def __init__(self, base_name, base_field, feats_fields):
super(TextMultiField, self).__init__()
self.fields = [(base_name, base_field)]
for name, ff in sorted(feats_fields, key=lambda kv: kv[0]):
self.fields.append((name, ff))
@property
def base_field(self):
return self.fields[0][1]
def process(self, batch, device=None):
"""Convert outputs of preprocess into Tensors.
Args:
batch (List[List[List[str]]]): A list of length batch size.
Each element is a list of the preprocess results for each
field (which are lists of str "words" or feature tags).
device (torch.device or str): The device on which the tensor(s)
are built.
Returns:
torch.LongTensor or Tuple[LongTensor, LongTensor]:
A tensor of shape ``(seq_len, batch_size, len(self.fields))``
where the field features are ordered like ``self.fields``.
If the base field returns lengths, these are also returned
and have shape ``(batch_size,)``.
"""
# batch (list(list(list))): batch_size x len(self.fields) x seq_len
batch_by_feat = list(zip(*batch))
base_data = self.base_field.process(batch_by_feat[0], device=device)
if self.base_field.include_lengths:
# lengths: batch_size
base_data, lengths = base_data
feats = [ff.process(batch_by_feat[i], device=device)
for i, (_, ff) in enumerate(self.fields[1:], 1)]
levels = [base_data] + feats
# data: seq_len x batch_size x len(self.fields)
data = torch.stack(levels, 2)
if self.base_field.include_lengths:
return data, lengths
else:
return data
def preprocess(self, x):
"""Preprocess data.
Args:
x (str): A sentence string (words joined by whitespace).
Returns:
List[List[str]]: A list of length ``len(self.fields)`` containing
lists of tokens/feature tags for the sentence. The output
is ordered like ``self.fields``.
"""
return [f.preprocess(x) for _, f in self.fields]
def __getitem__(self, item):
return self.fields[item]
def text_fields(**kwargs):
"""Create text fields.
Args:
base_name (str): Name associated with the field.
n_feats (int): Number of word level feats (not counting the tokens)
include_lengths (bool): Optionally return the sequence lengths.
pad (str, optional): Defaults to ``"<blank>"``.
bos (str or NoneType, optional): Defaults to ``"<s>"``.
eos (str or NoneType, optional): Defaults to ``"</s>"``.
truncate (int or NoneType, optional): Defaults to ``None``.
Returns:
TextMultiField
"""
n_feats = kwargs["n_feats"]
include_lengths = kwargs["include_lengths"]
base_name = kwargs["base_name"]
pad = kwargs.get("pad", "<blank>")
bos = kwargs.get("bos", "<s>")
eos = kwargs.get("eos", "</s>")
truncate = kwargs.get("truncate", None)
fields_ = []
feat_delim = u"│" if n_feats > 0 else None
for i in range(n_feats + 1):
name = base_name + "_feat_" + str(i - 1) if i > 0 else base_name
tokenize = partial(
_feature_tokenize,
layer=i,
truncate=truncate,
feat_delim=feat_delim)
use_len = i == 0 and include_lengths
feat = Field(
init_token=bos, eos_token=eos,
pad_token=pad, tokenize=tokenize,
include_lengths=use_len)
fields_.append((name, feat))
assert fields_[0][0] == base_name # sanity check
field = TextMultiField(fields_[0][0], fields_[0][1], fields_[1:])
return field
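To make the layer splitting in `_feature_tokenize` and `TextMultiField.preprocess` concrete, here is a minimal standalone sketch. The helper `split_layers` is hypothetical and not part of the module; it assumes whitespace-separated tokens and uses `|` as the feature delimiter, as in the docstring example, which may differ from the delimiter `text_fields` actually passes.

```python
def split_layers(sentence, n_feats, feat_delim="|"):
    """Return one token list per layer: layer 0 is the words,
    layers 1..n_feats are the word-level features."""
    tokens = sentence.split()
    return [
        [tok.split(feat_delim)[layer] for tok in tokens]
        for layer in range(n_feats + 1)
    ]


if __name__ == "__main__":
    layers = split_layers("hello|NOUN|'' Earth|NOUN|PLANET", n_feats=2)
    for i, layer in enumerate(layers):
        print(i, layer)
```

The result has the same outer shape as the output of `TextMultiField.preprocess`: a list with one entry per field, base field first.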
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/inputters/vec_dataset.py
================================================
import os
import torch
from torchtext.data import Field
from onmt.inputters.datareader_base import DataReaderBase
try:
import numpy as np
except ImportError:
np = None
class VecDataReader(DataReaderBase):
"""Read feature vector data from disk.
Raises:
onmt.inputters.datareader_base.MissingDependencyException: If
importing ``np`` fails.
"""
def __init__(self):
self._check_deps()
@classmethod
def _check_deps(cls):
if np is None:
cls._raise_missing_dep("np")
def read(self, vecs, side, vec_dir=None):
"""Read data into dicts.
Args:
vecs (str or Iterable[str]): Sequence of feature vector paths or
path to file containing feature vector paths.
In either case, the filenames may be relative to ``vec_dir``
(default behavior) or absolute.
side (str): Prefix used in return dict. Usually
``"src"`` or ``"tgt"``.
vec_dir (str): Location of source vectors. See ``vecs``.
Yields:
A dictionary containing feature vector data.
"""
if isinstance(vecs, str):
vecs = DataReaderBase._read_file(vecs)
for i, filename in enumerate(vecs):
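# entries read from a file are bytes; iterables of str need no decoding
filename = (filename.decode("utf-8")
if isinstance(filename, bytes) else filename).strip()
# vec_dir defaults to None, in which case the filename is used as-is
vec_path = os.path.join(vec_dir, filename) if vec_dir else filename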
if not os.path.exists(vec_path):
vec_path = filename
assert os.path.exists(vec_path), \
'vec path %s not found' % filename
vec = np.load(vec_path)
yield {side: torch.from_numpy(vec),
side + "_path": filename, "indices": i}
def vec_sort_key(ex):
"""Sort using the length of the vector sequence."""
return ex.src.shape[0]
class VecSeqField(Field):
Defines a vector datatype and instructions for converting to Tensor.
See :class:`Field` for attribute descriptions.
"""
def __init__(self, preprocessing=None, postprocessing=None,
include_lengths=False, batch_first=False, pad_index=0,
is_target=False):
super(VecSeqField, self).__init__(
sequential=True, use_vocab=False, init_token=None,
eos_token=None, fix_length=False, dtype=torch.float,
preprocessing=preprocessing, postprocessing=postprocessing,
lower=False, tokenize=None, include_lengths=include_lengths,
batch_first=batch_first, pad_token=pad_index, unk_token=None,
pad_first=False, truncate_first=False, stop_words=None,
is_target=is_target
)
def pad(self, minibatch):
"""Pad a batch of examples to the length of the longest example.
Args:
minibatch (List[torch.FloatTensor]): A list of feature vector data,
each having shape ``(len, n_feats, feat_dim)``
where len is variable.
Returns:
torch.FloatTensor or Tuple[torch.FloatTensor, List[int]]: The
padded tensor of shape
``(batch_size, max_len, n_feats, feat_dim)``.
and a list of the lengths if `self.include_lengths` is `True`
else just returns the padded tensor.
"""
assert not self.pad_first and not self.truncate_first \
and not self.fix_length and self.sequential
minibatch = list(minibatch)
lengths = [x.size(0) for x in minibatch]
max_len = max(lengths)
nfeats = minibatch[0].size(1)
feat_dim = minibatch[0].size(2)
feats = torch.full((len(minibatch), max_len, nfeats, feat_dim),
self.pad_token)
for i, (feat, len_) in enumerate(zip(minibatch, lengths)):
feats[i, 0:len_, :, :] = feat
if self.include_lengths:
return (feats, lengths)
return feats
def numericalize(self, arr, device=None):
Turn a batch of examples that use this field into a Tensor.
If the field has ``include_lengths=True``, a tensor of lengths will be
included in the return value.
Args:
arr (torch.FloatTensor or Tuple(torch.FloatTensor, List[int])):
List of tokenized and padded examples, or tuple of List of
tokenized and padded examples and List of lengths of each
example if self.include_lengths is True.
device (str or torch.device): See `Field.numericalize`.
"""
assert self.use_vocab is False
if self.include_lengths and not isinstance(arr, tuple):
raise ValueError("Field has include_lengths set to True, but "
"input data is not a tuple of "
"(data batch, batch lengths).")
if isinstance(arr, tuple):
arr, lengths = arr
lengths = torch.tensor(lengths, dtype=torch.int, device=device)
arr = arr.to(device)
if self.postprocessing is not None:
arr = self.postprocessing(arr, None)
if self.sequential and not self.batch_first:
arr = arr.permute(1, 0, 2, 3)
if self.sequential:
arr = arr.contiguous()
if self.include_lengths:
return arr, lengths
return arr
def vec_fields(**kwargs):
vec = VecSeqField(pad_index=0, include_lengths=True)
return vec
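The padding step in `VecSeqField.pad` can be illustrated with plain nested lists. This is only a sketch of the idea under simplifying assumptions: the hypothetical `pad_sequences` handles sequences of flat feature rows (shape `(len, feat_dim)`), whereas the method above works on tensors of shape `(len, n_feats, feat_dim)`.

```python
def pad_sequences(seqs, pad_value=0.0):
    """Pad variable-length sequences of feature rows to the longest one.

    Returns the padded batch and the original lengths, mirroring the
    (data, lengths) pair produced when include_lengths is True.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    width = len(seqs[0][0])  # assumes every row has the same width
    padded = [
        [list(row) for row in s]
        + [[pad_value] * width for _ in range(max_len - len(s))]
        for s in seqs
    ]
    return padded, lengths


if __name__ == "__main__":
    batch = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
    padded, lengths = pad_sequences(batch)
    print(padded)
    print(lengths)
```

Keeping the original lengths alongside the padded data lets later stages distinguish real positions from padding.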
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/model_builder.py
================================================
"""
This module builds models: it consults the options
and creates each encoder and decoder accordingly.
"""
import re
import torch
import torch.nn as nn
from torch.nn.init import xavier_uniform_
import onmt.inputters as inputters
import onmt.modules
from onmt.encoders import str2enc
from onmt.decoders import str2dec
from onmt.modules import Embeddings, VecEmbedding, CopyGenerator
from onmt.modules.util_class import Cast
from onmt.utils.misc import use_gpu
from onmt.utils.logging import logger
from onmt.utils.parse import ArgumentParser
def build_embeddings(opt, text_field, for_encoder=True):
"""
Args:
opt: the options in the current environment.
text_field(TextMultiField): word and feats field.
for_encoder(bool): build Embeddings for encoder or decoder?
"""
emb_dim = opt.src_word_vec_size if for_encoder else opt.tgt_word_vec_size
if opt.model_type == "vec" and for_encoder:
return VecEmbedding(
opt.feat_vec_size,
emb_dim,
position_encoding=opt.position_encoding,
dropout=(opt.dropout[0] if type(opt.dropout) is list
else opt.dropout),
)
pad_indices = [f.vocab.stoi[f.pad_token] for _, f in text_field]
word_padding_idx, feat_pad_indices = pad_indices[0], pad_indices[1:]
num_embs = [len(f.vocab) for _, f in text_field]
num_word_embeddings, num_feat_embeddings = num_embs[0], num_embs[1:]
fix_word_vecs = opt.fix_word_vecs_enc if for_encoder \
else opt.fix_word_vecs_dec
emb = Embeddings(
word_vec_size=emb_dim,
position_encoding=opt.position_encoding,
feat_merge=opt.feat_merge,
feat_vec_exponent=opt.feat_vec_exponent,
feat_vec_size=opt.feat_vec_size,
dropout=opt.dropout[0] if type(opt.dropout) is list else opt.dropout,
word_padding_idx=word_padding_idx,
feat_padding_idx=feat_pad_indices,
word_vocab_size=num_word_embeddings,
feat_vocab_sizes=num_feat_embeddings,
sparse=opt.optim == "sparseadam",
fix_word_vecs=fix_word_vecs
)
return emb
def build_encoder(opt, embeddings):
"""
Dispatch to the encoder class selected by the options.
Args:
opt: the options in the current environment.
embeddings (Embeddings): vocab embeddings for this encoder.
"""
enc_type = opt.encoder_type if opt.model_type == "text" \
or opt.model_type == "vec" else opt.model_type
return str2enc[enc_type].from_opt(opt, embeddings)
def build_decoder(opt, embeddings):
"""
Dispatch to the decoder class selected by the options.
Args:
opt: the options in the current environment.
embeddings (Embeddings): vocab embeddings for this decoder.
"""
dec_type = "ifrnn" if opt.decoder_type == "rnn" and opt.input_feed \
else opt.decoder_type
return str2dec[dec_type].from_opt(opt, embeddings)
def load_test_model(opt, model_path=None):
if model_path is None:
model_path = opt.models[0]
checkpoint = torch.load(model_path,
map_location=lambda storage, loc: storage)
model_opt = ArgumentParser.ckpt_model_opts(checkpoint['opt'])
ArgumentParser.update_model_opts(model_opt)
ArgumentParser.validate_model_opts(model_opt)
vocab = checkpoint['vocab']
if inputters.old_style_vocab(vocab):
fields = inputters.load_old_vocab(
vocab, opt.data_type, dynamic_dict=model_opt.copy_attn
)
else:
fields = vocab
model = build_base_model(model_opt, fields, use_gpu(opt), checkpoint,
opt.gpu)
if opt.fp32:
model.float()
model.eval()
model.generator.eval()
return fields, model, model_opt
def build_base_model(model_opt, fields, gpu, checkpoint=None, gpu_id=None):
"""Build a model from opts.
Args:
model_opt: the option loaded from checkpoint. It's important that
the opts have been updated and validated. See
:class:`onmt.utils.parse.ArgumentParser`.
fields (dict[str, torchtext.data.Field]):
`Field` objects for the model.
gpu (bool): whether to use gpu.
checkpoint: the model generated by the training phase, or a snapshot
resumed from an interrupted training run.
gpu_id (int or NoneType): Which GPU to use.
Returns:
the NMTModel.
"""
# for back compat when attention_dropout was not defined
try:
model_opt.attention_dropout
except AttributeError:
model_opt.attention_dropout = model_opt.dropout
# Build embeddings.
if model_opt.model_type == "text" or model_opt.model_type == "vec":
src_field = fields["src"]
src_emb = build_embeddings(model_opt, src_field)
else:
src_emb = None
# Build encoder.
encoder = build_encoder(model_opt, src_emb)
# Build decoder.
tgt_field = fields["tgt"]
tgt_emb = build_embeddings(model_opt, tgt_field, for_encoder=False)
# Share the embedding matrix (requires preprocessing with -share_vocab).
if model_opt.share_embeddings:
# src/tgt vocab should be the same if `-share_vocab` is specified.
assert src_field.base_field.vocab == tgt_field.base_field.vocab, \
"preprocess with -share_vocab if you use share_embeddings"
tgt_emb.word_lut.weight = src_emb.word_lut.weight
decoder = build_decoder(model_opt, tgt_emb)
# Build NMTModel(= encoder + decoder).
if gpu and gpu_id is not None:
device = torch.device("cuda", gpu_id)
elif gpu and not gpu_id:
device = torch.device("cuda")
elif not gpu:
device = torch.device("cpu")
model = onmt.models.NMTModel(encoder, decoder)
# Build Generator.
if not model_opt.copy_attn:
if model_opt.generator_function == "sparsemax":
gen_func = onmt.modules.sparse_activations.LogSparsemax(dim=-1)
else:
gen_func = nn.LogSoftmax(dim=-1)
generator = nn.Sequential(
nn.Linear(model_opt.dec_rnn_size,
len(fields["tgt"].base_field.vocab)),
Cast(torch.float32),
gen_func
)
if model_opt.share_decoder_embeddings:
generator[0].weight = decoder.embeddings.word_lut.weight
else:
tgt_base_field = fields["tgt"].base_field
vocab_size = len(tgt_base_field.vocab)
pad_idx = tgt_base_field.vocab.stoi[tgt_base_field.pad_token]
generator = CopyGenerator(model_opt.dec_rnn_size, vocab_size, pad_idx)
# Load the model states from checkpoint or initialize them.
if checkpoint is not None:
# This preserves backward compatibility for models using custom layer norm
def fix_key(s):
s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.b_2',
r'\1.layer_norm\2.bias', s)
s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.a_2',
r'\1.layer_norm\2.weight', s)
return s
checkpoint['model'] = {fix_key(k): v
for k, v in checkpoint['model'].items()}
# end of patch for backward compatibility
model.load_state_dict(checkpoint['model'], strict=False)
generator.load_state_dict(checkpoint['generator'], strict=False)
else:
if model_opt.param_init != 0.0:
for p in model.parameters():
p.data.uniform_(-model_opt.param_init, model_opt.param_init)
for p in generator.parameters():
p.data.uniform_(-model_opt.param_init, model_opt.param_init)
if model_opt.param_init_glorot:
for p in model.parameters():
if p.dim() > 1:
xavier_uniform_(p)
for p in generator.parameters():
if p.dim() > 1:
xavier_uniform_(p)
if hasattr(model.encoder, 'embeddings'):
model.encoder.embeddings.load_pretrained_vectors(
model_opt.pre_word_vecs_enc)
if hasattr(model.decoder, 'embeddings'):
model.decoder.embeddings.load_pretrained_vectors(
model_opt.pre_word_vecs_dec)
model.generator = generator
model.to(device)
if model_opt.model_dtype == 'fp16' and model_opt.optim == 'fusedadam':
model.half()
return model
def build_model(model_opt, opt, fields, checkpoint):
logger.info('Building model...')
model = build_base_model(model_opt, fields, use_gpu(opt), checkpoint)
logger.info(model)
return model
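The key renaming done by `fix_key` in `build_base_model` is easy to check in isolation. The snippet below copies the two regular expressions into a standalone function; the parameter names passed to it are made-up examples, not keys taken from a real checkpoint.

```python
import re


def fix_key(s):
    """Rename legacy layer norm parameters (a_2, b_2) to weight and bias."""
    s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.b_2',
               r'\1.layer_norm\2.bias', s)
    s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.a_2',
               r'\1.layer_norm\2.weight', s)
    return s


if __name__ == "__main__":
    for key in ("encoder.transformer.0.layer_norm.a_2",
                "decoder.layer_norm_1.b_2",
                "decoder.embeddings.word_lut.weight"):
        print(key, "->", fix_key(key))
```

Keys that do not match either pattern pass through unchanged, so the rewrite is safe to apply to every entry of the state dict.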
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/models/__init__.py
================================================
"""Module defining models."""
from onmt.models.model_saver import build_model_saver, ModelSaver
from onmt.models.model import NMTModel
__all__ = ["build_model_saver", "ModelSaver", "NMTModel"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/models/model.py
================================================
""" Onmt NMT Model base class definition """
import torch.nn as nn
class NMTModel(nn.Module):
"""
Core trainable object in OpenNMT. Implements a trainable interface
for a simple, generic encoder + decoder model.
Args:
encoder (onmt.encoders.EncoderBase): an encoder object
decoder (onmt.decoders.DecoderBase): a decoder object
"""
def __init__(self, encoder, decoder):
super(NMTModel, self).__init__()
self.encoder = encoder
self.decoder = decoder
def forward(self, src, tgt, lengths, bptt=False, with_align=False):
"""Forward propagate a `src` and `tgt` pair for training.
Possibly initialized with a beginning decoder state.
Args:
src (Tensor): A source sequence passed to encoder.
typically for inputs this will be a padded `LongTensor`
of size ``(len, batch, features)``. However, may be an
image or other generic input depending on encoder.
tgt (LongTensor): A target sequence passed to decoder.
Size ``(tgt_len, batch, features)``.
lengths (LongTensor): The src lengths, pre-padding ``(batch,)``.
bptt (Boolean): A flag indicating if truncated bptt is set.
If not set, the decoder state is re-initialized via ``init_state``.
with_align (Boolean): A flag indicating whether to output alignments.
Only valid for the transformer decoder.
Returns:
(FloatTensor, dict[str, FloatTensor]):
* decoder output ``(tgt_len, batch, hidden)``
* dictionary of attention distributions, each ``(tgt_len, batch, src_len)``
"""
dec_in = tgt[:-1] # exclude last target from inputs
enc_state, memory_bank, lengths = self.encoder(src, lengths)
if bptt is False:
self.decoder.init_state(src, memory_bank, enc_state)
dec_out, attns = self.decoder(dec_in, memory_bank,
memory_lengths=lengths,
with_align=with_align)
return dec_out, attns
def update_dropout(self, dropout):
self.encoder.update_dropout(dropout)
self.decoder.update_dropout(dropout)
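The slice `tgt[:-1]` in `NMTModel.forward` is the usual teacher-forcing shift: the decoder is fed every target token except the last. A small sketch of that shift on a plain Python list follows. The helper name `split_for_teacher_forcing`, the `<s>` and `</s>` markers, and the pairing with `tgt[1:]` as the expected outputs are assumptions made for illustration; the loss computation is not part of this file.

```python
def split_for_teacher_forcing(tgt):
    """Return (decoder inputs, expected outputs) for one target sequence."""
    dec_in = tgt[:-1]  # same slicing as in NMTModel.forward
    gold = tgt[1:]     # assumption: outputs are scored against the shifted sequence
    return dec_in, gold


tgt = ["<s>", "d", "s", "</s>"]
dec_in, gold = split_for_teacher_forcing(tgt)
for fed, expected in zip(dec_in, gold):
    print("%-4s -> %s" % (fed, expected))
```

Each decoder input is paired with the token that follows it, so both lists are one element shorter than `tgt`.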
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/models/model_saver.py
================================================
import os
import torch
from collections import deque
from onmt.utils.logging import logger
from copy import deepcopy
def build_model_saver(model_opt, opt, model, fields, optim):
model_saver = ModelSaver(opt.save_model,
model,
model_opt,
fields,
optim,
opt.keep_checkpoint)
return model_saver
class ModelSaverBase(object):
"""Base class for model saving operations
Inherited classes must implement private methods:
* `_save`
* `_rm_checkpoint`
"""
def __init__(self, base_path, model, model_opt, fields, optim,
keep_checkpoint=-1):
self.base_path = base_path
self.model = model
self.model_opt = model_opt
self.fields = fields
self.optim = optim
self.last_saved_step = None
self.keep_checkpoint = keep_checkpoint
if keep_checkpoint > 0:
self.checkpoint_queue = deque([], maxlen=keep_checkpoint)
def save(self, step, moving_average=None):
"""Main entry point for model saver
It wraps the `_save` method with checks and applies the
`keep_checkpoint`-related logic.
"""
if self.keep_checkpoint == 0 or step == self.last_saved_step:
return
save_model = self.model
if moving_average:
model_params_data = []
for avg, param in zip(moving_average, save_model.parameters()):
model_params_data.append(param.data)
param.data = avg.data
chkpt, chkpt_name = self._save(step, save_model)
self.last_saved_step = step
if moving_average:
for param_data, param in zip(model_params_data,
save_model.parameters()):
param.data = param_data
if self.keep_checkpoint > 0:
if len(self.checkpoint_queue) == self.checkpoint_queue.maxlen:
todel = self.checkpoint_queue.popleft()
self._rm_checkpoint(todel)
self.checkpoint_queue.append(chkpt_name)
def _save(self, step, model):
"""Save a resumable checkpoint.
Args:
step (int): step number
model (nn.Module): the model whose parameters are saved
Returns:
(object, str):
* checkpoint: the saved object
* checkpoint_name: name (or path) of the saved checkpoint
"""
raise NotImplementedError()
def _rm_checkpoint(self, name):
"""Remove a checkpoint
Args:
name (str): name that identifies the checkpoint
(it may be a filepath)
"""
raise NotImplementedError()
class ModelSaver(ModelSaverBase):
"""Simple model saver to filesystem"""
def _save(self, step, model):
model_state_dict = model.state_dict()
model_state_dict = {k: v for k, v in model_state_dict.items()
if 'generator' not in k}
generator_state_dict = model.generator.state_dict()
# NOTE: We need to trim the vocab to remove any unk tokens that
# were not originally here.
vocab = deepcopy(self.fields)
for side in ["src", "tgt"]:
keys_to_pop = []
if hasattr(vocab[side], "fields"):
unk_token = vocab[side].fields[0][1].vocab.itos[0]
for key, value in vocab[side].fields[0][1].vocab.stoi.items():
if value == 0 and key != unk_token:
keys_to_pop.append(key)
for key in keys_to_pop:
vocab[side].fields[0][1].vocab.stoi.pop(key, None)
checkpoint = {
'model': model_state_dict,
'generator': generator_state_dict,
'vocab': vocab,
'opt': self.model_opt,
'optim': self.optim.state_dict(),
}
logger.info("Saving checkpoint %s_step_%d.pt" % (self.base_path, step))
checkpoint_path = '%s_step_%d.pt' % (self.base_path, step)
torch.save(checkpoint, checkpoint_path)
return checkpoint, checkpoint_path
def _rm_checkpoint(self, name):
os.remove(name)
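The `keep_checkpoint` logic in `ModelSaverBase.save` can be followed without touching the filesystem. The sketch below is a minimal simulation, assuming only checkpoint names are tracked. `rotate_checkpoints` is a hypothetical helper and `demo-model` is a placeholder base path; nothing is written or deleted.

```python
from collections import deque


def rotate_checkpoints(steps, keep_checkpoint, base_path="demo-model"):
    """Simulate ModelSaverBase.save; return (kept, removed) checkpoint names."""
    if keep_checkpoint == 0:
        return [], []  # saving is disabled entirely
    queue = deque([], maxlen=keep_checkpoint) if keep_checkpoint > 0 else None
    kept_all, removed = [], []
    last_saved_step = None
    for step in steps:
        if step == last_saved_step:
            continue  # the same step is never saved twice in a row
        name = "%s_step_%d.pt" % (base_path, step)
        last_saved_step = step
        if queue is None:
            kept_all.append(name)  # keep_checkpoint < 0: keep everything
            continue
        if len(queue) == queue.maxlen:
            removed.append(queue.popleft())  # oldest checkpoint is deleted
        queue.append(name)
    kept = list(queue) if queue is not None else kept_all
    return kept, removed


kept, removed = rotate_checkpoints([100, 200, 200, 300, 400], keep_checkpoint=2)
print("kept:", kept)
print("removed:", removed)
```

The oldest entry is popped explicitly before appending so that its name is available for deletion; relying on `maxlen` alone would silently drop the name without removing the file.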
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/models/sru.py
================================================
""" SRU Implementation """
# flake8: noqa
import subprocess
import platform
import os
import re
import configargparse
import torch
import torch.nn as nn
from torch.autograd import Function
from collections import namedtuple
# For command-line option parsing
class CheckSRU(configargparse.Action):
def __init__(self, option_strings, dest, **kwargs):
super(CheckSRU, self).__init__(option_strings, dest, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
if values == 'SRU':
check_sru_requirement(abort=True)
# Check pass, set the args.
setattr(namespace, self.dest, values)
# This SRU version implements its own cuda-level optimization,
# so it requires that:
# 1. `cupy` and `pynvrtc` python package installed.
# 2. pytorch is built with cuda support.
# 3. library path set, e.g.: export LD_LIBRARY_PATH=/usr/local/cuda/lib64.
def check_sru_requirement(abort=False):
"""
Return True if all checks pass; if a check fails and abort is True,
raise an AssertionError, otherwise return False.
"""
# Check 1.
try:
if platform.system() == 'Windows':
subprocess.check_output('pip freeze | findstr cupy', shell=True)
subprocess.check_output('pip freeze | findstr pynvrtc',
shell=True)
else: # Unix-like systems
subprocess.check_output('pip freeze | grep -w cupy', shell=True)
subprocess.check_output('pip freeze | grep -w pynvrtc',
shell=True)
except subprocess.CalledProcessError:
if not abort:
return False
raise AssertionError("Using SRU requires 'cupy' and 'pynvrtc' "
"python packages installed.")
# Check 2.
if torch.cuda.is_available() is False:
if not abort:
return False
raise AssertionError("Using SRU requires pytorch built with cuda.")
# Check 3.
pattern = re.compile(".*cuda/lib.*")
ld_path = os.getenv('LD_LIBRARY_PATH', "")
if re.match(pattern, ld_path) is None:
if not abort:
return False
raise AssertionError("Using SRU requires setting cuda lib path, e.g. "
"export LD_LIBRARY_PATH=/usr/local/cuda/lib64.")
return True
SRU_CODE = """
extern "C" {
__forceinline__ __device__ float sigmoidf(float x)
{
return 1.f / (1.f + expf(-x));
}
__forceinline__ __device__ float reluf(float x)
{
return (x > 0.f) ? x : 0.f;
}
__global__ void sru_fwd(const float * __restrict__ u,
const float * __restrict__ x,
const float * __restrict__ bias,
const float * __restrict__ init,
const float * __restrict__ mask_h,
const int len, const int batch,
const int d, const int k,
float * __restrict__ h,
float * __restrict__ c,
const int activation_type)
{
assert ((k == 3) || (x == NULL));
int ncols = batch*d;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (col >= ncols) return;
int ncols_u = ncols*k;
int ncols_x = (k == 3) ? ncols : ncols_u;
const float bias1 = *(bias + (col%d));
const float bias2 = *(bias + (col%d) + d);
const float mask = (mask_h == NULL) ? 1.0 : (*(mask_h + col));
float cur = *(init + col);
const float *up = u + (col*k);
const float *xp = (k == 3) ? (x + col) : (up + 3);
float *cp = c + col;
float *hp = h + col;
for (int row = 0; row < len; ++row)
{
float g1 = sigmoidf((*(up+1))+bias1);
float g2 = sigmoidf((*(up+2))+bias2);
cur = (cur-(*up))*g1 + (*up);
*cp = cur;
float val = (activation_type == 1) ? tanh(cur) : (
(activation_type == 2) ? reluf(cur) : cur
);
*hp = (val*mask-(*xp))*g2 + (*xp);
up += ncols_u;
xp += ncols_x;
cp += ncols;
hp += ncols;
}
}
__global__ void sru_bwd(const float * __restrict__ u,
const float * __restrict__ x,
const float * __restrict__ bias,
const float * __restrict__ init,
const float * __restrict__ mask_h,
const float * __restrict__ c,
const float * __restrict__ grad_h,
const float * __restrict__ grad_last,
const int len,
const int batch, const int d, const int k,
float * __restrict__ grad_u,
float * __restrict__ grad_x,
float * __restrict__ grad_bias,
float * __restrict__ grad_init,
int activation_type)
{
assert((k == 3) || (x == NULL));
assert((k == 3) || (grad_x == NULL));
int ncols = batch*d;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (col >= ncols) return;
int ncols_u = ncols*k;
int ncols_x = (k == 3) ? ncols : ncols_u;
const float bias1 = *(bias + (col%d));
const float bias2 = *(bias + (col%d) + d);
const float mask = (mask_h == NULL) ? 1.0 : (*(mask_h + col));
float gbias1 = 0;
float gbias2 = 0;
float cur = *(grad_last + col);
const float *up = u + (col*k) + (len-1)*ncols_u;
const float *xp = (k == 3) ? (x + col + (len-1)*ncols) : (up + 3);
const float *cp = c + col + (len-1)*ncols;
const float *ghp = grad_h + col + (len-1)*ncols;
float *gup = grad_u + (col*k) + (len-1)*ncols_u;
float *gxp = (k == 3) ? (grad_x + col + (len-1)*ncols) : (gup + 3);
for (int row = len-1; row >= 0; --row)
{
const float g1 = sigmoidf((*(up+1))+bias1);
const float g2 = sigmoidf((*(up+2))+bias2);
const float c_val = (activation_type == 1) ? tanh(*cp) : (
(activation_type == 2) ? reluf(*cp) : (*cp)
);
const float x_val = *xp;
const float u_val = *up;
const float prev_c_val = (row>0) ? (*(cp-ncols)) : (*(init+col));
const float gh_val = *ghp;
// h = c*g2 + x*(1-g2) = (c-x)*g2 + x
// c = c'*g1 + g0*(1-g1) = (c'-g0)*g1 + g0
// grad wrt x
*gxp = gh_val*(1-g2);
// grad wrt g2, u2 and bias2
float gg2 = gh_val*(c_val*mask-x_val)*(g2*(1-g2));
*(gup+2) = gg2;
gbias2 += gg2;
// grad wrt c
const float tmp = (activation_type == 1) ? (g2*(1-c_val*c_val)) : (
((activation_type == 0) || (c_val > 0)) ? g2 : 0.f
);
const float gc = gh_val*mask*tmp + cur;
// grad wrt u0
*gup = gc*(1-g1);
// grad wrt g1, u1, and bias1
float gg1 = gc*(prev_c_val-u_val)*(g1*(1-g1));
*(gup+1) = gg1;
gbias1 += gg1;
// grad wrt c'
cur = gc*g1;
up -= ncols_u;
xp -= ncols_x;
cp -= ncols;
gup -= ncols_u;
gxp -= ncols_x;
ghp -= ncols;
}
*(grad_bias + col) = gbias1;
*(grad_bias + col + ncols) = gbias2;
*(grad_init +col) = cur;
}
__global__ void sru_bi_fwd(const float * __restrict__ u,
const float * __restrict__ x,
const float * __restrict__ bias,
const float * __restrict__ init,
const float * __restrict__ mask_h,
const int len, const int batch,
const int d, const int k,
float * __restrict__ h,
float * __restrict__ c,
const int activation_type)
{
assert ((k == 3) || (x == NULL));
assert ((k == 3) || (k == 4));
int ncols = batch*d*2;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (col >= ncols) return;
int ncols_u = ncols*k;
int ncols_x = (k == 3) ? ncols : ncols_u;
const float mask = (mask_h == NULL) ? 1.0 : (*(mask_h + col));
float cur = *(init + col);
const int d2 = d*2;
const bool flip = (col%d2) >= d;
const float bias1 = *(bias + (col%d2));
const float bias2 = *(bias + (col%d2) + d2);
const float *up = u + (col*k);
const float *xp = (k == 3) ? (x + col) : (up + 3);
float *cp = c + col;
float *hp = h + col;
if (flip) {
up += (len-1)*ncols_u;
xp += (len-1)*ncols_x;
cp += (len-1)*ncols;
hp += (len-1)*ncols;
}
int ncols_u_ = flip ? -ncols_u : ncols_u;
int ncols_x_ = flip ? -ncols_x : ncols_x;
int ncols_ = flip ? -ncols : ncols;
for (int cnt = 0; cnt < len; ++cnt)
{
float g1 = sigmoidf((*(up+1))+bias1);
float g2 = sigmoidf((*(up+2))+bias2);
cur = (cur-(*up))*g1 + (*up);
*cp = cur;
float val = (activation_type == 1) ? tanh(cur) : (
(activation_type == 2) ? reluf(cur) : cur
);
*hp = (val*mask-(*xp))*g2 + (*xp);
up += ncols_u_;
xp += ncols_x_;
cp += ncols_;
hp += ncols_;
}
}
__global__ void sru_bi_bwd(const float * __restrict__ u,
const float * __restrict__ x,
const float * __restrict__ bias,
const float * __restrict__ init,
const float * __restrict__ mask_h,
const float * __restrict__ c,
const float * __restrict__ grad_h,
const float * __restrict__ grad_last,
const int len, const int batch,
const int d, const int k,
float * __restrict__ grad_u,
float * __restrict__ grad_x,
float * __restrict__ grad_bias,
float * __restrict__ grad_init,
int activation_type)
{
assert((k == 3) || (x == NULL));
assert((k == 3) || (grad_x == NULL));
assert((k == 3) || (k == 4));
int ncols = batch*d*2;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (col >= ncols) return;
int ncols_u = ncols*k;
int ncols_x = (k == 3) ? ncols : ncols_u;
const float mask = (mask_h == NULL) ? 1.0 : (*(mask_h + col));
float gbias1 = 0;
float gbias2 = 0;
float cur = *(grad_last + col);
const int d2 = d*2;
const bool flip = ((col%d2) >= d);
const float bias1 = *(bias + (col%d2));
const float bias2 = *(bias + (col%d2) + d2);
const float *up = u + (col*k);
const float *xp = (k == 3) ? (x + col) : (up + 3);
const float *cp = c + col;
const float *ghp = grad_h + col;
float *gup = grad_u + (col*k);
float *gxp = (k == 3) ? (grad_x + col) : (gup + 3);
if (!flip) {
up += (len-1)*ncols_u;
xp += (len-1)*ncols_x;
cp += (len-1)*ncols;
ghp += (len-1)*ncols;
gup += (len-1)*ncols_u;
gxp += (len-1)*ncols_x;
}
int ncols_u_ = flip ? -ncols_u : ncols_u;
int ncols_x_ = flip ? -ncols_x : ncols_x;
int ncols_ = flip ? -ncols : ncols;
for (int cnt = 0; cnt < len; ++cnt)
{
const float g1 = sigmoidf((*(up+1))+bias1);
const float g2 = sigmoidf((*(up+2))+bias2);
const float c_val = (activation_type == 1) ? tanh(*cp) : (
(activation_type == 2) ? reluf(*cp) : (*cp)
);
const float x_val = *xp;
const float u_val = *up;
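const float prev_c_val = (cnt<len-1) ? (*(cp-ncols_)) : (*(init+col));
const float gh_val = *ghp;
// h = c*g2 + x*(1-g2) = (c-x)*g2 + x
// c = c'*g1 + g0*(1-g1) = (c'-g0)*g1 + g0
// grad wrt x
*gxp = gh_val*(1-g2);
// grad wrt g2, u2 and bias2
float gg2 = gh_val*(c_val*mask-x_val)*(g2*(1-g2));
*(gup+2) = gg2;
gbias2 += gg2;
// grad wrt c
const float tmp = (activation_type == 1) ? (g2*(1-c_val*c_val)) : (
((activation_type == 0) || (c_val > 0)) ? g2 : 0.f
);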
const float gc = gh_val*mask*tmp + cur;
// grad wrt u0
*gup = gc*(1-g1);
// grad wrt g1, u1, and bias1
float gg1 = gc*(prev_c_val-u_val)*(g1*(1-g1));
*(gup+1) = gg1;
gbias1 += gg1;
// grad wrt c'
cur = gc*g1;
up -= ncols_u_;
xp -= ncols_x_;
cp -= ncols_;
gup -= ncols_u_;
gxp -= ncols_x_;
ghp -= ncols_;
}
*(grad_bias + col) = gbias1;
*(grad_bias + col + ncols) = gbias2;
*(grad_init +col) = cur;
}
}
"""
SRU_FWD_FUNC, SRU_BWD_FUNC = None, None
SRU_BiFWD_FUNC, SRU_BiBWD_FUNC = None, None
SRU_STREAM = None
def load_sru_mod():
global SRU_FWD_FUNC, SRU_BWD_FUNC, SRU_BiFWD_FUNC, SRU_BiBWD_FUNC
global SRU_STREAM
if check_sru_requirement():
from cupy.cuda import function
from pynvrtc.compiler import Program
# This sets up device to use.
device = torch.device("cuda")
tmp_ = torch.rand(1, 1).to(device)
sru_prog = Program(SRU_CODE.encode('utf-8'),
'sru_prog.cu'.encode('utf-8'))
sru_ptx = sru_prog.compile()
sru_mod = function.Module()
sru_mod.load(bytes(sru_ptx.encode()))
SRU_FWD_FUNC = sru_mod.get_function('sru_fwd')
SRU_BWD_FUNC = sru_mod.get_function('sru_bwd')
SRU_BiFWD_FUNC = sru_mod.get_function('sru_bi_fwd')
SRU_BiBWD_FUNC = sru_mod.get_function('sru_bi_bwd')
stream = namedtuple('Stream', ['ptr'])
SRU_STREAM = stream(ptr=torch.cuda.current_stream().cuda_stream)
class SRU_Compute(Function):
def __init__(self, activation_type, d_out, bidirectional=False):
SRU_Compute.maybe_load_sru_mod()
super(SRU_Compute, self).__init__()
self.activation_type = activation_type
self.d_out = d_out
self.bidirectional = bidirectional
@staticmethod
def maybe_load_sru_mod():
global SRU_FWD_FUNC
if SRU_FWD_FUNC is None:
load_sru_mod()
def forward(self, u, x, bias, init=None, mask_h=None):
bidir = 2 if self.bidirectional else 1
length = x.size(0) if x.dim() == 3 else 1
batch = x.size(-2)
d = self.d_out
k = u.size(-1) // d
k_ = k // 2 if self.bidirectional else k
ncols = batch * d * bidir
thread_per_block = min(512, ncols)
num_block = (ncols - 1) // thread_per_block + 1
init_ = x.new(ncols).zero_() if init is None else init
size = (length, batch, d * bidir) if x.dim() == 3 else (batch, d * bidir)
c = x.new(*size)
h = x.new(*size)
FUNC = SRU_FWD_FUNC if not self.bidirectional else SRU_BiFWD_FUNC
FUNC(args=[
u.contiguous().data_ptr(),
x.contiguous().data_ptr() if k_ == 3 else 0,
bias.data_ptr(),
init_.contiguous().data_ptr(),
mask_h.data_ptr() if mask_h is not None else 0,
length,
batch,
d,
k_,
h.data_ptr(),
c.data_ptr(),
self.activation_type],
block=(thread_per_block, 1, 1), grid=(num_block, 1, 1),
stream=SRU_STREAM
)
self.save_for_backward(u, x, bias, init, mask_h)
self.intermediate = c
if x.dim() == 2:
last_hidden = c
elif self.bidirectional:
# -> directions x batch x dim
last_hidden = torch.stack((c[-1, :, :d], c[0, :, d:]))
else:
last_hidden = c[-1]
return h, last_hidden
def backward(self, grad_h, grad_last):
if self.bidirectional:
grad_last = torch.cat((grad_last[0], grad_last[1]), 1)
bidir = 2 if self.bidirectional else 1
u, x, bias, init, mask_h = self.saved_tensors
c = self.intermediate
length = x.size(0) if x.dim() == 3 else 1
batch = x.size(-2)
d = self.d_out
k = u.size(-1) // d
k_ = k // 2 if self.bidirectional else k
ncols = batch * d * bidir
thread_per_block = min(512, ncols)
num_block = (ncols - 1) // thread_per_block + 1
init_ = x.new(ncols).zero_() if init is None else init
grad_u = u.new(*u.size())
grad_bias = x.new(2, batch, d * bidir)
grad_init = x.new(batch, d * bidir)
# For DEBUG
# size = (length, batch, x.size(-1)) \
# if x.dim() == 3 else (batch, x.size(-1))
# grad_x = x.new(*x.size()) if k_ == 3 else x.new(*size).zero_()
# Normal use
grad_x = x.new(*x.size()) if k_ == 3 else None
FUNC = SRU_BWD_FUNC if not self.bidirectional else SRU_BiBWD_FUNC
FUNC(args=[
u.contiguous().data_ptr(),
x.contiguous().data_ptr() if k_ == 3 else 0,
bias.data_ptr(),
init_.contiguous().data_ptr(),
mask_h.data_ptr() if mask_h is not None else 0,
c.data_ptr(),
grad_h.contiguous().data_ptr(),
grad_last.contiguous().data_ptr(),
length,
batch,
d,
k_,
grad_u.data_ptr(),
grad_x.data_ptr() if k_ == 3 else 0,
grad_bias.data_ptr(),
grad_init.data_ptr(),
self.activation_type],
block=(thread_per_block, 1, 1), grid=(num_block, 1, 1),
stream=SRU_STREAM
)
return grad_u, grad_x, grad_bias.sum(1).view(-1), grad_init, None
class SRUCell(nn.Module):
def __init__(self, n_in, n_out, dropout=0, rnn_dropout=0,
bidirectional=False, use_tanh=1, use_relu=0):
super(SRUCell, self).__init__()
self.n_in = n_in
self.n_out = n_out
self.rnn_dropout = rnn_dropout
self.dropout = dropout
self.bidirectional = bidirectional
self.activation_type = 2 if use_relu else (1 if use_tanh else 0)
out_size = n_out * 2 if bidirectional else n_out
k = 4 if n_in != out_size else 3
self.size_per_dir = n_out * k
self.weight = nn.Parameter(torch.Tensor(
n_in,
self.size_per_dir * 2 if bidirectional else self.size_per_dir
))
self.bias = nn.Parameter(torch.Tensor(
n_out * 4 if bidirectional else n_out * 2
))
self.init_weight()
def init_weight(self):
val_range = (3.0 / self.n_in)**0.5
self.weight.data.uniform_(-val_range, val_range)
self.bias.data.zero_()
def set_bias(self, bias_val=0):
n_out = self.n_out
if self.bidirectional:
self.bias.data[n_out * 2:].zero_().add_(bias_val)
else:
self.bias.data[n_out:].zero_().add_(bias_val)
def forward(self, input, c0=None):
assert input.dim() == 2 or input.dim() == 3
n_in, n_out = self.n_in, self.n_out
batch = input.size(-2)
if c0 is None:
c0 = input.data.new(
batch, n_out if not self.bidirectional else n_out * 2
).zero_()
if self.training and (self.rnn_dropout > 0):
mask = self.get_dropout_mask_((batch, n_in), self.rnn_dropout)
x = input * mask.expand_as(input)
else:
x = input
x_2d = x if x.dim() == 2 else x.contiguous().view(-1, n_in)
u = x_2d.mm(self.weight)
if self.training and (self.dropout > 0):
bidir = 2 if self.bidirectional else 1
mask_h = self.get_dropout_mask_(
(batch, n_out * bidir), self.dropout)
h, c = SRU_Compute(self.activation_type, n_out,
self.bidirectional)(
u, input, self.bias, c0, mask_h
)
else:
h, c = SRU_Compute(self.activation_type, n_out,
self.bidirectional)(
u, input, self.bias, c0
)
return h, c
def get_dropout_mask_(self, size, p):
w = self.weight.data
return w.new(*size).bernoulli_(1 - p).div_(1 - p)
class SRU(nn.Module):
"""
Implementation of "Training RNNs as Fast as CNNs"
:cite:`DBLP:journals/corr/abs-1709-02755`
TODO: turn to pytorch's implementation when it is available.
This implementation is adapted from the author of the paper:
https://github.com/taolei87/sru/blob/master/cuda_functional.py.
Args:
input_size (int): input to model
hidden_size (int): hidden dimension
num_layers (int): number of layers
dropout (float): dropout to use (stacked)
rnn_dropout (float): dropout to use (recurrent)
bidirectional (bool): bidirectional
use_tanh (bool): activation
use_relu (bool): activation
"""
def __init__(self, input_size, hidden_size,
num_layers=2, dropout=0, rnn_dropout=0,
bidirectional=False, use_tanh=1, use_relu=0):
# An entry check here, will catch on train side and translate side
# if requirements are not satisfied.
check_sru_requirement(abort=True)
super(SRU, self).__init__()
self.n_in = input_size
self.n_out = hidden_size
self.depth = num_layers
self.dropout = dropout
self.rnn_dropout = rnn_dropout
self.rnn_lst = nn.ModuleList()
self.bidirectional = bidirectional
self.out_size = hidden_size * 2 if bidirectional else hidden_size
for i in range(num_layers):
sru_cell = SRUCell(
n_in=self.n_in if i == 0 else self.out_size,
n_out=self.n_out,
dropout=dropout if i + 1 != num_layers else 0,
rnn_dropout=rnn_dropout,
bidirectional=bidirectional,
use_tanh=use_tanh,
use_relu=use_relu,
)
self.rnn_lst.append(sru_cell)
def set_bias(self, bias_val=0):
for l in self.rnn_lst:
l.set_bias(bias_val)
def forward(self, input, c0=None, return_hidden=True):
assert input.dim() == 3 # (len, batch, n_in)
dir_ = 2 if self.bidirectional else 1
if c0 is None:
zeros = input.data.new(
input.size(1), self.n_out * dir_
).zero_()
c0 = [zeros for i in range(self.depth)]
else:
if isinstance(c0, tuple):
# RNNDecoderState wraps hidden as a tuple.
c0 = c0[0]
assert c0.dim() == 3 # (depth, batch, dir_*n_out)
c0 = [h.squeeze(0) for h in c0.chunk(self.depth, 0)]
prevx = input
lstc = []
for i, rnn in enumerate(self.rnn_lst):
h, c = rnn(prevx, c0[i])
prevx = h
lstc.append(c)
if self.bidirectional:
# fh -> (layers*directions) x batch x dim
fh = torch.cat(lstc)
else:
fh = torch.stack(lstc)
if return_hidden:
return prevx, fh
else:
return prevx
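The recurrence computed per column by the `sru_fwd` kernel is easier to read without the pointer arithmetic. The following is a pure-Python sketch of that loop for a single column, assuming `k == 3` so that the highway input `x` is passed separately. `sru_forward_column` and its argument names are hypothetical, and the sketch is for reading along rather than a replacement for the CUDA code.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sru_forward_column(u_rows, x_rows, bias1=0.0, bias2=0.0,
                       c0=0.0, mask=1.0, activation="tanh"):
    """Run the SRU recurrence over time for one (batch, dim) column.

    u_rows holds one (u0, u1, u2) triple per time step, x_rows one input
    value per time step. Returns the lists of hidden and cell values.
    """
    c = c0
    hs, cs = [], []
    for (u0, u1, u2), x in zip(u_rows, x_rows):
        g1 = sigmoid(u1 + bias1)        # forget gate
        g2 = sigmoid(u2 + bias2)        # highway (reset) gate
        c = (c - u0) * g1 + u0          # c = c_prev*g1 + u0*(1-g1)
        if activation == "tanh":
            val = math.tanh(c)
        elif activation == "relu":
            val = max(c, 0.0)
        else:
            val = c                     # identity activation
        h = (val * mask - x) * g2 + x   # h = val*mask*g2 + x*(1-g2)
        cs.append(c)
        hs.append(h)
    return hs, cs


hs, cs = sru_forward_column([(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)], [0.0, 0.0],
                            activation="none")
print(hs, cs)
```

Because the gates depend only on `u` and the biases, never on the previous hidden state, the matrix product that produces `u` can be computed for all time steps at once; only this cheap elementwise loop is sequential.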
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/models/stacked_rnn.py
================================================
""" Implementation of ONMT RNN for Input Feeding Decoding """
import torch
import torch.nn as nn
class StackedLSTM(nn.Module):
"""
Our own implementation of stacked LSTM.
Needed for the decoder, because we do input feeding.
"""
def __init__(self, num_layers, input_size, rnn_size, dropout):
super(StackedLSTM, self).__init__()
self.dropout = nn.Dropout(dropout)
self.num_layers = num_layers
self.layers = nn.ModuleList()
for _ in range(num_layers):
self.layers.append(nn.LSTMCell(input_size, rnn_size))
input_size = rnn_size
def forward(self, input_feed, hidden):
h_0, c_0 = hidden
h_1, c_1 = [], []
for i, layer in enumerate(self.layers):
h_1_i, c_1_i = layer(input_feed, (h_0[i], c_0[i]))
input_feed = h_1_i
if i + 1 != self.num_layers:
input_feed = self.dropout(input_feed)
h_1 += [h_1_i]
c_1 += [c_1_i]
h_1 = torch.stack(h_1)
c_1 = torch.stack(c_1)
return input_feed, (h_1, c_1)
class StackedGRU(nn.Module):
"""
Our own implementation of stacked GRU.
Needed for the decoder, because we do input feeding.
"""
def __init__(self, num_layers, input_size, rnn_size, dropout):
super(StackedGRU, self).__init__()
self.dropout = nn.Dropout(dropout)
self.num_layers = num_layers
self.layers = nn.ModuleList()
for _ in range(num_layers):
self.layers.append(nn.GRUCell(input_size, rnn_size))
input_size = rnn_size
def forward(self, input_feed, hidden):
h_1 = []
for i, layer in enumerate(self.layers):
h_1_i = layer(input_feed, hidden[0][i])
input_feed = h_1_i
if i + 1 != self.num_layers:
input_feed = self.dropout(input_feed)
h_1 += [h_1_i]
h_1 = torch.stack(h_1)
return input_feed, (h_1,)
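Both stacked classes follow the same single-time-step pattern: each layer consumes the output of the layer below, and dropout is applied between layers but not after the top one. The sketch below shows that control flow, assuming toy scalar "cells" in place of `nn.GRUCell`. `stacked_step` and the lambda cells are hypothetical stand-ins.

```python
def stacked_step(cells, input_feed, hidden, dropout=lambda v: v):
    """Advance a stack of cells by one time step.

    cells:   list of callables cell(input, h_prev) -> h_new
    hidden:  list with the previous hidden value of each layer
    dropout: callable applied between layers (identity by default)
    """
    new_hidden = []
    for i, cell in enumerate(cells):
        h_i = cell(input_feed, hidden[i])
        input_feed = h_i
        if i + 1 != len(cells):
            input_feed = dropout(input_feed)  # never after the top layer
        new_hidden.append(h_i)
    return input_feed, new_hidden


toy_cell = lambda x, h: x + h  # toy stand-in for a recurrent cell
out, new_hidden = stacked_step([toy_cell] * 3, 1.0, [10.0, 20.0, 30.0])
print(out, new_hidden)
```

Stepping one time step at a time is what allows the decoder to concatenate the previous attentional output to the next input (input feeding), which a fused multi-step RNN call would not permit.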
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/__init__.py
================================================
""" Attention and normalization modules """
from onmt.modules.util_class import Elementwise
from onmt.modules.gate import context_gate_factory, ContextGate
from onmt.modules.global_attention import GlobalAttention
from onmt.modules.conv_multi_step_attention import ConvMultiStepAttention
from onmt.modules.copy_generator import CopyGenerator, CopyGeneratorLoss, \
CopyGeneratorLossCompute
from onmt.modules.multi_headed_attn import MultiHeadedAttention
from onmt.modules.embeddings import Embeddings, PositionalEncoding, \
VecEmbedding
from onmt.modules.weight_norm import WeightNormConv2d
from onmt.modules.average_attn import AverageAttention
__all__ = ["Elementwise", "context_gate_factory", "ContextGate",
"GlobalAttention", "ConvMultiStepAttention", "CopyGenerator",
"CopyGeneratorLoss", "CopyGeneratorLossCompute",
"MultiHeadedAttention", "Embeddings", "PositionalEncoding",
"WeightNormConv2d", "AverageAttention", "VecEmbedding"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/average_attn.py
================================================
# -*- coding: utf-8 -*-
"""Average Attention module."""
import torch
import torch.nn as nn
from onmt.modules.position_ffn import PositionwiseFeedForward
class AverageAttention(nn.Module):
"""
Average Attention module from
"Accelerating Neural Transformer via an Average Attention Network"
:cite:`DBLP:journals/corr/abs-1805-00631`.
Args:
model_dim (int): the dimension of keys/values/queries,
must be divisible by head_count
dropout (float): dropout parameter
"""
def __init__(self, model_dim, dropout=0.1, aan_useffn=False):
self.model_dim = model_dim
self.aan_useffn = aan_useffn
super(AverageAttention, self).__init__()
if aan_useffn:
self.average_layer = PositionwiseFeedForward(model_dim, model_dim,
dropout)
self.gating_layer = nn.Linear(model_dim * 2, model_dim * 2)
def cumulative_average_mask(self, batch_size, inputs_len, device):
"""
Builds the mask to compute the cumulative average as described in
:cite:`DBLP:journals/corr/abs-1805-00631` -- Figure 3
Args:
batch_size (int): batch size
inputs_len (int): length of the inputs
Returns:
(FloatTensor):
* A Tensor of shape ``(batch_size, input_len, input_len)``
"""
triangle = torch.tril(torch.ones(inputs_len, inputs_len,
dtype=torch.float, device=device))
weights = torch.ones(1, inputs_len, dtype=torch.float, device=device) \
/ torch.arange(1, inputs_len + 1, dtype=torch.float, device=device)
mask = triangle * weights.transpose(0, 1)
return mask.unsqueeze(0).expand(batch_size, inputs_len, inputs_len)
def cumulative_average(self, inputs, mask_or_step,
layer_cache=None, step=None):
"""
Computes the cumulative average as described in
:cite:`DBLP:journals/corr/abs-1805-00631` -- Equations (1) (5) (6)
Args:
inputs (FloatTensor): sequence to average
``(batch_size, input_len, dimension)``
mask_or_step: if cache is set, this is assumed
to be the current step of the
dynamic decoding. Otherwise, it is the mask matrix
used to compute the cumulative average.
layer_cache: a dictionary containing the cumulative average
of the previous step.
Returns:
a tensor of the same shape and type as ``inputs``.
"""
if layer_cache is not None:
step = mask_or_step
average_attention = (inputs + step *
layer_cache["prev_g"]) / (step + 1)
layer_cache["prev_g"] = average_attention
return average_attention
else:
mask = mask_or_step
return torch.matmul(mask.to(inputs.dtype), inputs)
def forward(self, inputs, mask=None, layer_cache=None, step=None):
"""
Args:
inputs (FloatTensor): ``(batch_size, input_len, model_dim)``
Returns:
(FloatTensor, FloatTensor):
* gating_outputs ``(batch_size, input_len, model_dim)``
* average_outputs average attention
``(batch_size, input_len, model_dim)``
"""
batch_size = inputs.size(0)
inputs_len = inputs.size(1)
average_outputs = self.cumulative_average(
inputs, self.cumulative_average_mask(batch_size,
inputs_len, inputs.device)
if layer_cache is None else step, layer_cache=layer_cache)
if self.aan_useffn:
average_outputs = self.average_layer(average_outputs)
gating_outputs = self.gating_layer(torch.cat((inputs,
average_outputs), -1))
input_gate, forget_gate = torch.chunk(gating_outputs, 2, dim=2)
gating_outputs = torch.sigmoid(input_gate) * inputs + \
torch.sigmoid(forget_gate) * average_outputs
return gating_outputs, average_outputs
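`cumulative_average` has two code paths that should agree: a mask multiplication during training and an incremental update during step-by-step decoding. The following pure-Python sketch checks this on a scalar sequence, assuming lists in place of tensors. The function names mirror the methods above, but the implementations are simplified illustrations.

```python
def cumulative_average_mask(inputs_len):
    """Lower-triangular matrix whose row i holds 1/(i+1) in columns 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(inputs_len)]
            for i in range(inputs_len)]


def masked_average(values):
    """Training path: multiply the mask with the whole sequence at once."""
    mask = cumulative_average_mask(len(values))
    return [sum(m * v for m, v in zip(row, values)) for row in mask]


def incremental_average(values):
    """Decoding path: update the cached average one step at a time."""
    prev_g = 0.0
    outputs = []
    for step, x in enumerate(values):
        prev_g = (x + step * prev_g) / (step + 1)
        outputs.append(prev_g)
    return outputs


values = [2.0, 4.0, 6.0]
print(masked_average(values))
print(incremental_average(values))
```

Both paths yield the running mean of the prefix ending at each position, which is why only the previous average (`prev_g`) has to be cached during decoding.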
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/conv_multi_step_attention.py
================================================
""" Multi Step Attention for CNN """
import torch
import torch.nn as nn
import torch.nn.functional as F
from onmt.utils.misc import aeq
SCALE_WEIGHT = 0.5 ** 0.5
def seq_linear(linear, x):
""" linear transform for 3-d tensor """
batch, hidden_size, length, _ = x.size()
h = linear(torch.transpose(x, 1, 2).contiguous().view(
batch * length, hidden_size))
return torch.transpose(h.view(batch, length, hidden_size, 1), 1, 2)
class ConvMultiStepAttention(nn.Module):
"""
Conv attention takes a key matrix, a value matrix and a query vector.
The attention weights are computed from the key matrix and the query
vector, and are then used to take a weighted sum over the value matrix.
The same operation is applied in each decoder conv layer.
"""
def __init__(self, input_size):
super(ConvMultiStepAttention, self).__init__()
self.linear_in = nn.Linear(input_size, input_size)
self.mask = None
def apply_mask(self, mask):
""" Apply mask """
self.mask = mask
def forward(self, base_target_emb, input_from_dec, encoder_out_top,
encoder_out_combine):
"""
Args:
base_target_emb: target emb tensor
input_from_dec: output of decode conv
encoder_out_top: the key matrix for calculation of attention weight,
which is the top output of encode conv
encoder_out_combine:
the value matrix for the attention-weighted sum,
which is the combination of base emb and top output of encode
"""
# checks
# batch, channel, height, width = base_target_emb.size()
batch, _, height, _ = base_target_emb.size()
# batch_, channel_, height_, width_ = input_from_dec.size()
batch_, _, height_, _ = input_from_dec.size()
aeq(batch, batch_)
aeq(height, height_)
# enc_batch, enc_channel, enc_height = encoder_out_top.size()
enc_batch, _, enc_height = encoder_out_top.size()
# enc_batch_, enc_channel_, enc_height_ = encoder_out_combine.size()
enc_batch_, _, enc_height_ = encoder_out_combine.size()
aeq(enc_batch, enc_batch_)
aeq(enc_height, enc_height_)
preatt = seq_linear(self.linear_in, input_from_dec)
target = (base_target_emb + preatt) * SCALE_WEIGHT
target = torch.squeeze(target, 3)
target = torch.transpose(target, 1, 2)
pre_attn = torch.bmm(target, encoder_out_top)
if self.mask is not None:
pre_attn.data.masked_fill_(self.mask, -float('inf'))
attn = F.softmax(pre_attn, dim=2)
context_output = torch.bmm(
attn, torch.transpose(encoder_out_combine, 1, 2))
context_output = torch.transpose(
torch.unsqueeze(context_output, 3), 1, 2)
return context_output, attn
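The masking step in `ConvMultiStepAttention.forward` fills masked positions with negative infinity before the softmax, so they receive exactly zero weight. A pure-Python sketch for a single row of scores follows, assuming a boolean mask where `True` marks a padded source position. `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over one row of scores, giving zero weight where mask is True."""
    filled = [-math.inf if m else s for s, m in zip(scores, mask)]
    top = max(filled)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([1.0, 1.0, 5.0], [False, False, True])
print(weights)
```

The sketch assumes at least one position is unmasked; with every position masked, the subtraction `-inf - (-inf)` would produce NaN.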
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/copy_generator.py
================================================
import torch
import torch.nn as nn
from onmt.utils.misc import aeq
from onmt.utils.loss import NMTLossCompute
def collapse_copy_scores(scores, batch, tgt_vocab, src_vocabs=None,
batch_dim=1, batch_offset=None):
"""
Given scores from an expanded dictionary corresponding to a batch,
sums the score of each copied source word into the score of the
matching target dictionary word when the two are ambiguous.
"""
offset = len(tgt_vocab)
for b in range(scores.size(batch_dim)):
blank = []
fill = []
if src_vocabs is None:
src_vocab = batch.src_ex_vocab[b]
else:
batch_id = batch_offset[b] if batch_offset is not None else b
index = batch.indices.data[batch_id]
src_vocab = src_vocabs[index]
for i in range(1, len(src_vocab)):
sw = src_vocab.itos[i]
ti = tgt_vocab.stoi[sw]
if ti != 0:
blank.append(offset + i)
fill.append(ti)
if blank:
blank = torch.Tensor(blank).type_as(batch.indices.data)
fill = torch.Tensor(fill).type_as(batch.indices.data)
score = scores[:, b] if batch_dim == 1 else scores[b]
score.index_add_(1, fill, score.index_select(1, blank))
score.index_fill_(1, blank, 1e-10)
return scores
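# Illustrative sketch (not part of the original module): a plain-Python
# analogue of ``collapse_copy_scores`` for a single row of scores. The helper
# name and the dict/list inputs are assumptions made for this example; the
# real function works in place on tensors and vocabulary objects.
def _collapse_copy_scores_sketch(row, tgt_stoi, src_itos, unk_idx=0):
    """Fold copy scores into the matching target-vocabulary entries.

    row: scores over the extended vocab (target words first, then one
        slot per source-vocab entry).
    tgt_stoi: dict mapping target words to indices.
    src_itos: list of source-vocab words (index 0 is skipped, as above).
    """
    offset = len(tgt_stoi)
    out = list(row)
    for i in range(1, len(src_itos)):
        ti = tgt_stoi.get(src_itos[i], unk_idx)
        if ti != unk_idx:
            out[ti] += out[offset + i]  # move the copy mass onto the word
            out[offset + i] = 1e-10     # near-zero filler, as above
    return out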
class CopyGenerator(nn.Module):
"""An implementation of pointer-generator networks
:cite:`DBLP:journals/corr/SeeLM17`.
These networks consider copying words
directly from the source sequence.
The copy generator is an extended version of the standard
generator that computes three values.
* :math:`p_{softmax}` the standard softmax over `tgt_dict`
* :math:`p(z)` the probability of copying a word from
the source
* :math:`p_{copy}` the probability of copying a particular word,
taken from the attention distribution directly.
The model returns a distribution over the extended dictionary,
computed as
:math:`p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)`
.. mermaid::
graph BT
A[input]
S[src_map]
B[softmax]
BB[switch]
C[attn]
D[copy]
O[output]
A --> B
A --> BB
S --> D
C --> D
D --> O
B --> O
BB --> O
Args:
input_size (int): size of input representation
output_size (int): size of output vocabulary
pad_idx (int): index of the padding token; its logit is set to -inf
"""
def __init__(self, input_size, output_size, pad_idx):
super(CopyGenerator, self).__init__()
self.linear = nn.Linear(input_size, output_size)
self.linear_copy = nn.Linear(input_size, 1)
self.pad_idx = pad_idx
def forward(self, hidden, attn, src_map):
"""
Compute a distribution over the target dictionary
extended by the dynamic dictionary implied by copying
source words.
Args:
hidden (FloatTensor): hidden outputs ``(batch x tlen, input_size)``
attn (FloatTensor): attention over source tokens for each
target position ``(batch x tlen, src_len)``
src_map (FloatTensor):
A sparse indicator matrix mapping each source word to
its index in the "extended" vocab containing the copied words.
``(src_len, batch, extra_words)``
"""
# CHECKS
batch_by_tlen, _ = hidden.size()
batch_by_tlen_, slen = attn.size()
slen_, batch, cvocab = src_map.size()
aeq(batch_by_tlen, batch_by_tlen_)
aeq(slen, slen_)
# Original probabilities.
logits = self.linear(hidden)
logits[:, self.pad_idx] = -float('inf')
prob = torch.softmax(logits, 1)
# Probability of copying p(z=1) batch.
p_copy = torch.sigmoid(self.linear_copy(hidden))
# Probability of not copying: p_{word}(w) * (1 - p(z))
out_prob = torch.mul(prob, 1 - p_copy)
mul_attn = torch.mul(attn, p_copy)
copy_prob = torch.bmm(
mul_attn.view(-1, batch, slen).transpose(0, 1),
src_map.transpose(0, 1)
).transpose(0, 1)
copy_prob = copy_prob.contiguous().view(-1, cvocab)
return torch.cat([out_prob, copy_prob], 1)
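# Illustrative sketch (not part of the original module): the mixture
# p(w) = p(z=1) p_copy(w) + p(z=0) p_softmax(w) for one decoding step, in
# plain Python. The helper name and the ``src_to_ext`` list (a dense stand-in
# for ``src_map``) are assumptions; padding masking is omitted for brevity.
import math


def _copy_mixture_sketch(logits, copy_logit, attn, src_to_ext, n_extra):
    """Return generation probabilities followed by copy probabilities.

    logits: vocabulary logits for one target position.
    copy_logit: scalar fed to the sigmoid copy switch.
    attn: attention weights over source positions (sums to 1).
    src_to_ext: for each source position, its slot in the copy vocabulary.
    n_extra: number of copy slots.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    p_softmax = [e / total for e in exps]
    p_z = 1.0 / (1.0 + math.exp(-copy_logit))  # p(z=1)
    gen = [(1.0 - p_z) * p for p in p_softmax]
    copy = [0.0] * n_extra
    for pos, a in enumerate(attn):
        copy[src_to_ext[pos]] += p_z * a
    return gen + copy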
class CopyGeneratorLoss(nn.Module):
"""Copy generator criterion."""
def __init__(self, vocab_size, force_copy, unk_index=0,
ignore_index=-100, eps=1e-20):
super(CopyGeneratorLoss, self).__init__()
self.force_copy = force_copy
self.eps = eps
self.vocab_size = vocab_size
self.ignore_index = ignore_index
self.unk_index = unk_index
def forward(self, scores, align, target):
"""
Args:
scores (FloatTensor): ``(batch_size*tgt_len)`` x dynamic vocab size
whose sum along dim 1 is less than or equal to 1, i.e. cols
softmaxed.
align (LongTensor): ``(batch_size x tgt_len)``
target (LongTensor): ``(batch_size x tgt_len)``
"""
# probabilities assigned by the model to the gold targets
vocab_probs = scores.gather(1, target.unsqueeze(1)).squeeze(1)
# probability of tokens copied from source
copy_ix = align.unsqueeze(1) + self.vocab_size
copy_tok_probs = scores.gather(1, copy_ix).squeeze(1)
# Set scores for unk to 0 and add eps
copy_tok_probs[align == self.unk_index] = 0
copy_tok_probs += self.eps # to avoid -inf logs
# find the indices in which you do not use the copy mechanism
non_copy = align == self.unk_index
if not self.force_copy:
non_copy = non_copy | (target != self.unk_index)
probs = torch.where(
non_copy, copy_tok_probs + vocab_probs, copy_tok_probs
)
loss = -probs.log() # just NLLLoss; can the module be incorporated?
# Drop padding.
loss[target == self.ignore_index] = 0
return loss
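# Illustrative sketch (not part of the original module): the per-token rule
# applied by ``CopyGeneratorLoss.forward``, written for a single token in
# plain Python. The helper name is an assumption; the early return for
# ignored targets replaces the tensor masking done above.
import math


def _copy_nll_sketch(scores, align, target, vocab_size, force_copy=False,
                     unk_index=0, ignore_index=-100, eps=1e-20):
    """Negative log-likelihood of one gold token under the copy model."""
    if target == ignore_index:
        return 0.0
    vocab_prob = scores[target]
    # A token aligned to <unk> cannot be copied.
    copy_prob = 0.0 if align == unk_index else scores[vocab_size + align]
    copy_prob += eps  # avoid log(0)
    non_copy = align == unk_index
    if not force_copy:
        non_copy = non_copy or target != unk_index
    prob = copy_prob + vocab_prob if non_copy else copy_prob
    return -math.log(prob)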
class CopyGeneratorLossCompute(NMTLossCompute):
"""Copy Generator Loss Computation."""
def __init__(self, criterion, generator, tgt_vocab, normalize_by_length,
lambda_coverage=0.0):
super(CopyGeneratorLossCompute, self).__init__(
criterion, generator, lambda_coverage=lambda_coverage)
self.tgt_vocab = tgt_vocab
self.normalize_by_length = normalize_by_length
def _make_shard_state(self, batch, output, range_, attns):
"""See base class for args description."""
if getattr(batch, "alignment", None) is None:
raise AssertionError("using -copy_attn you need to pass in "
"-dynamic_dict during preprocess stage.")
shard_state = super(CopyGeneratorLossCompute, self)._make_shard_state(
batch, output, range_, attns)
shard_state.update({
"copy_attn": attns.get("copy"),
"align": batch.alignment[range_[0] + 1: range_[1]]
})
return shard_state
def _compute_loss(self, batch, output, target, copy_attn, align,
std_attn=None, coverage_attn=None):
"""Compute the loss.
The args must match :func:`self._make_shard_state()`.
Args:
batch: the current batch.
output: the predicted output from the model.
target: the gold target to compare the output with.
copy_attn: the copy attention value.
align: the align info.
"""
target = target.view(-1)
align = align.view(-1)
scores = self.generator(
self._bottle(output), self._bottle(copy_attn), batch.src_map
)
loss = self.criterion(scores, align, target)
if self.lambda_coverage != 0.0:
coverage_loss = self._compute_coverage_loss(std_attn,
coverage_attn)
loss += coverage_loss
# this block does not depend on the loss value computed above
# and is used only for stats
scores_data = collapse_copy_scores(
self._unbottle(scores.clone(), batch.batch_size),
batch, self.tgt_vocab, None)
scores_data = self._bottle(scores_data)
# Also used only for stats: where the target is <unk> but the token
# can be copied, point the target at the copy token in the extended
# vocabulary, i.e. tgt[i] = align[i] + len(tgt_vocab)
# for i such that tgt[i] == 0 and align[i] != 0
target_data = target.clone()
unk = self.criterion.unk_index
correct_mask = (target_data == unk) & (align != unk)
offset_align = align[correct_mask] + len(self.tgt_vocab)
target_data[correct_mask] += offset_align
# Sum the per-token losses for stats
stats = self._stats(loss.sum().clone(), scores_data, target_data)
# this part looks like it belongs in CopyGeneratorLoss
if self.normalize_by_length:
# Compute Loss as NLL divided by seq length
tgt_lens = batch.tgt[:, :, 0].ne(self.padding_idx).sum(0).float()
# Compute Total Loss per sequence in batch
loss = loss.view(-1, batch.batch_size).sum(0)
# Divide by length of each sequence and sum
loss = torch.div(loss, tgt_lens).sum()
else:
loss = loss.sum()
return loss, stats
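# Illustrative sketch (not part of the original module): the
# ``normalize_by_length`` branch above in plain Python. The helper name and
# the nested-list layout ``[tgt_len][batch]`` are assumptions; padded
# positions are assumed to already carry a loss of zero.
def _length_normalized_loss_sketch(token_losses, pad_mask):
    """Sum over the batch of (sequence loss / number of non-pad tokens)."""
    batch = len(token_losses[0])
    total = 0.0
    for b in range(batch):
        seq_loss = sum(row[b] for row in token_losses)
        seq_len = sum(1 for row in pad_mask if not row[b])
        total += seq_loss / seq_len
    return total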
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/embeddings.py
================================================
""" Embeddings module """
import math
import warnings
import torch
import torch.nn as nn
from onmt.modules.util_class import Elementwise
class PositionalEncoding(nn.Module):
"""Sinusoidal positional encoding for non-recurrent neural networks.
Implementation based on "Attention Is All You Need"
:cite:`DBLP:journals/corr/VaswaniSPUJGKP17`
Args:
dropout (float): dropout parameter
dim (int): embedding size
max_len (int): maximum sequence length the table is built for
"""
def __init__(self, dropout, dim, max_len=5000):
if dim % 2 != 0:
raise ValueError("Cannot use sin/cos positional encoding with "
"odd dim (got dim={:d})".format(dim))
pe = torch.zeros(max_len, dim)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp((torch.arange(0, dim, 2, dtype=torch.float) *
-(math.log(10000.0) / dim)))
pe[:, 0::2] = torch.sin(position.float() * div_term)
pe[:, 1::2] = torch.cos(position.float() * div_term)
pe = pe.unsqueeze(1)
super(PositionalEncoding, self).__init__()
self.register_buffer('pe', pe)
self.dropout = nn.Dropout(p=dropout)
self.dim = dim
def forward(self, emb, step=None):
"""Embed inputs.
Args:
emb (FloatTensor): Sequence of word vectors
``(seq_len, batch_size, self.dim)``
step (int or NoneType): If stepwise (``seq_len = 1``), use
the encoding for this position.
"""
emb = emb * math.sqrt(self.dim)
if step is None:
emb = emb + self.pe[:emb.size(0)]
else:
emb = emb + self.pe[step]
emb = self.dropout(emb)
return emb
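# Illustrative sketch (not part of the original module): the sinusoidal
# table built in ``PositionalEncoding.__init__``, in plain Python. The helper
# name is an assumption; the formula mirrors ``div_term`` above.
import math


def _sinusoid_table_sketch(max_len, dim):
    """Return a ``max_len x dim`` table of sin/cos position features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even, got {:d}".format(dim))
    table = [[0.0] * dim for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, dim, 2):
            angle = pos * math.exp(-math.log(10000.0) * i / dim)
            table[pos][i] = math.sin(angle)      # even columns
            table[pos][i + 1] = math.cos(angle)  # odd columns
    return table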
class VecEmbedding(nn.Module):
def __init__(self, vec_size,
emb_dim,
position_encoding=False,
dropout=0):
super(VecEmbedding, self).__init__()
self.embedding_size = emb_dim
self.proj = nn.Linear(vec_size, emb_dim, bias=False)
self.word_padding_idx = 0 # vector seqs are zero-padded
self.position_encoding = position_encoding
if self.position_encoding:
self.pe = PositionalEncoding(dropout, self.embedding_size)
def forward(self, x, step=None):
"""
Args:
x (FloatTensor): input, ``(len, batch, 1, vec_feats)``.
Returns:
FloatTensor: embedded vecs ``(len, batch, embedding_size)``.
"""
x = self.proj(x).squeeze(2)
if self.position_encoding:
x = self.pe(x, step=step)
return x
def load_pretrained_vectors(self, file):
assert not file
class Embeddings(nn.Module):
"""Words embeddings for encoder/decoder.
Additionally includes ability to add sparse input features
based on "Linguistic Input Features Improve Neural Machine Translation"
:cite:`sennrich2016linguistic`.
.. mermaid::
graph LR
A[Input]
C[Feature 1 Lookup]
A-->B[Word Lookup]
A-->C
A-->D[Feature N Lookup]
B-->E[MLP/Concat]
C-->E
D-->E
E-->F[Output]
Args:
word_vec_size (int): dimension of the word embeddings.
word_padding_idx (int): padding index for words in the embeddings.
feat_padding_idx (List[int]): padding index for a list of features
in the embeddings.
word_vocab_size (int): size of dictionary of embeddings for words.
feat_vocab_sizes (List[int], optional): list of size of dictionary
of embeddings for each feature.
position_encoding (bool): see :class:`~onmt.modules.PositionalEncoding`
feat_merge (string): merge action for the features embeddings:
concat, sum or mlp.
feat_vec_exponent (float): when using `-feat_merge concat`, feature
embedding size is N^feat_vec_exponent, where N is the
number of values the feature takes.
feat_vec_size (int): embedding dimension for features when using
`-feat_merge mlp`
dropout (float): dropout probability.
"""
def __init__(self, word_vec_size,
word_vocab_size,
word_padding_idx,
position_encoding=False,
feat_merge="concat",
feat_vec_exponent=0.7,
feat_vec_size=-1,
feat_padding_idx=[],
feat_vocab_sizes=[],
dropout=0,
sparse=False,
fix_word_vecs=False):
# Normalize None before validation, which calls len() on this argument
if feat_padding_idx is None:
feat_padding_idx = []
self._validate_args(feat_merge, feat_vocab_sizes, feat_vec_exponent,
feat_vec_size, feat_padding_idx)
self.word_padding_idx = word_padding_idx
self.word_vec_size = word_vec_size
# Dimensions and padding for constructing the word embedding matrix
vocab_sizes = [word_vocab_size]
emb_dims = [word_vec_size]
pad_indices = [word_padding_idx]
# Dimensions and padding for feature embedding matrices
# (these have no effect if feat_vocab_sizes is empty)
if feat_merge == 'sum':
feat_dims = [word_vec_size] * len(feat_vocab_sizes)
elif feat_vec_size > 0:
feat_dims = [feat_vec_size] * len(feat_vocab_sizes)
else:
feat_dims = [int(vocab ** feat_vec_exponent)
for vocab in feat_vocab_sizes]
vocab_sizes.extend(feat_vocab_sizes)
emb_dims.extend(feat_dims)
pad_indices.extend(feat_padding_idx)
# The embedding matrix look-up tables. The first look-up table
# is for words. Subsequent ones are for features, if any exist.
emb_params = zip(vocab_sizes, emb_dims, pad_indices)
embeddings = [nn.Embedding(vocab, dim, padding_idx=pad, sparse=sparse)
for vocab, dim, pad in emb_params]
emb_luts = Elementwise(feat_merge, embeddings)
# The final output size of word + feature vectors. This can vary
# from the word vector size if and only if features are defined.
# This is the attribute you should access if you need to know
# how big your embeddings are going to be.
self.embedding_size = (sum(emb_dims) if feat_merge == 'concat'
else word_vec_size)
# The sequence of operations that converts the input sequence
# into a sequence of embeddings. At minimum this consists of
# looking up the embeddings for each word and feature in the
# input. Model parameters may require the sequence to contain
# additional operations as well.
super(Embeddings, self).__init__()
self.make_embedding = nn.Sequential()
self.make_embedding.add_module('emb_luts', emb_luts)
if feat_merge == 'mlp' and len(feat_vocab_sizes) > 0:
in_dim = sum(emb_dims)
mlp = nn.Sequential(nn.Linear(in_dim, word_vec_size), nn.ReLU())
self.make_embedding.add_module('mlp', mlp)
self.position_encoding = position_encoding
if self.position_encoding:
pe = PositionalEncoding(dropout, self.embedding_size)
self.make_embedding.add_module('pe', pe)
if fix_word_vecs:
self.word_lut.weight.requires_grad = False
def _validate_args(self, feat_merge, feat_vocab_sizes, feat_vec_exponent,
feat_vec_size, feat_padding_idx):
if feat_merge == "sum":
# features must use word_vec_size
if feat_vec_exponent != 0.7:
warnings.warn("Merging with sum, but got non-default "
"feat_vec_exponent. It will be unused.")
if feat_vec_size != -1:
warnings.warn("Merging with sum, but got non-default "
"feat_vec_size. It will be unused.")
elif feat_vec_size > 0:
# features will use feat_vec_size
if feat_vec_exponent != 0.7:
warnings.warn("Not merging with sum and positive "
"feat_vec_size, but got non-default "
"feat_vec_exponent. It will be unused.")
else:
if feat_vec_exponent <= 0:
raise ValueError("Using feat_vec_exponent to determine "
"feature vec size, but got feat_vec_exponent "
"less than or equal to 0.")
n_feats = len(feat_vocab_sizes)
if n_feats != len(feat_padding_idx):
raise ValueError("Got unequal number of feat_vocab_sizes and "
"feat_padding_idx ({:d} != {:d})".format(
n_feats, len(feat_padding_idx)))
@property
def word_lut(self):
"""Word look-up table."""
return self.make_embedding[0][0]
@property
def emb_luts(self):
"""Embedding look-up table."""
return self.make_embedding[0]
def load_pretrained_vectors(self, emb_file):
"""Load in pretrained embeddings.
Args:
emb_file (str) : path to torch serialized embeddings
"""
if emb_file:
pretrained = torch.load(emb_file)
pretrained_vec_size = pretrained.size(1)
if self.word_vec_size > pretrained_vec_size:
self.word_lut.weight.data[:, :pretrained_vec_size] = pretrained
elif self.word_vec_size < pretrained_vec_size:
self.word_lut.weight.data \
.copy_(pretrained[:, :self.word_vec_size])
else:
self.word_lut.weight.data.copy_(pretrained)
def forward(self, source, step=None):
"""Computes the embeddings for words and features.
Args:
source (LongTensor): index tensor ``(len, batch, nfeat)``
Returns:
FloatTensor: Word embeddings ``(len, batch, embedding_size)``
"""
if self.position_encoding:
for i, module in enumerate(self.make_embedding._modules.values()):
if i == len(self.make_embedding._modules.values()) - 1:
source = module(source, step=step)
else:
source = module(source)
else:
source = self.make_embedding(source)
return source
def update_dropout(self, dropout):
if self.position_encoding:
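# 'pe' is registered by name, so this also works when an 'mlp'
# module sits between the look-up tables and the encoding
self.make_embedding.pe.dropout.p = dropout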
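# Illustrative sketch (not part of the original module): how
# ``Embeddings.__init__`` sizes the feature embeddings and the final output,
# in plain Python. The helper name is an assumption.
def _feature_dims_sketch(word_vec_size, feat_vocab_sizes, feat_merge="concat",
                         feat_vec_exponent=0.7, feat_vec_size=-1):
    """Return ``(emb_dims, embedding_size)`` for the given settings."""
    if feat_merge == "sum":
        feat_dims = [word_vec_size] * len(feat_vocab_sizes)
    elif feat_vec_size > 0:
        feat_dims = [feat_vec_size] * len(feat_vocab_sizes)
    else:
        feat_dims = [int(v ** feat_vec_exponent) for v in feat_vocab_sizes]
    emb_dims = [word_vec_size] + feat_dims
    # Only concatenation changes the output width.
    size = sum(emb_dims) if feat_merge == "concat" else word_vec_size
    return emb_dims, size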
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/gate.py
================================================
""" ContextGate module """
import torch
import torch.nn as nn
def context_gate_factory(gate_type, embeddings_size, decoder_size,
attention_size, output_size):
"""Returns the correct ContextGate class"""
gate_types = {'source': SourceContextGate,
'target': TargetContextGate,
'both': BothContextGate}
assert gate_type in gate_types, "Not valid ContextGate type: {0}".format(
gate_type)
return gate_types[gate_type](embeddings_size, decoder_size, attention_size,
output_size)
class ContextGate(nn.Module):
"""
Context gate is a decoder module that takes as input the previous word
embedding, the current decoder state and the attention state, and
produces a gate.
The gate can be used to select the input from the target side context
(decoder state), from the source context (attention state) or both.
"""
def __init__(self, embeddings_size, decoder_size,
attention_size, output_size):
super(ContextGate, self).__init__()
input_size = embeddings_size + decoder_size + attention_size
self.gate = nn.Linear(input_size, output_size, bias=True)
self.sig = nn.Sigmoid()
self.source_proj = nn.Linear(attention_size, output_size)
self.target_proj = nn.Linear(embeddings_size + decoder_size,
output_size)
def forward(self, prev_emb, dec_state, attn_state):
input_tensor = torch.cat((prev_emb, dec_state, attn_state), dim=1)
z = self.sig(self.gate(input_tensor))
proj_source = self.source_proj(attn_state)
proj_target = self.target_proj(
torch.cat((prev_emb, dec_state), dim=1))
return z, proj_source, proj_target
class SourceContextGate(nn.Module):
"""Apply the context gate only to the source context"""
def __init__(self, embeddings_size, decoder_size,
attention_size, output_size):
super(SourceContextGate, self).__init__()
self.context_gate = ContextGate(embeddings_size, decoder_size,
attention_size, output_size)
self.tanh = nn.Tanh()
def forward(self, prev_emb, dec_state, attn_state):
z, source, target = self.context_gate(
prev_emb, dec_state, attn_state)
return self.tanh(target + z * source)
class TargetContextGate(nn.Module):
"""Apply the context gate only to the target context"""
def __init__(self, embeddings_size, decoder_size,
attention_size, output_size):
super(TargetContextGate, self).__init__()
self.context_gate = ContextGate(embeddings_size, decoder_size,
attention_size, output_size)
self.tanh = nn.Tanh()
def forward(self, prev_emb, dec_state, attn_state):
z, source, target = self.context_gate(prev_emb, dec_state, attn_state)
return self.tanh(z * target + source)
class BothContextGate(nn.Module):
"""Apply the context gate to both contexts"""
def __init__(self, embeddings_size, decoder_size,
attention_size, output_size):
super(BothContextGate, self).__init__()
self.context_gate = ContextGate(embeddings_size, decoder_size,
attention_size, output_size)
self.tanh = nn.Tanh()
def forward(self, prev_emb, dec_state, attn_state):
z, source, target = self.context_gate(prev_emb, dec_state, attn_state)
return self.tanh((1. - z) * target + z * source)
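# Illustrative sketch (not part of the original module): the three gate
# combinations above, written for scalars in plain Python. The helper name is
# an assumption; ``z`` is the sigmoid gate, ``source`` and ``target`` are the
# projected attention state and decoder-side state.
import math


def _gate_combine_sketch(gate_type, z, source, target):
    """Combine source and target contexts according to ``gate_type``."""
    if gate_type == "source":
        return math.tanh(target + z * source)
    if gate_type == "target":
        return math.tanh(z * target + source)
    if gate_type == "both":
        return math.tanh((1.0 - z) * target + z * source)
    raise ValueError("Not valid ContextGate type: {0}".format(gate_type))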
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/global_attention.py
================================================
"""Global attention modules (Luong / Bahdanau)"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from onmt.modules.sparse_activations import sparsemax
from onmt.utils.misc import aeq, sequence_mask
# This class is mainly used by decoder.py for RNNs but also
# by the CNN / transformer decoder when copy attention is used
# CNN has its own attention mechanism ConvMultiStepAttention
# Transformer has its own MultiHeadedAttention
class GlobalAttention(nn.Module):
r"""
Global attention takes a matrix and a query vector. It
then computes a parameterized convex combination of the matrix
based on the input query.
Constructs a unit mapping a query `q` of size `dim`
and a source matrix `H` of size `n x dim`, to an output
of size `dim`.
.. mermaid::
graph BT
A[Query]
subgraph RNN
C[H 1]
D[H 2]
E[H N]
end
F[Attn]
G[Output]
A --> F
C --> F
D --> F
E --> F
C -.-> G
D -.-> G
E -.-> G
F --> G
All models compute the output as
:math:`c = \sum_{j=1}^{\text{SeqLength}} a_j H_j` where
:math:`a_j` is the softmax of a score function.
They then apply a projection layer to [q, c].
However, they differ in how they compute the attention score.
* Luong Attention (dot, general):
* dot: :math:`\text{score}(H_j,q) = H_j^T q`
* general: :math:`\text{score}(H_j, q) = H_j^T W_a q`
* Bahdanau Attention (mlp):
* :math:`\text{score}(H_j, q) = v_a^T \text{tanh}(W_a q + U_a H_j)`
Args:
dim (int): dimensionality of query and key
coverage (bool): use coverage term
attn_type (str): type of attention to use, options [dot,general,mlp]
attn_func (str): attention function to use, options [softmax,sparsemax]
"""
def __init__(self, dim, coverage=False, attn_type="dot",
attn_func="softmax"):
super(GlobalAttention, self).__init__()
self.dim = dim
assert attn_type in ["dot", "general", "mlp"], (
"Please select a valid attention type (got {:s}).".format(
attn_type))
self.attn_type = attn_type
assert attn_func in ["softmax", "sparsemax"], (
"Please select a valid attention function.")
self.attn_func = attn_func
if self.attn_type == "general":
self.linear_in = nn.Linear(dim, dim, bias=False)
elif self.attn_type == "mlp":
self.linear_context = nn.Linear(dim, dim, bias=False)
self.linear_query = nn.Linear(dim, dim, bias=True)
self.v = nn.Linear(dim, 1, bias=False)
# mlp wants it with bias
out_bias = self.attn_type == "mlp"
self.linear_out = nn.Linear(dim * 2, dim, bias=out_bias)
if coverage:
self.linear_cover = nn.Linear(1, dim, bias=False)
def score(self, h_t, h_s):
"""
Args:
h_t (FloatTensor): sequence of queries ``(batch, tgt_len, dim)``
h_s (FloatTensor): sequence of sources ``(batch, src_len, dim)``
Returns:
FloatTensor: raw attention scores (unnormalized) for each src index
``(batch, tgt_len, src_len)``
"""
# Check input sizes
src_batch, src_len, src_dim = h_s.size()
tgt_batch, tgt_len, tgt_dim = h_t.size()
aeq(src_batch, tgt_batch)
aeq(src_dim, tgt_dim)
aeq(self.dim, src_dim)
if self.attn_type in ["general", "dot"]:
if self.attn_type == "general":
h_t_ = h_t.view(tgt_batch * tgt_len, tgt_dim)
h_t_ = self.linear_in(h_t_)
h_t = h_t_.view(tgt_batch, tgt_len, tgt_dim)
h_s_ = h_s.transpose(1, 2)
# (batch, t_len, d) x (batch, d, s_len) --> (batch, t_len, s_len)
return torch.bmm(h_t, h_s_)
else:
dim = self.dim
wq = self.linear_query(h_t.view(-1, dim))
wq = wq.view(tgt_batch, tgt_len, 1, dim)
wq = wq.expand(tgt_batch, tgt_len, src_len, dim)
uh = self.linear_context(h_s.contiguous().view(-1, dim))
uh = uh.view(src_batch, 1, src_len, dim)
uh = uh.expand(src_batch, tgt_len, src_len, dim)
# (batch, t_len, s_len, d)
wquh = torch.tanh(wq + uh)
return self.v(wquh.view(-1, dim)).view(tgt_batch, tgt_len, src_len)
def forward(self, source, memory_bank, memory_lengths=None, coverage=None):
"""
Args:
source (FloatTensor): query vectors ``(batch, tgt_len, dim)``
memory_bank (FloatTensor): source vectors ``(batch, src_len, dim)``
memory_lengths (LongTensor): the source context lengths ``(batch,)``
coverage (FloatTensor): None (not supported yet)
Returns:
(FloatTensor, FloatTensor):
* Computed vector ``(tgt_len, batch, dim)``
* Attention distributions for each query
``(tgt_len, batch, src_len)``
"""
# one step input
if source.dim() == 2:
one_step = True
source = source.unsqueeze(1)
else:
one_step = False
batch, source_l, dim = memory_bank.size()
batch_, target_l, dim_ = source.size()
aeq(batch, batch_)
aeq(dim, dim_)
aeq(self.dim, dim)
if coverage is not None:
batch_, source_l_ = coverage.size()
aeq(batch, batch_)
aeq(source_l, source_l_)
if coverage is not None:
cover = coverage.view(-1).unsqueeze(1)
memory_bank += self.linear_cover(cover).view_as(memory_bank)
memory_bank = torch.tanh(memory_bank)
# compute attention scores, as in Luong et al.
align = self.score(source, memory_bank)
if memory_lengths is not None:
mask = sequence_mask(memory_lengths, max_len=align.size(-1))
mask = mask.unsqueeze(1) # Make it broadcastable.
align.masked_fill_(~mask, -float('inf'))
# Softmax or sparsemax to normalize attention weights
if self.attn_func == "softmax":
align_vectors = F.softmax(align.view(batch*target_l, source_l), -1)
else:
align_vectors = sparsemax(align.view(batch*target_l, source_l), -1)
align_vectors = align_vectors.view(batch, target_l, source_l)
# each context vector c_t is the weighted average
# over all the source hidden states
c = torch.bmm(align_vectors, memory_bank)
# concatenate
concat_c = torch.cat([c, source], 2).view(batch*target_l, dim*2)
attn_h = self.linear_out(concat_c).view(batch, target_l, dim)
if self.attn_type in ["general", "dot"]:
attn_h = torch.tanh(attn_h)
if one_step:
attn_h = attn_h.squeeze(1)
align_vectors = align_vectors.squeeze(1)
# Check output sizes
batch_, dim_ = attn_h.size()
aeq(batch, batch_)
aeq(dim, dim_)
batch_, source_l_ = align_vectors.size()
aeq(batch, batch_)
aeq(source_l, source_l_)
else:
attn_h = attn_h.transpose(0, 1).contiguous()
align_vectors = align_vectors.transpose(0, 1).contiguous()
# Check output sizes
target_l_, batch_, dim_ = attn_h.size()
aeq(target_l, target_l_)
aeq(batch, batch_)
aeq(dim, dim_)
target_l_, batch_, source_l_ = align_vectors.size()
aeq(target_l, target_l_)
aeq(batch, batch_)
aeq(source_l, source_l_)
return attn_h, align_vectors
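# Illustrative sketch (not part of the original module): dot-product global
# attention for one query in plain Python, covering the scoring, length
# masking, softmax and weighted-sum steps above. The helper name is an
# assumption, ``length`` is assumed to be at least 1, and the final
# ``linear_out`` projection and tanh are omitted.
import math


def _dot_attention_sketch(query, memory, length=None):
    """Return ``(context, weights)`` for one query over ``memory`` rows."""
    scores = [sum(q * h for q, h in zip(query, row)) for row in memory]
    if length is not None:
        # Positions past the true source length get zero attention.
        scores = [s if j < length else -math.inf
                  for j, s in enumerate(scores)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * row[d] for w, row in zip(weights, memory))
               for d in range(len(query))]
    return context, weights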
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/multi_headed_attn.py
================================================
""" Multi-Head Attention module """
import math
import torch
import torch.nn as nn
from onmt.utils.misc import generate_relative_positions_matrix,\
relative_matmul
# from onmt.utils.misc import aeq
class MultiHeadedAttention(nn.Module):
"""Multi-Head Attention module from "Attention is All You Need"
:cite:`DBLP:journals/corr/VaswaniSPUJGKP17`.
Similar to standard `dot` attention but uses
multiple attention distributions simultaneously
to select relevant items.
.. mermaid::
graph BT
A[key]
B[value]
C[query]
O[output]
subgraph Attn
D[Attn 1]
E[Attn 2]
F[Attn N]
end
A --> D
C --> D
A --> E
C --> E
A --> F
C --> F
D --> O
E --> O
F --> O
B --> O
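Also supports relative position representations
(``max_relative_positions``) and cached keys/values for
step-wise decoding (``layer_cache``).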
Args:
head_count (int): number of parallel heads
model_dim (int): the dimension of keys/values/queries,
must be divisible by head_count
dropout (float): dropout parameter
"""
def __init__(self, head_count, model_dim, dropout=0.1,
max_relative_positions=0):
assert model_dim % head_count == 0
self.dim_per_head = model_dim // head_count
self.model_dim = model_dim
super(MultiHeadedAttention, self).__init__()
self.head_count = head_count
self.linear_keys = nn.Linear(model_dim,
head_count * self.dim_per_head)
self.linear_values = nn.Linear(model_dim,
head_count * self.dim_per_head)
self.linear_query = nn.Linear(model_dim,
head_count * self.dim_per_head)
self.softmax = nn.Softmax(dim=-1)
self.dropout = nn.Dropout(dropout)
self.final_linear = nn.Linear(model_dim, model_dim)
self.max_relative_positions = max_relative_positions
if max_relative_positions > 0:
vocab_size = max_relative_positions * 2 + 1
self.relative_positions_embeddings = nn.Embedding(
vocab_size, self.dim_per_head)
def forward(self, key, value, query, mask=None,
layer_cache=None, attn_type=None):
"""
Compute the context vector and the attention vectors.
Args:
key (FloatTensor): set of `key_len`
key vectors ``(batch, key_len, dim)``
value (FloatTensor): set of `key_len`
value vectors ``(batch, key_len, dim)``
query (FloatTensor): set of `query_len`
query vectors ``(batch, query_len, dim)``
mask: binary mask 1/0 indicating which keys have
zero / non-zero attention ``(batch, query_len, key_len)``
Returns:
(FloatTensor, FloatTensor):
* output context vectors ``(batch, query_len, dim)``
* Attention vector in heads ``(batch, head, query_len, key_len)``.
"""
# CHECKS
# batch, k_len, d = key.size()
# batch_, k_len_, d_ = value.size()
# aeq(batch, batch_)
# aeq(k_len, k_len_)
# aeq(d, d_)
# batch_, q_len, d_ = query.size()
# aeq(batch, batch_)
# aeq(d, d_)
# aeq(self.model_dim % 8, 0)
# if mask is not None:
# batch_, q_len_, k_len_ = mask.size()
# aeq(batch_, batch)
# aeq(k_len_, k_len)
# aeq(q_len_ == q_len)
# END CHECKS
batch_size = key.size(0)
dim_per_head = self.dim_per_head
head_count = self.head_count
key_len = key.size(1)
query_len = query.size(1)
def shape(x):
"""Projection."""
return x.view(batch_size, -1, head_count, dim_per_head) \
.transpose(1, 2)
def unshape(x):
"""Compute context."""
return x.transpose(1, 2).contiguous() \
.view(batch_size, -1, head_count * dim_per_head)
# 1) Project key, value, and query.
if layer_cache is not None:
if attn_type == "self":
query, key, value = self.linear_query(query),\
self.linear_keys(query),\
self.linear_values(query)
key = shape(key)
value = shape(value)
if layer_cache["self_keys"] is not None:
key = torch.cat(
(layer_cache["self_keys"], key),
dim=2)
if layer_cache["self_values"] is not None:
value = torch.cat(
(layer_cache["self_values"], value),
dim=2)
layer_cache["self_keys"] = key
layer_cache["self_values"] = value
elif attn_type == "context":
query = self.linear_query(query)
if layer_cache["memory_keys"] is None:
key, value = self.linear_keys(key),\
self.linear_values(value)
key = shape(key)
value = shape(value)
else:
key, value = layer_cache["memory_keys"],\
layer_cache["memory_values"]
layer_cache["memory_keys"] = key
layer_cache["memory_values"] = value
else:
key = self.linear_keys(key)
value = self.linear_values(value)
query = self.linear_query(query)
key = shape(key)
value = shape(value)
if self.max_relative_positions > 0 and attn_type == "self":
key_len = key.size(2)
# 1 or key_len x key_len
relative_positions_matrix = generate_relative_positions_matrix(
key_len, self.max_relative_positions,
cache=layer_cache is not None)
# 1 or key_len x key_len x dim_per_head
relations_keys = self.relative_positions_embeddings(
relative_positions_matrix.to(key.device))
# 1 or key_len x key_len x dim_per_head
relations_values = self.relative_positions_embeddings(
relative_positions_matrix.to(key.device))
query = shape(query)
key_len = key.size(2)
query_len = query.size(2)
# 2) Calculate and scale scores.
query = query / math.sqrt(dim_per_head)
# batch x num_heads x query_len x key_len
query_key = torch.matmul(query, key.transpose(2, 3))
if self.max_relative_positions > 0 and attn_type == "self":
scores = query_key + relative_matmul(query, relations_keys, True)
else:
scores = query_key
scores = scores.float()
if mask is not None:
mask = mask.unsqueeze(1) # [B, 1, 1, T_values]
scores = scores.masked_fill(mask, -1e18)
# 3) Apply attention dropout and compute context vectors.
attn = self.softmax(scores).to(query.dtype)
drop_attn = self.dropout(attn)
context_original = torch.matmul(drop_attn, value)
if self.max_relative_positions > 0 and attn_type == "self":
context = unshape(context_original
+ relative_matmul(drop_attn,
relations_values,
False))
else:
context = unshape(context_original)
output = self.final_linear(context)
# CHECK
# batch_, q_len_, d_ = output.size()
# aeq(q_len, q_len_)
# aeq(batch, batch_)
# aeq(d, d_)
# Return multi-head attn
attns = attn \
.view(batch_size, head_count,
query_len, key_len)
return output, attns
def update_dropout(self, dropout):
self.dropout.p = dropout
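# Illustrative sketch (not part of the original module): how one query is
# split into heads and scored against the keys with scaled dot products, in
# plain Python. The helper name is an assumption; the learned projections,
# masking, dropout and relative positions used above are left out.
import math


def _multi_head_weights_sketch(query, keys, head_count):
    """Return one softmax distribution over ``keys`` per head."""
    model_dim = len(query)
    assert model_dim % head_count == 0
    d = model_dim // head_count
    all_weights = []
    for h in range(head_count):
        lo, hi = h * d, (h + 1) * d
        q = [x / math.sqrt(d) for x in query[lo:hi]]  # scale the query
        scores = [sum(a * b for a, b in zip(q, k[lo:hi])) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        all_weights.append([e / total for e in exps])
    return all_weights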
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/position_ffn.py
================================================
"""Position feed-forward network from "Attention is All You Need"."""
import torch.nn as nn
class PositionwiseFeedForward(nn.Module):
""" A two-layer Feed-Forward-Network with residual layer norm.
Args:
d_model (int): the input and output size of the FFN.
d_ff (int): the hidden layer size of the FFN.
dropout (float): dropout probability in :math:`[0, 1)`.
"""
def __init__(self, d_model, d_ff, dropout=0.1):
super(PositionwiseFeedForward, self).__init__()
self.w_1 = nn.Linear(d_model, d_ff)
self.w_2 = nn.Linear(d_ff, d_model)
self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
self.dropout_1 = nn.Dropout(dropout)
self.relu = nn.ReLU()
self.dropout_2 = nn.Dropout(dropout)
def forward(self, x):
"""Layer definition.
Args:
x: ``(batch_size, input_len, model_dim)``
Returns:
(FloatTensor): Output ``(batch_size, input_len, model_dim)``.
"""
inter = self.dropout_1(self.relu(self.w_1(self.layer_norm(x))))
output = self.dropout_2(self.w_2(inter))
return output + x
def update_dropout(self, dropout):
self.dropout_1.p = dropout
self.dropout_2.p = dropout
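# Illustrative sketch (not part of the original module): the forward pass
# above for a single position in plain Python, i.e. layer norm, linear,
# ReLU, linear, then the residual connection. The helper name is an
# assumption; dropout and the learned layer-norm scale and bias are omitted.
import math


def _ffn_sketch(x, w1, b1, w2, b2, eps=1e-6):
    """Apply a pre-norm two-layer feed-forward block to vector ``x``."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    hidden = [max(0.0, sum(w * n for w, n in zip(row, normed)) + b)
              for row, b in zip(w1, b1)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(w2, b2)]
    return [o + v for o, v in zip(out, x)]  # residual connection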
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/sparse_activations.py
================================================
"""
An implementation of sparsemax (Martins & Astudillo, 2016). See
:cite:`DBLP:journals/corr/MartinsA16` for detailed description.
By Ben Peters and Vlad Niculae
"""
import torch
from torch.autograd import Function
import torch.nn as nn
def _make_ix_like(input, dim=0):
d = input.size(dim)
rho = torch.arange(1, d + 1, device=input.device, dtype=input.dtype)
view = [1] * input.dim()
view[0] = -1
return rho.view(view).transpose(0, dim)
def _threshold_and_support(input, dim=0):
"""Sparsemax building block: compute the threshold
Args:
input: any dimension
dim: dimension along which to apply the sparsemax
Returns:
the threshold value and the support size
"""
input_srt, _ = torch.sort(input, descending=True, dim=dim)
input_cumsum = input_srt.cumsum(dim) - 1
rhos = _make_ix_like(input, dim)
support = rhos * input_srt > input_cumsum
support_size = support.sum(dim=dim).unsqueeze(dim)
tau = input_cumsum.gather(dim, support_size - 1)
tau /= support_size.to(input.dtype)
return tau, support_size
class SparsemaxFunction(Function):
@staticmethod
def forward(ctx, input, dim=0):
"""sparsemax: normalizing sparse transform (a la softmax)
Parameters:
input (Tensor): any shape
dim: dimension along which to apply sparsemax
Returns:
output (Tensor): same shape as input
"""
ctx.dim = dim
max_val, _ = input.max(dim=dim, keepdim=True)
input -= max_val # same numerical stability trick as for softmax
tau, supp_size = _threshold_and_support(input, dim=dim)
output = torch.clamp(input - tau, min=0)
ctx.save_for_backward(supp_size, output)
return output
@staticmethod
def backward(ctx, grad_output):
supp_size, output = ctx.saved_tensors
dim = ctx.dim
grad_input = grad_output.clone()
grad_input[output == 0] = 0
v_hat = grad_input.sum(dim=dim) / supp_size.to(output.dtype).squeeze()
v_hat = v_hat.unsqueeze(dim)
grad_input = torch.where(output != 0, grad_input - v_hat, grad_input)
return grad_input, None
sparsemax = SparsemaxFunction.apply
class Sparsemax(nn.Module):
def __init__(self, dim=0):
self.dim = dim
super(Sparsemax, self).__init__()
def forward(self, input):
return sparsemax(input, self.dim)
class LogSparsemax(nn.Module):
def __init__(self, dim=0):
self.dim = dim
super(LogSparsemax, self).__init__()
def forward(self, input):
return torch.log(sparsemax(input, self.dim))
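# A minimal pure-Python sketch of the sparsemax computation above, assuming a
# single 1-D list of scores rather than a tensor. The name `sparsemax_1d` is
# hypothetical, and the max-subtraction stability step is omitted; the sketch
# only mirrors the sort / cumulative-sum / threshold logic of
# `_threshold_and_support`.

```python
def sparsemax_1d(scores):
    """Project a list of real scores onto the probability simplex."""
    sorted_scores = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(sorted_scores, start=1):
        cumsum += z_k
        # Position k belongs to the support while k * z_(k) > cumsum - 1.
        # The condition holds for a prefix of the sorted scores, so the last
        # k that satisfies it fixes the threshold tau.
        if k * z_k > cumsum - 1:
            tau = (cumsum - 1) / k
    # Scores at or below the threshold receive exactly zero probability.
    return [max(z - tau, 0.0) for z in scores]


peaked = sparsemax_1d([2.0, 1.0, 0.1])
close = sparsemax_1d([0.5, 0.4, -1.0])
```

# Unlike softmax, the result can contain exact zeros: in `peaked` all of the
# mass goes to the first score, while in `close` the two nearby scores share it.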
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/sparse_losses.py
================================================
import torch
import torch.nn as nn
from torch.autograd import Function
from onmt.modules.sparse_activations import _threshold_and_support
from onmt.utils.misc import aeq
class SparsemaxLossFunction(Function):
@staticmethod
def forward(ctx, input, target):
"""
input (FloatTensor): ``(n, num_classes)``.
target (LongTensor): ``(n,)``, the indices of the target classes
"""
input_batch, classes = input.size()
target_batch = target.size(0)
aeq(input_batch, target_batch)
z_k = input.gather(1, target.unsqueeze(1)).squeeze()
tau_z, support_size = _threshold_and_support(input, dim=1)
support = input > tau_z
x = torch.where(
support, input**2 - tau_z**2,
torch.tensor(0.0, device=input.device)
).sum(dim=1)
ctx.save_for_backward(input, target, tau_z)
# clamping necessary because of numerical errors: loss should be lower
# bounded by zero, but negative values near zero are possible without
# the clamp
return torch.clamp(x / 2 - z_k + 0.5, min=0.0)
@staticmethod
def backward(ctx, grad_output):
input, target, tau_z = ctx.saved_tensors
sparsemax_out = torch.clamp(input - tau_z, min=0)
delta = torch.zeros_like(sparsemax_out)
delta.scatter_(1, target.unsqueeze(1), 1)
return sparsemax_out - delta, None
sparsemax_loss = SparsemaxLossFunction.apply
class SparsemaxLoss(nn.Module):
"""
An implementation of sparsemax loss, first proposed in
:cite:`DBLP:journals/corr/MartinsA16`. If using
a sparse output layer, it is not possible to use negative log likelihood
because the loss is infinite in the case the target is assigned zero
probability. Inputs to SparsemaxLoss are arbitrary dense real-valued
vectors (like in nn.CrossEntropyLoss), not probability vectors (like in
nn.NLLLoss).
"""
def __init__(self, weight=None, ignore_index=-100,
reduction='elementwise_mean'):
assert reduction in ['elementwise_mean', 'sum', 'none']
self.reduction = reduction
self.weight = weight
self.ignore_index = ignore_index
super(SparsemaxLoss, self).__init__()
def forward(self, input, target):
loss = sparsemax_loss(input, target)
if self.ignore_index >= 0:
ignored_positions = target == self.ignore_index
size = float((target.size(0) - ignored_positions.sum()).item())
loss.masked_fill_(ignored_positions, 0.0)
else:
size = float(target.size(0))
if self.reduction == 'sum':
loss = loss.sum()
elif self.reduction == 'elementwise_mean':
loss = loss.sum() / size
return loss
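# A worked sketch of the sparsemax loss for one example, assuming plain Python
# lists instead of tensors. The helper names `threshold_1d`,
# `sparsemax_loss_1d` and `sparsemax_loss_grad_1d` are hypothetical; the
# formulas follow `SparsemaxLossFunction.forward` and `backward` above.

```python
def threshold_1d(scores):
    """Return the sparsemax threshold tau for a list of scores."""
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(scores, reverse=True), start=1):
        cumsum += z_k
        if k * z_k > cumsum - 1:
            tau = (cumsum - 1) / k
    return tau


def sparsemax_loss_1d(scores, target):
    """0.5 * sum over the support of (z_j^2 - tau^2) - z_target + 0.5."""
    tau = threshold_1d(scores)
    x = sum(z ** 2 - tau ** 2 for z in scores if z > tau)
    # Clamp at zero to absorb small negative values from rounding.
    return max(x / 2 - scores[target] + 0.5, 0.0)


def sparsemax_loss_grad_1d(scores, target):
    """Gradient w.r.t. the scores: sparsemax(scores) - one_hot(target)."""
    tau = threshold_1d(scores)
    grad = [max(z - tau, 0.0) for z in scores]
    grad[target] -= 1.0
    return grad


scores = [2.0, 1.0, 0.1]
loss_correct = sparsemax_loss_1d(scores, target=0)
loss_wrong = sparsemax_loss_1d(scores, target=1)
grad_wrong = sparsemax_loss_grad_1d(scores, target=1)
```

# With these scores sparsemax puts all mass on index 0, so the loss is zero
# when index 0 is the target and positive otherwise.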
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/structured_attention.py
================================================
import torch.nn as nn
import torch
import torch.cuda
class MatrixTree(nn.Module):
"""Implementation of the matrix-tree theorem for computing marginals
of non-projective dependency parsing. This attention layer is used
in the paper "Learning Structured Text Representations"
:cite:`DBLP:journals/corr/LiuL17d`.
"""
def __init__(self, eps=1e-5):
self.eps = eps
super(MatrixTree, self).__init__()
def forward(self, input):
laplacian = input.exp() + self.eps
output = input.clone()
for b in range(input.size(0)):
lap = laplacian[b].masked_fill(
torch.eye(input.size(1), device=input.device).ne(0), 0)
lap = -lap + torch.diag(lap.sum(0))
# store roots on diagonal
lap[0] = input[b].diag().exp()
inv_laplacian = lap.inverse()
factor = inv_laplacian.diag().unsqueeze(1)\
.expand_as(input[b]).transpose(0, 1)
term1 = input[b].exp().mul(factor).clone()
term2 = input[b].exp().mul(inv_laplacian.transpose(0, 1)).clone()
term1[:, 0] = 0
term2[0] = 0
output[b] = term1 - term2
roots_output = input[b].diag().exp().mul(
inv_laplacian.transpose(0, 1)[0])
output[b] = output[b] + torch.diag(roots_output)
return output
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/util_class.py
================================================
""" Misc classes """
import torch
import torch.nn as nn
# At the moment this class is only used by embeddings.Embeddings look-up tables
class Elementwise(nn.ModuleList):
"""
A simple network container.
Parameters are a list of modules.
Inputs are a 3d Tensor whose last dimension is the same length
as the list.
Outputs are the result of applying modules to inputs elementwise.
An optional merge parameter allows the outputs to be reduced to a
single Tensor.
"""
def __init__(self, merge=None, *args):
assert merge in [None, 'first', 'concat', 'sum', 'mlp']
self.merge = merge
super(Elementwise, self).__init__(*args)
def forward(self, inputs):
inputs_ = [feat.squeeze(2) for feat in inputs.split(1, dim=2)]
assert len(self) == len(inputs_)
outputs = [f(x) for f, x in zip(self, inputs_)]
if self.merge == 'first':
return outputs[0]
elif self.merge == 'concat' or self.merge == 'mlp':
return torch.cat(outputs, 2)
elif self.merge == 'sum':
return sum(outputs)
else:
return outputs
class Cast(nn.Module):
"""
Basic layer that casts its input to a specific data type. The same tensor
is returned if the data type is already correct.
"""
def __init__(self, dtype):
super(Cast, self).__init__()
self._dtype = dtype
def forward(self, x):
return x.to(self._dtype)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/modules/weight_norm.py
================================================
""" Weights normalization modules """
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter
def get_var_maybe_avg(namespace, var_name, training, polyak_decay):
""" utility for retrieving polyak averaged params
Update average
"""
v = getattr(namespace, var_name)
v_avg = getattr(namespace, var_name + '_avg')
v_avg -= (1 - polyak_decay) * (v_avg - v.data)
if training:
return v
else:
return v_avg
def get_vars_maybe_avg(namespace, var_names, training, polyak_decay):
""" utility for retrieving polyak averaged params """
vars = []
for vn in var_names:
vars.append(get_var_maybe_avg(
namespace, vn, training, polyak_decay))
return vars
class WeightNormLinear(nn.Linear):
"""
Implementation of "Weight Normalization: A Simple Reparameterization
to Accelerate Training of Deep Neural Networks"
:cite:`DBLP:journals/corr/SalimansK16`
Weight normalization is a reparameterization method in the same spirit
as batch normalization, but it does not depend on the minibatch.
NOTE: This is used nowhere in the code at this stage
Vincent Nguyen 05/18/2018
"""
def __init__(self, in_features, out_features,
init_scale=1., polyak_decay=0.9995):
super(WeightNormLinear, self).__init__(
in_features, out_features, bias=True)
self.V = self.weight
self.g = Parameter(torch.Tensor(out_features))
self.b = self.bias
self.register_buffer(
'V_avg', torch.zeros(out_features, in_features))
self.register_buffer('g_avg', torch.zeros(out_features))
self.register_buffer('b_avg', torch.zeros(out_features))
self.init_scale = init_scale
self.polyak_decay = polyak_decay
self.reset_parameters()
def reset_parameters(self):
return
def forward(self, x, init=False):
if init is True:
# out_features * in_features
self.V.data.copy_(torch.randn(self.V.data.size()).type_as(
self.V.data) * 0.05)
# norm is out_features * 1
v_norm = self.V.data / \
self.V.data.norm(2, 1).expand_as(self.V.data)
# batch_size * out_features
x_init = F.linear(x, v_norm).data
# out_features
m_init, v_init = x_init.mean(0).squeeze(
0), x_init.var(0).squeeze(0)
# out_features
scale_init = self.init_scale / \
torch.sqrt(v_init + 1e-10)
self.g.data.copy_(scale_init)
self.b.data.copy_(-m_init * scale_init)
x_init = scale_init.view(1, -1).expand_as(x_init) \
* (x_init - m_init.view(1, -1).expand_as(x_init))
self.V_avg.copy_(self.V.data)
self.g_avg.copy_(self.g.data)
self.b_avg.copy_(self.b.data)
return x_init
else:
v, g, b = get_vars_maybe_avg(self, ['V', 'g', 'b'],
self.training,
polyak_decay=self.polyak_decay)
# batch_size * out_features
x = F.linear(x, v)
scalar = g / torch.norm(v, 2, 1).squeeze(1)
x = scalar.view(1, -1).expand_as(x) * x + \
b.view(1, -1).expand_as(x)
return x
class WeightNormConv2d(nn.Conv2d):
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, init_scale=1.,
polyak_decay=0.9995):
super(WeightNormConv2d, self).__init__(in_channels, out_channels,
kernel_size, stride, padding,
dilation, groups)
self.V = self.weight
self.g = Parameter(torch.Tensor(out_channels))
self.b = self.bias
self.register_buffer('V_avg', torch.zeros(self.V.size()))
self.register_buffer('g_avg', torch.zeros(out_channels))
self.register_buffer('b_avg', torch.zeros(out_channels))
self.init_scale = init_scale
self.polyak_decay = polyak_decay
self.reset_parameters()
def reset_parameters(self):
return
def forward(self, x, init=False):
if init is True:
# out_channels, in_channels // groups, * kernel_size
self.V.data.copy_(torch.randn(self.V.data.size()
).type_as(self.V.data) * 0.05)
v_norm = self.V.data / self.V.data.view(self.out_channels, -1)\
.norm(2, 1).view(self.out_channels, *(
[1] * (len(self.kernel_size) + 1))).expand_as(self.V.data)
x_init = F.conv2d(x, v_norm, None, self.stride,
self.padding, self.dilation, self.groups).data
t_x_init = x_init.transpose(0, 1).contiguous().view(
self.out_channels, -1)
m_init, v_init = t_x_init.mean(1).squeeze(
1), t_x_init.var(1).squeeze(1)
# out_features
scale_init = self.init_scale / \
torch.sqrt(v_init + 1e-10)
self.g.data.copy_(scale_init)
self.b.data.copy_(-m_init * scale_init)
scale_init_shape = scale_init.view(
1, self.out_channels, *([1] * (len(x_init.size()) - 2)))
m_init_shape = m_init.view(
1, self.out_channels, *([1] * (len(x_init.size()) - 2)))
x_init = scale_init_shape.expand_as(
x_init) * (x_init - m_init_shape.expand_as(x_init))
self.V_avg.copy_(self.V.data)
self.g_avg.copy_(self.g.data)
self.b_avg.copy_(self.b.data)
return x_init
else:
v, g, b = get_vars_maybe_avg(
self, ['V', 'g', 'b'], self.training,
polyak_decay=self.polyak_decay)
scalar = torch.norm(v.view(self.out_channels, -1), 2, 1)
if len(scalar.size()) == 2:
scalar = g / scalar.squeeze(1)
else:
scalar = g / scalar
w = scalar.view(self.out_channels, *
([1] * (len(v.size()) - 1))).expand_as(v) * v
x = F.conv2d(x, w, b, self.stride,
self.padding, self.dilation, self.groups)
return x
# This is used nowhere in the code at the moment (Vincent Nguyen 05/18/2018)
class WeightNormConvTranspose2d(nn.ConvTranspose2d):
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, output_padding=0, groups=1, init_scale=1.,
polyak_decay=0.9995):
super(WeightNormConvTranspose2d, self).__init__(
in_channels, out_channels,
kernel_size, stride,
padding, output_padding,
groups)
# in_channels, out_channels, *kernel_size
self.V = self.weight
self.g = Parameter(torch.Tensor(out_channels))
self.b = self.bias
self.register_buffer('V_avg', torch.zeros(self.V.size()))
self.register_buffer('g_avg', torch.zeros(out_channels))
self.register_buffer('b_avg', torch.zeros(out_channels))
self.init_scale = init_scale
self.polyak_decay = polyak_decay
self.reset_parameters()
def reset_parameters(self):
return
def forward(self, x, init=False):
if init is True:
# in_channels, out_channels, *kernel_size
self.V.data.copy_(torch.randn(self.V.data.size()).type_as(
self.V.data) * 0.05)
v_norm = self.V.data / self.V.data.transpose(0, 1).contiguous() \
.view(self.out_channels, -1).norm(2, 1).view(
self.in_channels, self.out_channels,
*([1] * len(self.kernel_size))).expand_as(self.V.data)
x_init = F.conv_transpose2d(
x, v_norm, None, self.stride,
self.padding, self.output_padding, self.groups).data
# self.out_channels, 1
t_x_init = x_init.transpose(0, 1).contiguous().view(
self.out_channels, -1)
# out_features
m_init, v_init = t_x_init.mean(1).squeeze(
1), t_x_init.var(1).squeeze(1)
# out_features
scale_init = self.init_scale / \
torch.sqrt(v_init + 1e-10)
self.g.data.copy_(scale_init)
self.b.data.copy_(-m_init * scale_init)
scale_init_shape = scale_init.view(
1, self.out_channels, *([1] * (len(x_init.size()) - 2)))
m_init_shape = m_init.view(
1, self.out_channels, *([1] * (len(x_init.size()) - 2)))
x_init = scale_init_shape.expand_as(x_init)\
* (x_init - m_init_shape.expand_as(x_init))
self.V_avg.copy_(self.V.data)
self.g_avg.copy_(self.g.data)
self.b_avg.copy_(self.b.data)
return x_init
else:
v, g, b = get_vars_maybe_avg(
self, ['V', 'g', 'b'], self.training,
polyak_decay=self.polyak_decay)
scalar = g / \
torch.norm(v.transpose(0, 1).contiguous().view(
self.out_channels, -1), 2, 1).squeeze(1)
w = scalar.view(self.in_channels, self.out_channels,
*([1] * (len(v.size()) - 2))).expand_as(v) * v
x = F.conv_transpose2d(x, w, b, self.stride,
self.padding, self.output_padding,
self.groups)
return x
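# A small pure-Python sketch of the two ideas used in this file, assuming a
# single output unit and scalar parameters. The helper names
# `weight_norm_row` and `polyak_update` are hypothetical. The first computes
# the reparameterized weight w = g * v / ||v||; the second applies the same
# running-average update as `get_var_maybe_avg`.

```python
import math


def weight_norm_row(v, g):
    """Return w = g * v / ||v|| for the weight vector of one output unit."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def polyak_update(avg, value, decay):
    """avg <- avg - (1 - decay) * (avg - value)."""
    return avg - (1 - decay) * (avg - value)


w = weight_norm_row([3.0, 4.0], g=2.0)
w_norm = math.sqrt(sum(x * x for x in w))

avg = 0.0
for _ in range(3):
    # The tracked value stays at 1.0; the average moves toward it each step.
    avg = polyak_update(avg, 1.0, decay=0.9)
```

# The norm of `w` equals `g` whatever the scale of `v`, which is the point of
# the reparameterization. After n updates toward a constant value of 1.0 the
# average equals 1 - decay ** n.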
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/opts.py
================================================
""" Implementation of all available options """
from __future__ import print_function
import configargparse
from onmt.models.sru import CheckSRU
def config_opts(parser):
parser.add('-config', '--config', required=False,
is_config_file_arg=True, help='config file path')
parser.add('-save_config', '--save_config', required=False,
is_write_out_config_file_arg=True,
help='config file save path')
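# A minimal sketch of the option style used throughout this file, assuming the
# standard-library `argparse` in place of `configargparse`. Every option is
# registered under a double-dash and a single-dash spelling. The helper
# `resolve_rnn_sizes` is hypothetical; it illustrates the `rnn_size` help text
# in this file, which says that option overwrites `enc_rnn_size` and
# `dec_rnn_size`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Both spellings map to the same destination attribute.
    parser.add_argument('--rnn_size', '-rnn_size', type=int, default=-1)
    parser.add_argument('--enc_rnn_size', '-enc_rnn_size',
                        type=int, default=500)
    parser.add_argument('--dec_rnn_size', '-dec_rnn_size',
                        type=int, default=500)
    return parser


def resolve_rnn_sizes(opt):
    """Assumed rule: a positive rnn_size overrides both specific sizes."""
    if opt.rnn_size > 0:
        return opt.rnn_size, opt.rnn_size
    return opt.enc_rnn_size, opt.dec_rnn_size


opt_shared = build_parser().parse_args(['-rnn_size', '256'])
sizes_shared = resolve_rnn_sizes(opt_shared)

opt_split = build_parser().parse_args(['--enc_rnn_size', '300'])
sizes_split = resolve_rnn_sizes(opt_split)
```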
def model_opts(parser):
"""
These options are passed to the construction of the model.
Be careful with these as they will be used during translation.
"""
# Embedding Options
group = parser.add_argument_group('Model-Embeddings')
group.add('--src_word_vec_size', '-src_word_vec_size',
type=int, default=500,
help='Word embedding size for src.')
group.add('--tgt_word_vec_size', '-tgt_word_vec_size',
type=int, default=500,
help='Word embedding size for tgt.')
group.add('--word_vec_size', '-word_vec_size', type=int, default=-1,
help='Word embedding size for src and tgt.')
group.add('--share_decoder_embeddings', '-share_decoder_embeddings',
action='store_true',
help="Use a shared weight matrix for the input and "
"output word embeddings in the decoder.")
group.add('--share_embeddings', '-share_embeddings', action='store_true',
help="Share the word embeddings between encoder "
"and decoder. Need to use shared dictionary for this "
"option.")
group.add('--position_encoding', '-position_encoding', action='store_true',
help="Use sinusoidal encodings to mark relative word positions. "
"Necessary for non-RNN style models.")
group = parser.add_argument_group('Model-Embedding Features')
group.add('--feat_merge', '-feat_merge', type=str, default='concat',
choices=['concat', 'sum', 'mlp'],
help="Merge action for incorporating features embeddings. "
"Options [concat|sum|mlp].")
group.add('--feat_vec_size', '-feat_vec_size', type=int, default=-1,
help="If specified, feature embedding sizes "
"will be set to this. Otherwise, feat_vec_exponent "
"will be used.")
group.add('--feat_vec_exponent', '-feat_vec_exponent',
type=float, default=0.7,
help="If -feat_vec_size is not set, feature "
"embedding sizes will be set to N^feat_vec_exponent "
"where N is the number of values the feature takes.")
# Encoder-Decoder Options
group = parser.add_argument_group('Model- Encoder-Decoder')
group.add('--model_type', '-model_type', default='text',
choices=['text', 'img', 'audio', 'vec'],
help="Type of source model to use. Allows "
"the system to incorporate non-text inputs. "
"Options are [text|img|audio|vec].")
group.add('--model_dtype', '-model_dtype', default='fp32',
choices=['fp32', 'fp16'],
help='Data type of the model.')
group.add('--encoder_type', '-encoder_type', type=str, default='rnn',
choices=['rnn', 'brnn', 'mean', 'transformer', 'cnn'],
help="Type of encoder layer to use. Non-RNN layers "
"are experimental. Options are "
"[rnn|brnn|mean|transformer|cnn].")
group.add('--decoder_type', '-decoder_type', type=str, default='rnn',
choices=['rnn', 'transformer', 'cnn'],
help="Type of decoder layer to use. Non-RNN layers "
"are experimental. Options are "
"[rnn|transformer|cnn].")
group.add('--layers', '-layers', type=int, default=-1,
help='Number of layers in enc/dec.')
group.add('--enc_layers', '-enc_layers', type=int, default=2,
help='Number of layers in the encoder')
group.add('--dec_layers', '-dec_layers', type=int, default=2,
help='Number of layers in the decoder')
group.add('--rnn_size', '-rnn_size', type=int, default=-1,
help="Size of rnn hidden states. Overwrites "
"enc_rnn_size and dec_rnn_size")
group.add('--enc_rnn_size', '-enc_rnn_size', type=int, default=500,
help="Size of encoder rnn hidden states. "
"Must be equal to dec_rnn_size except for "
"speech-to-text.")
group.add('--dec_rnn_size', '-dec_rnn_size', type=int, default=500,
help="Size of decoder rnn hidden states. "
"Must be equal to enc_rnn_size except for "
"speech-to-text.")
group.add('--audio_enc_pooling', '-audio_enc_pooling',
type=str, default='1',
help="The amount of pooling of audio encoder, "
"either the same amount of pooling across all layers "
"indicated by a single number, or different amounts of "
"pooling per layer separated by comma.")
group.add('--cnn_kernel_width', '-cnn_kernel_width', type=int, default=3,
help="Size of windows in the cnn, the kernel_size is "
"(cnn_kernel_width, 1) in conv layer")
group.add('--input_feed', '-input_feed', type=int, default=1,
help="Feed the context vector at each time step as "
"additional input (via concatenation with the word "
"embeddings) to the decoder.")
group.add('--bridge', '-bridge', action="store_true",
help="Have an additional layer between the last encoder "
"state and the first decoder state")
group.add('--rnn_type', '-rnn_type', type=str, default='LSTM',
choices=['LSTM', 'GRU', 'SRU'],
action=CheckSRU,
help="The gate type to use in the RNNs")
# group.add('--residual', '-residual', action="store_true",
# help="Add residual connections between RNN layers.")
group.add('--brnn', '-brnn', action=DeprecateAction,
help="Deprecated, use `encoder_type`.")
group.add('--context_gate', '-context_gate', type=str, default=None,
choices=['source', 'target', 'both'],
help="Type of context gate to use. "
"Do not select for no context gate.")
# Attention options
group = parser.add_argument_group('Model- Attention')
group.add('--global_attention', '-global_attention',
type=str, default='general',
choices=['dot', 'general', 'mlp', 'none'],
help="The attention type to use: "
"dotprod or general (Luong) or MLP (Bahdanau)")
group.add('--global_attention_function', '-global_attention_function',
type=str, default="softmax", choices=["softmax", "sparsemax"])
group.add('--self_attn_type', '-self_attn_type',
type=str, default="scaled-dot",
help='Self attention type in Transformer decoder '
'layer -- currently "scaled-dot" or "average" ')
group.add('--max_relative_positions', '-max_relative_positions',
type=int, default=0,
help="Maximum distance between inputs in relative "
"positions representations. "
"For more detailed information, see: "
"https://arxiv.org/pdf/1803.02155.pdf")
group.add('--heads', '-heads', type=int, default=8,
help='Number of heads for transformer self-attention')
group.add('--transformer_ff', '-transformer_ff', type=int, default=2048,
help='Size of hidden transformer feed-forward')
group.add('--aan_useffn', '-aan_useffn', action="store_true",
help='Turn on the FFN layer in the AAN decoder')
# Alignment options
group = parser.add_argument_group('Model - Alignment')
group.add('--lambda_align', '-lambda_align', type=float, default=0.0,
help="Lambda value for alignment loss of Garg et al. (2019). "
"For more detailed information, see: "
"https://arxiv.org/abs/1909.02074")
group.add('--alignment_layer', '-alignment_layer', type=int, default=-3,
help='Layer number which has to be supervised.')
group.add('--alignment_heads', '-alignment_heads', type=int, default=None,
help='Number of cross-attention heads per layer to supervise.')
group.add('--full_context_alignment', '-full_context_alignment',
action="store_true",
help='Whether alignment is conditioned on full target context.')
# Generator and loss options.
group = parser.add_argument_group('Generator')
group.add('--copy_attn', '-copy_attn', action="store_true",
help='Train copy attention layer.')
group.add('--copy_attn_type', '-copy_attn_type',
type=str, default=None,
choices=['dot', 'general', 'mlp', 'none'],
help="The copy attention type to use. Leave as None to use "
"the same as -global_attention.")
group.add('--generator_function', '-generator_function', default="softmax",
choices=["softmax", "sparsemax"],
help="Which function to use for generating "
"probabilities over the target vocabulary (choices: "
"softmax, sparsemax)")
group.add('--copy_attn_force', '-copy_attn_force', action="store_true",
help='When available, train to copy.')
group.add('--reuse_copy_attn', '-reuse_copy_attn', action="store_true",
help="Reuse standard attention for copy")
group.add('--copy_loss_by_seqlength', '-copy_loss_by_seqlength',
action="store_true",
help="Divide copy loss by length of sequence")
group.add('--coverage_attn', '-coverage_attn', action="store_true",
help='Train a coverage attention layer.')
group.add('--lambda_coverage', '-lambda_coverage', type=float, default=0.0,
help='Lambda value for coverage loss of See et al (2017)')
group.add('--loss_scale', '-loss_scale', type=float, default=0,
help="For FP16 training, the static loss scale to use. If not "
"set, the loss scale is dynamically computed.")
group.add('--apex_opt_level', '-apex_opt_level', type=str, default="O1",
choices=["O0", "O1", "O2", "O3"],
help="For FP16 training, the opt_level to use. "
"See https://nvidia.github.io/apex/amp.html#opt-levels.")
def preprocess_opts(parser):
""" Pre-processing options """
# Data options
group = parser.add_argument_group('Data')
group.add('--data_type', '-data_type', default="text",
help="Type of the source input. "
"Options are [text|img|audio|vec].")
group.add('--train_src', '-train_src', required=True, nargs='+',
help="Path(s) to the training source data")
group.add('--train_tgt', '-train_tgt', required=True, nargs='+',
help="Path(s) to the training target data")
group.add('--train_align', '-train_align', nargs='+', default=[None],
help="Path(s) to the training src-tgt alignment")
group.add('--train_ids', '-train_ids', nargs='+', default=[None],
help="ids to name training shards, used for corpus weighting")
group.add('--valid_src', '-valid_src',
help="Path to the validation source data")
group.add('--valid_tgt', '-valid_tgt',
help="Path to the validation target data")
group.add('--valid_align', '-valid_align', default=None,
help="Path(s) to the validation src-tgt alignment")
group.add('--src_dir', '-src_dir', default="",
help="Source directory for image or audio files.")
group.add('--save_data', '-save_data', required=True,
help="Output file for the prepared data")
group.add('--max_shard_size', '-max_shard_size', type=int, default=0,
help="""Deprecated, use shard_size instead.""")
group.add('--shard_size', '-shard_size', type=int, default=1000000,
help="Divide src_corpus and tgt_corpus into "
"multiple smaller src_corpus and tgt_corpus files, then "
"build shards; each shard will have "
"opt.shard_size samples except the last shard. "
"shard_size=0 means no segmentation; "
"shard_size>0 means segment the dataset into multiple shards, "
"each with shard_size samples.")
group.add('--num_threads', '-num_threads', type=int, default=1,
help="Number of shards to build in parallel.")
group.add('--overwrite', '-overwrite', action="store_true",
help="Overwrite existing shards if any.")
# Dictionary options, for text corpus
group = parser.add_argument_group('Vocab')
# if you want to pass an existing vocab.pt file, pass it to
# -src_vocab alone as it already contains tgt vocab.
group.add('--src_vocab', '-src_vocab', default="",
help="Path to an existing source vocabulary. Format: "
"one word per line.")
group.add('--tgt_vocab', '-tgt_vocab', default="",
help="Path to an existing target vocabulary. Format: "
"one word per line.")
group.add('--features_vocabs_prefix', '-features_vocabs_prefix',
type=str, default='',
help="Path prefix to existing features vocabularies")
group.add('--src_vocab_size', '-src_vocab_size', type=int, default=50000,
help="Size of the source vocabulary")
group.add('--tgt_vocab_size', '-tgt_vocab_size', type=int, default=50000,
help="Size of the target vocabulary")
group.add('--vocab_size_multiple', '-vocab_size_multiple',
type=int, default=1,
help="Make the vocabulary size a multiple of this value")
group.add('--src_words_min_frequency',
'-src_words_min_frequency', type=int, default=0)
group.add('--tgt_words_min_frequency',
'-tgt_words_min_frequency', type=int, default=0)
group.add('--dynamic_dict', '-dynamic_dict', action='store_true',
help="Create dynamic dictionaries")
group.add('--share_vocab', '-share_vocab', action='store_true',
help="Share source and target vocabulary")
# Truncation options, for text corpus
group = parser.add_argument_group('Pruning')
group.add('--src_seq_length', '-src_seq_length', type=int, default=50,
help="Maximum source sequence length")
group.add('--src_seq_length_trunc', '-src_seq_length_trunc',
type=int, default=None,
help="Truncate source sequence length.")
group.add('--tgt_seq_length', '-tgt_seq_length', type=int, default=50,
help="Maximum target sequence length to keep.")
group.add('--tgt_seq_length_trunc', '-tgt_seq_length_trunc',
type=int, default=None,
help="Truncate target sequence length.")
group.add('--lower', '-lower', action='store_true', help='lowercase data')
group.add('--filter_valid', '-filter_valid', action='store_true',
help='Filter validation data by src and/or tgt length')
# Data processing options
group = parser.add_argument_group('Random')
group.add('--shuffle', '-shuffle', type=int, default=0,
help="Shuffle data")
group.add('--seed', '-seed', type=int, default=3435,
help="Random seed")
group = parser.add_argument_group('Logging')
group.add('--report_every', '-report_every', type=int, default=100000,
help="Report status every this many sentences")
group.add('--log_file', '-log_file', type=str, default="",
help="Output logs to a file under this path.")
group.add('--log_file_level', '-log_file_level', type=str,
action=StoreLoggingLevelAction,
choices=StoreLoggingLevelAction.CHOICES,
default="0")
# Options most relevant to speech
group = parser.add_argument_group('Speech')
group.add('--sample_rate', '-sample_rate', type=int, default=16000,
help="Sample rate.")
group.add('--window_size', '-window_size', type=float, default=.02,
help="Window size for spectrogram in seconds.")
group.add('--window_stride', '-window_stride', type=float, default=.01,
help="Window stride for spectrogram in seconds.")
group.add('--window', '-window', default='hamming',
help="Window type for spectrogram generation.")
# Option most relevant to image input
group.add('--image_channel_size', '-image_channel_size',
type=int, default=3,
choices=[3, 1],
help="Using grayscale images can make the model "
"smaller and training faster.")
def train_opts(parser):
""" Training and saving options """
group = parser.add_argument_group('General')
group.add('--data', '-data', required=True,
help='Path prefix to the ".train.pt" and '
'".valid.pt" file path from preprocess.py')
group.add('--data_ids', '-data_ids', nargs='+', default=[None],
help="In case there are several corpora.")
group.add('--data_weights', '-data_weights', type=int, nargs='+',
default=[1], help="""Weights of different corpora,
should follow the same order as in -data_ids.""")
group.add('--save_model', '-save_model', default='model',
help="Model filename (the model will be saved as "
"<save_model>_N.pt where N is the number "
"of steps)")
group.add('--save_checkpoint_steps', '-save_checkpoint_steps',
type=int, default=5000,
help="""Save a checkpoint every X steps""")
group.add('--keep_checkpoint', '-keep_checkpoint', type=int, default=-1,
help="Keep X checkpoints (negative: keep all)")
# GPU
group.add('--gpuid', '-gpuid', default=[], nargs='*', type=int,
help="Deprecated, see world_size and gpu_ranks.")
group.add('--gpu_ranks', '-gpu_ranks', default=[], nargs='*', type=int,
help="list of ranks of each process.")
group.add('--world_size', '-world_size', default=1, type=int,
help="total number of distributed processes.")
group.add('--gpu_backend', '-gpu_backend',
default="nccl", type=str,
help="Type of torch distributed backend")
group.add('--gpu_verbose_level', '-gpu_verbose_level', default=0, type=int,
help="Gives more info on each process per GPU.")
group.add('--master_ip', '-master_ip', default="localhost", type=str,
help="IP of master for torch.distributed training.")
group.add('--master_port', '-master_port', default=10000, type=int,
help="Port of master for torch.distributed training.")
group.add('--queue_size', '-queue_size', default=400, type=int,
help="Size of queue for each process in producer/consumer")
group.add('--seed', '-seed', type=int, default=-1,
help="Random seed used for the experiments "
"reproducibility.")
# Init options
group = parser.add_argument_group('Initialization')
group.add('--param_init', '-param_init', type=float, default=0.1,
help="Parameters are initialized over uniform distribution "
"with support (-param_init, param_init). "
"Use 0 to not use initialization")
group.add('--param_init_glorot', '-param_init_glorot', action='store_true',
help="Init parameters with xavier_uniform. "
"Required for transformer.")
group.add('--train_from', '-train_from', default='', type=str,
help="If training from a checkpoint then this is the "
"path to the pretrained model's state_dict.")
group.add('--reset_optim', '-reset_optim', default='none',
choices=['none', 'all', 'states', 'keep_states'],
help="Optimization resetter when train_from.")
# Pretrained word vectors
group.add('--pre_word_vecs_enc', '-pre_word_vecs_enc',
help="If a valid path is specified, then this will load "
"pretrained word embeddings on the encoder side. "
"See README for specific formatting instructions.")
group.add('--pre_word_vecs_dec', '-pre_word_vecs_dec',
help="If a valid path is specified, then this will load "
"pretrained word embeddings on the decoder side. "
"See README for specific formatting instructions.")
# Fixed word vectors
group.add('--fix_word_vecs_enc', '-fix_word_vecs_enc',
action='store_true',
help="Fix word embeddings on the encoder side.")
group.add('--fix_word_vecs_dec', '-fix_word_vecs_dec',
action='store_true',
help="Fix word embeddings on the decoder side.")
# Optimization options
group = parser.add_argument_group('Optimization- Type')
group.add('--batch_size', '-batch_size', type=int, default=64,
help='Maximum batch size for training')
group.add('--batch_type', '-batch_type', default='sents',
choices=["sents", "tokens"],
help="Batch grouping for batch_size. Standard "
"is sents. Tokens will do dynamic batching")
group.add('--pool_factor', '-pool_factor', type=int, default=8192,
help="""Factor used in data loading and batch creations.
It will load the equivalent of `pool_factor` batches,
sort them by the corresponding `sort_key` to produce
homogeneous batches and reduce padding, and yield
the produced batches in a shuffled way.
Inspired by torchtext's pool mechanism.""")
group.add('--normalization', '-normalization', default='sents',
choices=["sents", "tokens"],
help='Normalization method of the gradient.')
group.add('--accum_count', '-accum_count', type=int, nargs='+',
default=[1],
help="Accumulate gradient this many times. "
"Approximately equivalent to updating "
"batch_size * accum_count batches at once. "
"Recommended for Transformer.")
group.add('--accum_steps', '-accum_steps', type=int, nargs='+',
default=[0], help="Steps at which accum_count values change")
group.add('--valid_steps', '-valid_steps', type=int, default=10000,
help='Perform validation every X steps')
group.add('--valid_batch_size', '-valid_batch_size', type=int, default=32,
help='Maximum batch size for validation')
group.add('--max_generator_batches', '-max_generator_batches',
type=int, default=32,
help="Maximum batches of words in a sequence to run "
"the generator on in parallel. Higher is faster, but "
"uses more memory. Set to 0 to disable.")
group.add('--train_steps', '-train_steps', type=int, default=100000,
help='Number of training steps')
group.add('--single_pass', '-single_pass', action='store_true',
help="Make a single pass over the training dataset.")
group.add('--epochs', '-epochs', type=int, default=0,
help='Deprecated; use train_steps instead.')
group.add('--early_stopping', '-early_stopping', type=int, default=0,
help='Number of validation steps without improving.')
group.add('--early_stopping_criteria', '-early_stopping_criteria',
nargs="*", default=None,
help='Criteria to use for early stopping.')
group.add('--optim', '-optim', default='sgd',
choices=['sgd', 'adagrad', 'adadelta', 'adam',
'sparseadam', 'adafactor', 'fusedadam'],
help="Optimization method.")
group.add('--adagrad_accumulator_init', '-adagrad_accumulator_init',
type=float, default=0,
help="Initializes the accumulator values in adagrad. "
"Mirrors the initial_accumulator_value option "
"in the tensorflow adagrad (use 0.1 for their default).")
group.add('--max_grad_norm', '-max_grad_norm', type=float, default=5,
help="If the norm of the gradient vector exceeds this, "
"renormalize it to have the norm equal to "
"max_grad_norm")
group.add('--dropout', '-dropout', type=float, default=[0.3], nargs='+',
help="Dropout probability; applied in LSTM stacks.")
group.add('--attention_dropout', '-attention_dropout', type=float,
default=[0.1], nargs='+',
help="Attention Dropout probability.")
group.add('--dropout_steps', '-dropout_steps', type=int, nargs='+',
default=[0], help="Steps at which dropout changes.")
group.add('--truncated_decoder', '-truncated_decoder', type=int, default=0,
help="""Truncated bptt.""")
group.add('--adam_beta1', '-adam_beta1', type=float, default=0.9,
help="The beta1 parameter used by Adam. "
"Almost without exception a value of 0.9 is used in "
"the literature, seemingly giving good results, "
"so we would discourage changing this value from "
"the default without due consideration.")
group.add('--adam_beta2', '-adam_beta2', type=float, default=0.999,
help='The beta2 parameter used by Adam. '
'Typically a value of 0.999 is recommended, as this is '
'the value suggested by the original paper describing '
'Adam, and is also the value adopted in other frameworks '
'such as Tensorflow and Keras, i.e. see: '
'https://www.tensorflow.org/api_docs/python/tf/train/Adam'
'Optimizer or https://keras.io/optimizers/ . '
'Whereas recently the paper "Attention is All You Need" '
'suggested a value of 0.98 for beta2, this parameter may '
'not work well for normal models / default '
'baselines.')
group.add('--label_smoothing', '-label_smoothing', type=float, default=0.0,
help="Label smoothing value epsilon. "
"Probabilities of all non-true labels "
"will be smoothed by epsilon / (vocab_size - 1). "
"Set to zero to turn off label smoothing. "
"For more detailed information, see: "
"https://arxiv.org/abs/1512.00567")
group.add('--average_decay', '-average_decay', type=float, default=0,
help="Moving average decay. "
"Set to other than 0 (e.g. 1e-4) to activate. "
"Similar to Marian NMT implementation: "
"http://www.aclweb.org/anthology/P18-4020 "
"For more detail on Exponential Moving Average: "
"https://en.wikipedia.org/wiki/Moving_average")
group.add('--average_every', '-average_every', type=int, default=1,
help="Step for moving average. "
"Default is every update, "
"if -average_decay is set.")
# learning rate
group = parser.add_argument_group('Optimization- Rate')
group.add('--learning_rate', '-learning_rate', type=float, default=1.0,
help="Starting learning rate. "
"Recommended settings: sgd = 1, adagrad = 0.1, "
"adadelta = 1, adam = 0.001")
group.add('--learning_rate_decay', '-learning_rate_decay',
type=float, default=0.5,
help="If update_learning_rate, decay learning rate by "
"this much if steps have gone past "
"start_decay_steps")
group.add('--start_decay_steps', '-start_decay_steps',
type=int, default=50000,
help="Start decaying every decay_steps after "
"start_decay_steps")
group.add('--decay_steps', '-decay_steps', type=int, default=10000,
help="Decay every decay_steps")
group.add('--decay_method', '-decay_method', type=str, default="none",
choices=['noam', 'noamwd', 'rsqrt', 'none'],
help="Use a custom decay rate.")
group.add('--warmup_steps', '-warmup_steps', type=int, default=4000,
help="Number of warmup steps for custom decay.")
group = parser.add_argument_group('Logging')
group.add('--report_every', '-report_every', type=int, default=50,
help="Print stats at this interval.")
group.add('--log_file', '-log_file', type=str, default="",
help="Output logs to a file under this path.")
group.add('--log_file_level', '-log_file_level', type=str,
action=StoreLoggingLevelAction,
choices=StoreLoggingLevelAction.CHOICES,
default="0")
group.add('--exp_host', '-exp_host', type=str, default="",
help="Send logs to this crayon server.")
group.add('--exp', '-exp', type=str, default="",
help="Name of the experiment for logging.")
# Use Tensorboard for visualization during training
group.add('--tensorboard', '-tensorboard', action="store_true",
help="Use tensorboard for visualization during training. "
"Must have the library tensorboard >= 1.14.")
group.add("--tensorboard_log_dir", "-tensorboard_log_dir",
type=str, default="runs/onmt",
help="Log directory for Tensorboard. "
"This is also the name of the run.")
group = parser.add_argument_group('Speech')
# Options most relevant to speech
group.add('--sample_rate', '-sample_rate', type=int, default=16000,
help="Sample rate.")
group.add('--window_size', '-window_size', type=float, default=.02,
help="Window size for spectrogram in seconds.")
# Option most relevant to image input
group.add('--image_channel_size', '-image_channel_size',
type=int, default=3, choices=[3, 1],
help="Using grayscale images can make training "
"faster and the model smaller.")
def translate_opts(parser):
""" Translation / inference options """
group = parser.add_argument_group('Model')
group.add('--model', '-model', dest='models', metavar='MODEL',
nargs='+', type=str, default=[], required=True,
help="Path to model .pt file(s). "
"Multiple models can be specified, "
"for ensemble decoding.")
group.add('--fp32', '-fp32', action='store_true',
help="Force the model to be in FP32 "
"because FP16 is very slow on GTX1080(ti).")
group.add('--avg_raw_probs', '-avg_raw_probs', action='store_true',
help="If this is set, during ensembling scores from "
"different models will be combined by averaging their "
"raw probabilities and then taking the log. Otherwise, "
"the log probabilities will be averaged directly. "
"Necessary for models whose output layers can assign "
"zero probability.")
group = parser.add_argument_group('Data')
group.add('--data_type', '-data_type', default="text",
help="Type of the source input. Options: [text|img].")
group.add('--src', '-src', required=True,
help="Source sequence to decode (one line per "
"sequence)")
group.add('--src_dir', '-src_dir', default="",
help='Source directory for image or audio files')
group.add('--tgt', '-tgt',
help='True target sequence (optional)')
group.add('--shard_size', '-shard_size', type=int, default=10000,
help="Divide src and tgt (if applicable) into "
"multiple smaller src and tgt files, then "
"build shards. Each shard will have "
"opt.shard_size samples, except the last shard. "
"shard_size=0 means no segmentation; "
"shard_size>0 means segment the dataset into multiple "
"shards, each with shard_size samples.")
group.add('--output', '-output', default='pred.txt',
help="Path to output the predictions (each line will "
"be the decoded sequence)")
group.add('--report_align', '-report_align', action='store_true',
help="Report alignment for each translation.")
group.add('--report_bleu', '-report_bleu', action='store_true',
help="Report bleu score after translation, "
"call tools/multi-bleu.perl on command line")
group.add('--report_rouge', '-report_rouge', action='store_true',
help="Report rouge 1/2/3/L/SU4 score after translation, "
"call tools/test_rouge.py on command line")
group.add('--report_time', '-report_time', action='store_true',
help="Report some translation time metrics")
# Options most relevant to summarization.
group.add('--dynamic_dict', '-dynamic_dict', action='store_true',
help="Create dynamic dictionaries")
group.add('--share_vocab', '-share_vocab', action='store_true',
help="Share source and target vocabulary")
group = parser.add_argument_group('Random Sampling')
group.add('--random_sampling_topk', '-random_sampling_topk',
default=1, type=int,
help="Set this to -1 to do random sampling from full "
"distribution. Set this to value k>1 to do random "
"sampling restricted to the k most likely next tokens. "
"Set this to 1 to use argmax or for doing beam "
"search.")
group.add('--random_sampling_temp', '-random_sampling_temp',
default=1., type=float,
help="If doing random sampling, divide the logits by "
"this before computing softmax during decoding.")
group.add('--seed', '-seed', type=int, default=829,
help="Random seed")
group = parser.add_argument_group('Beam')
group.add('--beam_size', '-beam_size', type=int, default=5,
help='Beam size')
group.add('--min_length', '-min_length', type=int, default=0,
help='Minimum prediction length')
group.add('--max_length', '-max_length', type=int, default=100,
help='Maximum prediction length.')
group.add('--max_sent_length', '-max_sent_length', action=DeprecateAction,
help="Deprecated, use `-max_length` instead")
# Alpha and Beta values for Google Length + Coverage penalty
# Described here: https://arxiv.org/pdf/1609.08144.pdf, Section 7
group.add('--stepwise_penalty', '-stepwise_penalty', action='store_true',
help="Apply penalty at every decoding step. "
"Helpful for summary penalty.")
group.add('--length_penalty', '-length_penalty', default='none',
choices=['none', 'wu', 'avg'],
help="Length Penalty to use.")
group.add('--ratio', '-ratio', type=float, default=-0.,
help="Ratio based beam stop condition")
group.add('--coverage_penalty', '-coverage_penalty', default='none',
choices=['none', 'wu', 'summary'],
help="Coverage Penalty to use.")
group.add('--alpha', '-alpha', type=float, default=0.,
help="Google NMT length penalty parameter "
"(higher = longer generation)")
group.add('--beta', '-beta', type=float, default=-0.,
help="Coverage penalty parameter")
group.add('--block_ngram_repeat', '-block_ngram_repeat',
type=int, default=0,
help='Block repetition of ngrams during decoding.')
group.add('--ignore_when_blocking', '-ignore_when_blocking',
nargs='+', type=str, default=[],
help="Ignore these strings when blocking repeats. "
"You want to block sentence delimiters.")
group.add('--replace_unk', '-replace_unk', action="store_true",
help="Replace the generated UNK tokens with the "
"source token that had highest attention weight. If "
"phrase_table is provided, it will look up the "
"identified source token and give the corresponding "
"target token. If it is not provided (or the identified "
"source token does not exist in the table), then it "
"will copy the source token.")
group.add('--phrase_table', '-phrase_table', type=str, default="",
help="If phrase_table is provided (with replace_unk), it will "
"look up the identified source token and give the "
"corresponding target token. If it is not provided "
"(or the identified source token does not exist in "
"the table), then it will copy the source token.")
group = parser.add_argument_group('Logging')
group.add('--verbose', '-verbose', action="store_true",
help='Print scores and predictions for each sentence')
group.add('--log_file', '-log_file', type=str, default="",
help="Output logs to a file under this path.")
group.add('--log_file_level', '-log_file_level', type=str,
action=StoreLoggingLevelAction,
choices=StoreLoggingLevelAction.CHOICES,
default="0")
group.add('--attn_debug', '-attn_debug', action="store_true",
help='Print best attn for each word')
group.add('--align_debug', '-align_debug', action="store_true",
help='Print best align for each word')
group.add('--dump_beam', '-dump_beam', type=str, default="",
help='File to dump beam information to.')
group.add('--n_best', '-n_best', type=int, default=1,
help="If verbose is set, will output the n_best "
"decoded sentences")
group = parser.add_argument_group('Efficiency')
group.add('--batch_size', '-batch_size', type=int, default=30,
help='Batch size')
group.add('--batch_type', '-batch_type', default='sents',
choices=["sents", "tokens"],
help="Batch grouping for batch_size. Standard "
"is sents. Tokens will do dynamic batching")
group.add('--gpu', '-gpu', type=int, default=-1,
help="Device to run on")
# Options most relevant to speech.
group = parser.add_argument_group('Speech')
group.add('--sample_rate', '-sample_rate', type=int, default=16000,
help="Sample rate.")
group.add('--window_size', '-window_size', type=float, default=.02,
help='Window size for spectrogram in seconds')
group.add('--window_stride', '-window_stride', type=float, default=.01,
help='Window stride for spectrogram in seconds')
group.add('--window', '-window', default='hamming',
help='Window type for spectrogram generation')
# Option most relevant to image input
group.add('--image_channel_size', '-image_channel_size',
type=int, default=3, choices=[3, 1],
help="Using grayscale images can make training "
"faster and the model smaller.")
# Copyright 2016 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
class StoreLoggingLevelAction(configargparse.Action):
""" Convert string to logging level """
import logging
LEVELS = {
"CRITICAL": logging.CRITICAL,
"ERROR": logging.ERROR,
"WARNING": logging.WARNING,
"INFO": logging.INFO,
"DEBUG": logging.DEBUG,
"NOTSET": logging.NOTSET
}
CHOICES = list(LEVELS.keys()) + [str(_) for _ in LEVELS.values()]
def __init__(self, option_strings, dest, help=None, **kwargs):
super(StoreLoggingLevelAction, self).__init__(
option_strings, dest, help=help, **kwargs)
def __call__(self, parser, namespace, value, option_string=None):
# Get the key 'value' in the dict, or just use 'value'
level = StoreLoggingLevelAction.LEVELS.get(value, value)
setattr(namespace, self.dest, level)
class DeprecateAction(configargparse.Action):
""" Deprecate action """
def __init__(self, option_strings, dest, help=None, **kwargs):
super(DeprecateAction, self).__init__(option_strings, dest, nargs=0,
help=help, **kwargs)
def __call__(self, parser, namespace, values, flag_name):
help = self.help if self.help is not None else ""
msg = "Flag '%s' is deprecated. %s" % (flag_name, help)
raise configargparse.ArgumentTypeError(msg)
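The two custom actions above can be tried in isolation. The following is a minimal sketch that uses the standard-library `argparse` in place of `configargparse`; the class names `StoreLoggingLevel` and `Deprecate` and the helper `parse_demo` are hypothetical and exist only for this illustration.

```python
import argparse
import logging


class StoreLoggingLevel(argparse.Action):
    """Map a level name such as "INFO" to its numeric logging level."""
    LEVELS = {
        "CRITICAL": logging.CRITICAL,
        "ERROR": logging.ERROR,
        "WARNING": logging.WARNING,
        "INFO": logging.INFO,
        "DEBUG": logging.DEBUG,
        "NOTSET": logging.NOTSET,
    }

    def __call__(self, parser, namespace, value, option_string=None):
        # Values that are not level names (e.g. "20") are stored unchanged.
        setattr(namespace, self.dest, self.LEVELS.get(value, value))


class Deprecate(argparse.Action):
    """Reject a flag that is kept only to print a migration hint."""

    def __init__(self, option_strings, dest, help=None, **kwargs):
        # nargs=0: the deprecated flag consumes no value.
        super().__init__(option_strings, dest, nargs=0, help=help, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        raise argparse.ArgumentTypeError(
            "Flag '%s' is deprecated. %s" % (option_string, self.help or ""))


def parse_demo(argv):
    """Hypothetical helper: build a tiny parser and parse `argv`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-log_file_level", action=StoreLoggingLevel,
                        default="0")
    parser.add_argument("-max_sent_length", action=Deprecate,
                        help="Use -max_length instead")
    return parser.parse_args(argv)
```

Calling `parse_demo(["-log_file_level", "INFO"])` stores the numeric level, while passing `-max_sent_length` raises an error that carries the help text as the migration hint.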
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/train_single.py
================================================
#!/usr/bin/env python
"""Training on a single process."""
import os
import torch
from onmt.inputters.inputter import build_dataset_iter, \
load_old_vocab, old_style_vocab, build_dataset_iter_multiple
from onmt.model_builder import build_model
from onmt.utils.optimizers import Optimizer
from onmt.utils.misc import set_random_seed
from onmt.trainer import build_trainer
from onmt.models import build_model_saver
from onmt.utils.logging import init_logger, logger
from onmt.utils.parse import ArgumentParser
def _check_save_model_path(opt):
save_model_path = os.path.abspath(opt.save_model)
model_dirname = os.path.dirname(save_model_path)
if not os.path.exists(model_dirname):
os.makedirs(model_dirname)
def _tally_parameters(model):
enc = 0
dec = 0
for name, param in model.named_parameters():
if 'encoder' in name:
enc += param.nelement()
else:
dec += param.nelement()
return enc + dec, enc, dec
def configure_process(opt, device_id):
if device_id >= 0:
torch.cuda.set_device(device_id)
set_random_seed(opt.seed, device_id >= 0)
def main(opt, device_id, batch_queue=None, semaphore=None):
# NOTE: It's important that ``opt`` has been validated and updated
# at this point.
configure_process(opt, device_id)
init_logger(opt.log_file)
assert len(opt.accum_count) == len(opt.accum_steps), \
'Number of accum_count values must match number of accum_steps'
# Load checkpoint if we resume from a previous training.
if opt.train_from:
logger.info('Loading checkpoint from %s' % opt.train_from)
checkpoint = torch.load(opt.train_from,
map_location=lambda storage, loc: storage)
model_opt = ArgumentParser.ckpt_model_opts(checkpoint["opt"])
ArgumentParser.update_model_opts(model_opt)
ArgumentParser.validate_model_opts(model_opt)
logger.info('Loading vocab from checkpoint at %s.' % opt.train_from)
vocab = checkpoint['vocab']
else:
checkpoint = None
model_opt = opt
vocab = torch.load(opt.data + '.vocab.pt')
# check for code where vocab is saved instead of fields
# (in the future this will be done in a smarter way)
if old_style_vocab(vocab):
fields = load_old_vocab(
vocab, opt.model_type, dynamic_dict=opt.copy_attn)
else:
fields = vocab
# Report src and tgt vocab sizes, including for features
for side in ['src', 'tgt']:
f = fields[side]
try:
f_iter = iter(f)
except TypeError:
f_iter = [(side, f)]
for sn, sf in f_iter:
if sf.use_vocab:
logger.info(' * %s vocab size = %d' % (sn, len(sf.vocab)))
# Build model.
model = build_model(model_opt, opt, fields, checkpoint)
n_params, enc, dec = _tally_parameters(model)
logger.info('encoder: %d' % enc)
logger.info('decoder: %d' % dec)
logger.info('* number of parameters: %d' % n_params)
_check_save_model_path(opt)
# Build optimizer.
optim = Optimizer.from_opt(model, opt, checkpoint=checkpoint)
# Build model saver
model_saver = build_model_saver(model_opt, opt, model, fields, optim)
trainer = build_trainer(
opt, device_id, model, fields, optim, model_saver=model_saver)
if batch_queue is None:
if len(opt.data_ids) > 1:
train_shards = []
for train_id in opt.data_ids:
shard_base = "train_" + train_id
train_shards.append(shard_base)
train_iter = build_dataset_iter_multiple(train_shards, fields, opt)
else:
if opt.data_ids[0] is not None:
shard_base = "train_" + opt.data_ids[0]
else:
shard_base = "train"
train_iter = build_dataset_iter(shard_base, fields, opt)
else:
assert semaphore is not None, \
"Using batch_queue requires semaphore as well"
def _train_iter():
while True:
batch = batch_queue.get()
semaphore.release()
yield batch
train_iter = _train_iter()
valid_iter = build_dataset_iter(
"valid", fields, opt, is_train=False)
if len(opt.gpu_ranks):
logger.info('Starting training on GPU: %s' % opt.gpu_ranks)
else:
logger.info('Starting training on CPU, could be very slow')
train_steps = opt.train_steps
if opt.single_pass and train_steps > 0:
logger.warning("Option single_pass is enabled, ignoring train_steps.")
train_steps = 0
trainer.train(
train_iter,
train_steps,
save_checkpoint_steps=opt.save_checkpoint_steps,
valid_iter=valid_iter,
valid_steps=opt.valid_steps)
if trainer.report_manager.tensorboard_writer is not None:
trainer.report_manager.tensorboard_writer.close()
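The `batch_queue` / `semaphore` branch of `main` only shows the consumer side. The sketch below illustrates how such a pairing can bound the number of batches in flight. The producer function, the thread-based setup, and the names `consume`, `produce`, and `run_demo` are assumptions made for this example; they are not taken from the code above.

```python
import queue
import threading


def consume(batch_queue, semaphore):
    """Yield batches forever, freeing one producer slot per batch."""
    while True:
        batch = batch_queue.get()
        semaphore.release()
        yield batch


def produce(batches, batch_queue, semaphore):
    """Assumed producer: wait for a free slot before queueing a batch."""
    for batch in batches:
        semaphore.acquire()  # blocks while too many batches are in flight
        batch_queue.put(batch)


def run_demo(n_batches=6, in_flight=2):
    """Hypothetical driver: push integers through the queue."""
    batch_queue = queue.Queue()
    semaphore = threading.Semaphore(in_flight)
    producer = threading.Thread(
        target=produce,
        args=(range(n_batches), batch_queue, semaphore),
        daemon=True)
    producer.start()
    train_iter = consume(batch_queue, semaphore)
    # Pull exactly n_batches items; the generator itself never ends.
    received = [next(train_iter) for _ in range(n_batches)]
    producer.join()
    return received
```

Because the consumer releases the semaphore only after taking a batch, the producer can never be more than `in_flight` batches ahead of training.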
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/trainer.py
================================================
"""
This is the loadable seq2seq trainer library that is
in charge of training details, loss compute, and statistics.
See train.py for a use case of this library.
Note: To keep this a general library, we implement *only* the
mechanism here (i.e. what to do) and leave the strategy to
users (i.e. how to do it). See train.py (one of the users of
this library) for the strategy we use.
"""
import torch
import traceback
import onmt.utils
from onmt.utils.logging import logger
def build_trainer(opt, device_id, model, fields, optim, model_saver=None):
"""
Simplify `Trainer` creation based on user `opt`s.
Args:
opt (:obj:`Namespace`): user options (usually from argument parsing)
device_id (int): index of the GPU to use, or a negative value for CPU
model (:obj:`onmt.models.NMTModel`): the model to train
fields (dict): dict of fields
optim (:obj:`onmt.utils.Optimizer`): optimizer used during training
model_saver(:obj:`onmt.models.ModelSaverBase`): the utility object
used to save the model
"""
tgt_field = dict(fields)["tgt"].base_field
train_loss = onmt.utils.loss.build_loss_compute(model, tgt_field, opt)
valid_loss = onmt.utils.loss.build_loss_compute(
model, tgt_field, opt, train=False)
trunc_size = opt.truncated_decoder # Badly named...
shard_size = opt.max_generator_batches if opt.model_dtype == 'fp32' else 0
norm_method = opt.normalization
accum_count = opt.accum_count
accum_steps = opt.accum_steps
n_gpu = opt.world_size
average_decay = opt.average_decay
average_every = opt.average_every
dropout = opt.dropout
dropout_steps = opt.dropout_steps
if device_id >= 0:
gpu_rank = opt.gpu_ranks[device_id]
else:
gpu_rank = 0
n_gpu = 0
gpu_verbose_level = opt.gpu_verbose_level
earlystopper = onmt.utils.EarlyStopping(
opt.early_stopping, scorers=onmt.utils.scorers_from_opts(opt)) \
if opt.early_stopping > 0 else None
report_manager = onmt.utils.build_report_manager(opt, gpu_rank)
trainer = onmt.Trainer(model, train_loss, valid_loss, optim, trunc_size,
shard_size, norm_method,
accum_count, accum_steps,
n_gpu, gpu_rank,
gpu_verbose_level, report_manager,
with_align=opt.lambda_align > 0,
model_saver=model_saver if gpu_rank == 0 else None,
average_decay=average_decay,
average_every=average_every,
model_dtype=opt.model_dtype,
earlystopper=earlystopper,
dropout=dropout,
dropout_steps=dropout_steps)
return trainer
class Trainer(object):
"""
Class that controls the training process.
Args:
model(:py:class:`onmt.models.model.NMTModel`): translation model
to train
train_loss(:obj:`onmt.utils.loss.LossComputeBase`):
training loss computation
valid_loss(:obj:`onmt.utils.loss.LossComputeBase`):
validation loss computation
optim(:obj:`onmt.utils.optimizers.Optimizer`):
the optimizer responsible for update
trunc_size(int): length of truncated back propagation through time
shard_size(int): compute loss in shards of this size for efficiency
norm_method(string): normalization methods: [sents|tokens]
accum_count(list): accumulate gradients this many times.
accum_steps(list): steps for accum gradients changes.
report_manager(:obj:`onmt.utils.ReportMgrBase`):
the object that creates reports, or None
model_saver(:obj:`onmt.models.ModelSaverBase`): the saver is
used to save a checkpoint.
Thus nothing will be saved if this parameter is None
"""
def __init__(self, model, train_loss, valid_loss, optim,
trunc_size=0, shard_size=32,
norm_method="sents", accum_count=[1],
accum_steps=[0],
n_gpu=1, gpu_rank=1, gpu_verbose_level=0,
report_manager=None, with_align=False, model_saver=None,
average_decay=0, average_every=1, model_dtype='fp32',
earlystopper=None, dropout=[0.3], dropout_steps=[0]):
# Basic attributes.
self.model = model
self.train_loss = train_loss
self.valid_loss = valid_loss
self.optim = optim
self.trunc_size = trunc_size
self.shard_size = shard_size
self.norm_method = norm_method
self.accum_count_l = accum_count
self.accum_count = accum_count[0]
self.accum_steps = accum_steps
self.n_gpu = n_gpu
self.gpu_rank = gpu_rank
self.gpu_verbose_level = gpu_verbose_level
self.report_manager = report_manager
self.with_align = with_align
self.model_saver = model_saver
self.average_decay = average_decay
self.moving_average = None
self.average_every = average_every
self.model_dtype = model_dtype
self.earlystopper = earlystopper
self.dropout = dropout
self.dropout_steps = dropout_steps
for i in range(len(self.accum_count_l)):
assert self.accum_count_l[i] > 0
if self.accum_count_l[i] > 1:
assert self.trunc_size == 0, \
"""To enable accumulated gradients,
you must disable target sequence truncating."""
# Set model in training mode.
self.model.train()
def _accum_count(self, step):
for i in range(len(self.accum_steps)):
if step > self.accum_steps[i]:
_accum = self.accum_count_l[i]
return _accum
def _maybe_update_dropout(self, step):
for i in range(len(self.dropout_steps)):
if step > 1 and step == self.dropout_steps[i] + 1:
self.model.update_dropout(self.dropout[i])
logger.info("Updated dropout to %f from step %d"
% (self.dropout[i], step))
def _accum_batches(self, iterator):
batches = []
normalization = 0
self.accum_count = self._accum_count(self.optim.training_step)
for batch in iterator:
batches.append(batch)
if self.norm_method == "tokens":
num_tokens = batch.tgt[1:, :, 0].ne(
self.train_loss.padding_idx).sum()
normalization += num_tokens.item()
else:
normalization += batch.batch_size
if len(batches) == self.accum_count:
yield batches, normalization
self.accum_count = self._accum_count(self.optim.training_step)
batches = []
normalization = 0
if batches:
yield batches, normalization
def _update_average(self, step):
if self.moving_average is None:
copy_params = [params.detach().float()
for params in self.model.parameters()]
self.moving_average = copy_params
else:
average_decay = max(self.average_decay,
1 - (step + 1)/(step + 10))
for (i, avg), cpt in zip(enumerate(self.moving_average),
self.model.parameters()):
self.moving_average[i] = \
(1 - average_decay) * avg + \
cpt.detach().float() * average_decay
def train(self,
train_iter,
train_steps,
save_checkpoint_steps=5000,
valid_iter=None,
valid_steps=10000):
"""
The main training loop by iterating over `train_iter` and possibly
running validation on `valid_iter`.
Args:
train_iter: A generator that returns the next training batch.
train_steps: Run training for this many iterations.
save_checkpoint_steps: Save a checkpoint every this many
iterations.
valid_iter: A generator that returns the next validation batch.
valid_steps: Run evaluation every this many iterations.
Returns:
The gathered statistics.
"""
if valid_iter is None:
logger.info('Start training loop without validation...')
else:
logger.info('Start training loop and validate every %d steps...',
valid_steps)
total_stats = onmt.utils.Statistics()
report_stats = onmt.utils.Statistics()
self._start_report_manager(start_time=total_stats.start_time)
for i, (batches, normalization) in enumerate(
self._accum_batches(train_iter)):
step = self.optim.training_step
# UPDATE DROPOUT
self._maybe_update_dropout(step)
if self.gpu_verbose_level > 1:
logger.info("GpuRank %d: index: %d", self.gpu_rank, i)
if self.gpu_verbose_level > 0:
logger.info("GpuRank %d: reduce_counter: %d "
"n_minibatch %d"
% (self.gpu_rank, i + 1, len(batches)))
if self.n_gpu > 1:
normalization = sum(onmt.utils.distributed
.all_gather_list
(normalization))
self._gradient_accumulation(
batches, normalization, total_stats,
report_stats)
if self.average_decay > 0 and i % self.average_every == 0:
self._update_average(step)
report_stats = self._maybe_report_training(
step, train_steps,
self.optim.learning_rate(),
report_stats)
if valid_iter is not None and step % valid_steps == 0:
if self.gpu_verbose_level > 0:
logger.info('GpuRank %d: validate step %d'
% (self.gpu_rank, step))
valid_stats = self.validate(
valid_iter, moving_average=self.moving_average)
if self.gpu_verbose_level > 0:
logger.info('GpuRank %d: gather valid stat '
'step %d' % (self.gpu_rank, step))
valid_stats = self._maybe_gather_stats(valid_stats)
if self.gpu_verbose_level > 0:
logger.info('GpuRank %d: report stat step %d'
% (self.gpu_rank, step))
self._report_step(self.optim.learning_rate(),
step, valid_stats=valid_stats)
# Run patience mechanism
if self.earlystopper is not None:
self.earlystopper(valid_stats, step)
# If the patience has reached the limit, stop training
if self.earlystopper.has_stopped():
break
if (self.model_saver is not None
and (save_checkpoint_steps != 0
and step % save_checkpoint_steps == 0)):
self.model_saver.save(step, moving_average=self.moving_average)
if train_steps > 0 and step >= train_steps:
break
if self.model_saver is not None:
self.model_saver.save(step, moving_average=self.moving_average)
return total_stats
def validate(self, valid_iter, moving_average=None):
""" Validate model.
Args:
valid_iter: validation data iterator
moving_average: if set, validate with the averaged parameters
Returns:
:obj:`onmt.utils.Statistics`: validation loss statistics
"""
valid_model = self.model
if moving_average:
# swap model params w/ moving average
# (and keep the original parameters)
model_params_data = []
for avg, param in zip(self.moving_average,
valid_model.parameters()):
model_params_data.append(param.data)
param.data = avg.data.half() if self.optim._fp16 == "legacy" \
else avg.data
# Set model in validating mode.
valid_model.eval()
with torch.no_grad():
stats = onmt.utils.Statistics()
for batch in valid_iter:
src, src_lengths = batch.src if isinstance(batch.src, tuple) \
else (batch.src, None)
tgt = batch.tgt
# F-prop through the model.
outputs, attns = valid_model(src, tgt, src_lengths,
with_align=self.with_align)
# Compute loss.
_, batch_stats = self.valid_loss(batch, outputs, attns)
# Update statistics.
stats.update(batch_stats)
if moving_average:
for param_data, param in zip(model_params_data,
self.model.parameters()):
param.data = param_data
# Set model back to training mode.
valid_model.train()
return stats
def _gradient_accumulation(self, true_batches, normalization, total_stats,
report_stats):
if self.accum_count > 1:
self.optim.zero_grad()
for k, batch in enumerate(true_batches):
target_size = batch.tgt.size(0)
# Truncated BPTT: reminder not compatible with accum > 1
if self.trunc_size:
trunc_size = self.trunc_size
else:
trunc_size = target_size
src, src_lengths = batch.src if isinstance(batch.src, tuple) \
else (batch.src, None)
if src_lengths is not None:
report_stats.n_src_words += src_lengths.sum().item()
tgt_outer = batch.tgt
bptt = False
for j in range(0, target_size-1, trunc_size):
# 1. Create truncated target.
tgt = tgt_outer[j: j + trunc_size]
# 2. F-prop all but generator.
if self.accum_count == 1:
self.optim.zero_grad()
outputs, attns = self.model(src, tgt, src_lengths, bptt=bptt,
with_align=self.with_align)
bptt = True
# 3. Compute loss.
try:
loss, batch_stats = self.train_loss(
batch,
outputs,
attns,
normalization=normalization,
shard_size=self.shard_size,
trunc_start=j,
trunc_size=trunc_size)
if loss is not None:
self.optim.backward(loss)
total_stats.update(batch_stats)
report_stats.update(batch_stats)
except Exception:
traceback.print_exc()
logger.info("At step %d, we removed a batch - accum %d",
self.optim.training_step, k)
# 4. Update the parameters and statistics.
if self.accum_count == 1:
# Multi GPU gradient gather
if self.n_gpu > 1:
grads = [p.grad.data for p in self.model.parameters()
if p.requires_grad
and p.grad is not None]
onmt.utils.distributed.all_reduce_and_rescale_tensors(
grads, float(1))
self.optim.step()
# If truncated, don't backprop fully.
# TO CHECK
# if dec_state is not None:
# dec_state.detach()
if self.model.decoder.state is not None:
self.model.decoder.detach_state()
# in case of multi step gradient accumulation,
# update only after accum batches
if self.accum_count > 1:
if self.n_gpu > 1:
grads = [p.grad.data for p in self.model.parameters()
if p.requires_grad
and p.grad is not None]
onmt.utils.distributed.all_reduce_and_rescale_tensors(
grads, float(1))
self.optim.step()
def _start_report_manager(self, start_time=None):
"""
Simple function to start report manager (if any)
"""
if self.report_manager is not None:
if start_time is None:
self.report_manager.start()
else:
self.report_manager.start_time = start_time
def _maybe_gather_stats(self, stat):
"""
Gather statistics in multi-processes cases
Args:
stat (:obj:`onmt.utils.Statistics`): a Statistics object to gather
or None (it returns None in this case)
Returns:
stat: the updated (or unchanged) stat object
"""
if stat is not None and self.n_gpu > 1:
return onmt.utils.Statistics.all_gather_stats(stat)
return stat
def _maybe_report_training(self, step, num_steps, learning_rate,
report_stats):
"""
Simple function to report training stats (if report_manager is set)
see `onmt.utils.ReportManagerBase.report_training` for doc
"""
if self.report_manager is not None:
return self.report_manager.report_training(
step, num_steps, learning_rate, report_stats,
multigpu=self.n_gpu > 1)
def _report_step(self, learning_rate, step, train_stats=None,
valid_stats=None):
"""
Simple function to report stats (if report_manager is set)
see `onmt.utils.ReportManagerBase.report_step` for doc
"""
if self.report_manager is not None:
return self.report_manager.report_step(
learning_rate, step, train_stats=train_stats,
valid_stats=valid_stats)
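# Illustrative sketch, not part of the trainer: the accumulation pattern used
# by ``_gradient_accumulation`` above (zero the gradients, run backward on
# ``accum_count`` batches, then take one optimizer step) can be pictured with a
# one-parameter model. The names ``accumulate_and_step``, ``weight`` and ``lr``
# are hypothetical, and the squared-error loss is an assumption chosen only to
# keep the arithmetic easy to follow.

```python
def accumulate_and_step(weight, batches, accum_count, lr):
    """Plain SGD on loss = (weight * x - y) ** 2 with gradient accumulation.

    ``batches`` is a list of (x, y) pairs. One update is applied after every
    ``accum_count`` batches; leftover batches at the end are ignored here.
    """
    grad = 0.0
    steps = 0
    for k, (x, y) in enumerate(batches, start=1):
        # "backward": add this batch's gradient to the running total
        grad += 2.0 * (weight * x - y) * x
        if k % accum_count == 0:
            weight -= lr * grad  # plays the role of optim.step()
            grad = 0.0           # plays the role of optim.zero_grad()
            steps += 1
    return weight, steps


if __name__ == "__main__":
    data = [(1.0, 2.0), (1.0, 2.0)]
    print(accumulate_and_step(0.0, data, accum_count=2, lr=0.1))
    print(accumulate_and_step(0.0, data, accum_count=1, lr=0.1))
```

# With ``accum_count=2`` both gradients are computed at the same weight and
# applied in a single step; with ``accum_count=1`` the second gradient already
# sees the updated weight, so the two runs end at different values.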
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/__init__.py
================================================
""" Modules for translation """
from onmt.translate.translator import Translator
from onmt.translate.translation import Translation, TranslationBuilder
from onmt.translate.beam_search import BeamSearch, GNMTGlobalScorer
from onmt.translate.decode_strategy import DecodeStrategy
from onmt.translate.greedy_search import GreedySearch
from onmt.translate.penalties import PenaltyBuilder
from onmt.translate.translation_server import TranslationServer, \
ServerModelError
__all__ = ['Translator', 'Translation', 'BeamSearch',
'GNMTGlobalScorer', 'TranslationBuilder',
'PenaltyBuilder', 'TranslationServer', 'ServerModelError',
"DecodeStrategy", "GreedySearch"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/beam_search.py
================================================
import torch
from onmt.translate import penalties
from onmt.translate.decode_strategy import DecodeStrategy
from onmt.utils.misc import tile
import warnings
class BeamSearch(DecodeStrategy):
"""Generation beam search.
Note that the attributes list is not exhaustive. Rather, it highlights
tensors to document their shape. (Since the state variables' "batch"
size decreases as beams finish, we denote this axis with a B rather than
``batch_size``).
Args:
beam_size (int): Number of beams to use (see base ``parallel_paths``).
batch_size (int): See base.
pad (int): See base.
bos (int): See base.
eos (int): See base.
n_best (int): Don't stop until at least this many beams have
reached EOS.
global_scorer (onmt.translate.GNMTGlobalScorer): Scorer instance.
min_length (int): See base.
max_length (int): See base.
return_attention (bool): See base.
block_ngram_repeat (int): See base.
exclusion_tokens (set[int]): See base.
Attributes:
top_beam_finished (ByteTensor): Shape ``(B,)``.
_batch_offset (LongTensor): Shape ``(B,)``.
_beam_offset (LongTensor): Shape ``(batch_size,)``. Holds the flat
index of the first beam of each batch entry.
alive_seq (LongTensor): See base.
topk_log_probs (FloatTensor): Shape ``(B x beam_size,)``. These
are the scores used for the topk operation.
memory_lengths (LongTensor): Lengths of encodings. Used for
masking attentions.
select_indices (LongTensor or NoneType): Shape
``(B x beam_size,)``. This is just a flat view of the
``_batch_index``.
topk_scores (FloatTensor): Shape
``(B, beam_size)``. These are the
scores a sequence will receive if it finishes.
topk_ids (LongTensor): Shape ``(B, beam_size)``. These are the
word indices of the topk predictions.
_batch_index (LongTensor): Shape ``(B, beam_size)``.
_prev_penalty (FloatTensor or NoneType): Shape
``(B, beam_size)``. Initialized to ``None``.
_coverage (FloatTensor or NoneType): Shape
``(1, B x beam_size, inp_seq_len)``.
hypotheses (list[list[Tuple[Tensor]]]): Contains a tuple
of score (float), sequence (long), and attention (float or None).
"""
def __init__(self, beam_size, batch_size, pad, bos, eos, n_best,
global_scorer, min_length, max_length, return_attention,
block_ngram_repeat, exclusion_tokens,
stepwise_penalty, ratio):
super(BeamSearch, self).__init__(
pad, bos, eos, batch_size, beam_size, min_length,
block_ngram_repeat, exclusion_tokens, return_attention,
max_length)
# beam parameters
self.global_scorer = global_scorer
self.beam_size = beam_size
self.n_best = n_best
self.ratio = ratio
# result caching
self.hypotheses = [[] for _ in range(batch_size)]
# beam state
self.top_beam_finished = torch.zeros([batch_size], dtype=torch.uint8)
# BoolTensor was introduced in pytorch 1.2
try:
self.top_beam_finished = self.top_beam_finished.bool()
except AttributeError:
pass
self._batch_offset = torch.arange(batch_size, dtype=torch.long)
self.select_indices = None
self.done = False
# "global state" of the old beam
self._prev_penalty = None
self._coverage = None
self._stepwise_cov_pen = (
stepwise_penalty and self.global_scorer.has_cov_pen)
self._vanilla_cov_pen = (
not stepwise_penalty and self.global_scorer.has_cov_pen)
self._cov_pen = self.global_scorer.has_cov_pen
def initialize(self, memory_bank, src_lengths, src_map=None, device=None):
"""Initialize for decoding.
Repeat src objects `beam_size` times.
"""
def fn_map_state(state, dim):
return tile(state, self.beam_size, dim=dim)
if isinstance(memory_bank, tuple):
memory_bank = tuple(tile(x, self.beam_size, dim=1)
for x in memory_bank)
mb_device = memory_bank[0].device
else:
memory_bank = tile(memory_bank, self.beam_size, dim=1)
mb_device = memory_bank.device
if src_map is not None:
src_map = tile(src_map, self.beam_size, dim=1)
if device is None:
device = mb_device
self.memory_lengths = tile(src_lengths, self.beam_size)
super(BeamSearch, self).initialize(
memory_bank, self.memory_lengths, src_map, device)
self.best_scores = torch.full(
[self.batch_size], -1e10, dtype=torch.float, device=device)
self._beam_offset = torch.arange(
0, self.batch_size * self.beam_size, step=self.beam_size,
dtype=torch.long, device=device)
self.topk_log_probs = torch.tensor(
[0.0] + [float("-inf")] * (self.beam_size - 1), device=device
).repeat(self.batch_size)
# buffers for the topk scores and 'backpointer'
self.topk_scores = torch.empty((self.batch_size, self.beam_size),
dtype=torch.float, device=device)
self.topk_ids = torch.empty((self.batch_size, self.beam_size),
dtype=torch.long, device=device)
self._batch_index = torch.empty([self.batch_size, self.beam_size],
dtype=torch.long, device=device)
return fn_map_state, memory_bank, self.memory_lengths, src_map
@property
def current_predictions(self):
return self.alive_seq[:, -1]
@property
def current_backptr(self):
# for testing
return self.select_indices.view(self.batch_size, self.beam_size)\
.fmod(self.beam_size)
@property
def batch_offset(self):
return self._batch_offset
def advance(self, log_probs, attn):
vocab_size = log_probs.size(-1)
# using integer division to get an integer _B without casting
_B = log_probs.shape[0] // self.beam_size
if self._stepwise_cov_pen and self._prev_penalty is not None:
self.topk_log_probs += self._prev_penalty
self.topk_log_probs -= self.global_scorer.cov_penalty(
self._coverage + attn, self.global_scorer.beta).view(
_B, self.beam_size)
# force the output to be longer than self.min_length
step = len(self)
self.ensure_min_length(log_probs)
# Multiply probs by the beam probability (an addition in log space).
log_probs += self.topk_log_probs.view(_B * self.beam_size, 1)
self.block_ngram_repeats(log_probs)
# if the sequence ends now, then the penalty is the current
# length + 1, to include the EOS token
length_penalty = self.global_scorer.length_penalty(
step + 1, alpha=self.global_scorer.alpha)
# Flatten probs into a list of possibilities.
curr_scores = log_probs / length_penalty
curr_scores = curr_scores.reshape(_B, self.beam_size * vocab_size)
torch.topk(curr_scores, self.beam_size, dim=-1,
out=(self.topk_scores, self.topk_ids))
# Recover log probs.
# Length penalty is just a scalar. It doesn't matter if it's applied
# before or after the topk.
torch.mul(self.topk_scores, length_penalty, out=self.topk_log_probs)
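# Resolve beam origin and map to batch index flat representation.
# Use integer floor division: on recent PyTorch versions ``torch.div``
# performs true division and cannot write its result into a long buffer.
self._batch_index = self.topk_ids // vocab_size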
self._batch_index += self._beam_offset[:_B].unsqueeze(1)
self.select_indices = self._batch_index.view(_B * self.beam_size)
self.topk_ids.fmod_(vocab_size) # resolve true word ids
# Append last prediction.
self.alive_seq = torch.cat(
[self.alive_seq.index_select(0, self.select_indices),
self.topk_ids.view(_B * self.beam_size, 1)], -1)
if self.return_attention or self._cov_pen:
current_attn = attn.index_select(1, self.select_indices)
if step == 1:
self.alive_attn = current_attn
# update global state (step == 1)
if self._cov_pen: # coverage penalty
self._prev_penalty = torch.zeros_like(self.topk_log_probs)
self._coverage = current_attn
else:
self.alive_attn = self.alive_attn.index_select(
1, self.select_indices)
self.alive_attn = torch.cat([self.alive_attn, current_attn], 0)
# update global state (step > 1)
if self._cov_pen:
self._coverage = self._coverage.index_select(
1, self.select_indices)
self._coverage += current_attn
self._prev_penalty = self.global_scorer.cov_penalty(
self._coverage, beta=self.global_scorer.beta).view(
_B, self.beam_size)
if self._vanilla_cov_pen:
# shape: (batch_size x beam_size, 1)
cov_penalty = self.global_scorer.cov_penalty(
self._coverage,
beta=self.global_scorer.beta)
self.topk_scores -= cov_penalty.view(_B, self.beam_size).float()
self.is_finished = self.topk_ids.eq(self.eos)
self.ensure_max_length()
def update_finished(self):
# Penalize beams that finished.
_B_old = self.topk_log_probs.shape[0]
step = self.alive_seq.shape[-1] # 1 greater than the step in advance
self.topk_log_probs.masked_fill_(self.is_finished, -1e10)
# on real data (newstest2017) with the pretrained transformer,
# it's faster to not move this back to the original device
self.is_finished = self.is_finished.to('cpu')
self.top_beam_finished |= self.is_finished[:, 0].eq(1)
predictions = self.alive_seq.view(_B_old, self.beam_size, step)
attention = (
self.alive_attn.view(
step - 1, _B_old, self.beam_size, self.alive_attn.size(-1))
if self.alive_attn is not None else None)
non_finished_batch = []
for i in range(self.is_finished.size(0)): # Batch level
b = self._batch_offset[i]
finished_hyp = self.is_finished[i].nonzero().view(-1)
# Store finished hypotheses for this batch.
for j in finished_hyp: # Beam level: finished beam j in batch i
if self.ratio > 0:
s = self.topk_scores[i, j] / (step + 1)
if self.best_scores[b] < s:
self.best_scores[b] = s
self.hypotheses[b].append((
self.topk_scores[i, j],
predictions[i, j, 1:], # Ignore start_token.
attention[:, i, j, :self.memory_lengths[i]]
if attention is not None else None))
# End condition is the top beam finished and we can return
# n_best hypotheses.
if self.ratio > 0:
pred_len = self.memory_lengths[i] * self.ratio
finish_flag = ((self.topk_scores[i, 0] / pred_len)
<= self.best_scores[b]) or \
self.is_finished[i].all()
else:
finish_flag = self.top_beam_finished[i] != 0
if finish_flag and len(self.hypotheses[b]) >= self.n_best:
best_hyp = sorted(
self.hypotheses[b], key=lambda x: x[0], reverse=True)
for n, (score, pred, attn) in enumerate(best_hyp):
if n >= self.n_best:
break
self.scores[b].append(score)
self.predictions[b].append(pred) # ``(batch, n_best,)``
self.attention[b].append(
attn if attn is not None else [])
else:
non_finished_batch.append(i)
non_finished = torch.tensor(non_finished_batch)
# If all sentences are translated, no need to go further.
if len(non_finished) == 0:
self.done = True
return
_B_new = non_finished.shape[0]
# Remove finished batches for the next step.
self.top_beam_finished = self.top_beam_finished.index_select(
0, non_finished)
self._batch_offset = self._batch_offset.index_select(0, non_finished)
non_finished = non_finished.to(self.topk_ids.device)
self.topk_log_probs = self.topk_log_probs.index_select(0,
non_finished)
self._batch_index = self._batch_index.index_select(0, non_finished)
self.select_indices = self._batch_index.view(_B_new * self.beam_size)
self.alive_seq = predictions.index_select(0, non_finished) \
.view(-1, self.alive_seq.size(-1))
self.topk_scores = self.topk_scores.index_select(0, non_finished)
self.topk_ids = self.topk_ids.index_select(0, non_finished)
if self.alive_attn is not None:
inp_seq_len = self.alive_attn.size(-1)
self.alive_attn = attention.index_select(1, non_finished) \
.view(step - 1, _B_new * self.beam_size, inp_seq_len)
if self._cov_pen:
self._coverage = self._coverage \
.view(1, _B_old, self.beam_size, inp_seq_len) \
.index_select(1, non_finished) \
.view(1, _B_new * self.beam_size, inp_seq_len)
if self._stepwise_cov_pen:
self._prev_penalty = self._prev_penalty.index_select(
0, non_finished)
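# Illustrative sketch, not part of the module: ``BeamSearch.advance`` flattens
# the ``(beam_size, vocab_size)`` scores of one batch entry into a single row,
# takes the top ``beam_size`` entries, and then recovers the originating beam
# with an integer division and the word id with a modulo. The pure-Python
# helper below shows only that index arithmetic; ``topk_over_beams`` is a
# hypothetical name, and lists stand in for tensors.

```python
import heapq


def topk_over_beams(scores, beam_size):
    """Pick the best ``beam_size`` continuations for one batch entry.

    ``scores`` is a list with one row per live beam; each row holds one
    score per vocabulary word. Returns (beam_origin, word_id, score) tuples,
    best first.
    """
    vocab_size = len(scores[0])
    # Flatten (beam, vocab) into a single list of candidates.
    flat = [s for row in scores for s in row]
    top = heapq.nlargest(beam_size, range(len(flat)), key=flat.__getitem__)
    # flat index -> (beam it came from, word it predicts)
    return [(i // vocab_size, i % vocab_size, flat[i]) for i in top]


if __name__ == "__main__":
    beams = [[-1.0, -3.0, -0.5],
             [-2.0, -0.1, -4.0]]
    for origin, word, score in topk_over_beams(beams, beam_size=2):
        print(origin, word, score)
```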
class GNMTGlobalScorer(object):
"""NMT re-ranking.
Args:
alpha (float): Length parameter.
beta (float): Coverage parameter.
length_penalty (str): Length penalty strategy.
coverage_penalty (str): Coverage penalty strategy.
Attributes:
alpha (float): See above.
beta (float): See above.
length_penalty (callable): See :class:`penalties.PenaltyBuilder`.
coverage_penalty (callable): See :class:`penalties.PenaltyBuilder`.
has_cov_pen (bool): See :class:`penalties.PenaltyBuilder`.
has_len_pen (bool): See :class:`penalties.PenaltyBuilder`.
"""
@classmethod
def from_opt(cls, opt):
return cls(
opt.alpha,
opt.beta,
opt.length_penalty,
opt.coverage_penalty)
def __init__(self, alpha, beta, length_penalty, coverage_penalty):
self._validate(alpha, beta, length_penalty, coverage_penalty)
self.alpha = alpha
self.beta = beta
penalty_builder = penalties.PenaltyBuilder(coverage_penalty,
length_penalty)
self.has_cov_pen = penalty_builder.has_cov_pen
# Term will be subtracted from probability
self.cov_penalty = penalty_builder.coverage_penalty
self.has_len_pen = penalty_builder.has_len_pen
# Probability will be divided by this
self.length_penalty = penalty_builder.length_penalty
@classmethod
def _validate(cls, alpha, beta, length_penalty, coverage_penalty):
# these warnings indicate that either the alpha/beta
# forces a penalty to be a no-op, or a penalty is a no-op but
# the alpha/beta would suggest otherwise.
if length_penalty is None or length_penalty == "none":
if alpha != 0:
warnings.warn("Non-default `alpha` with no length penalty. "
"`alpha` has no effect.")
else:
# using some length penalty
if length_penalty == "wu" and alpha == 0.:
warnings.warn("Using length penalty Wu with alpha==0 "
"is equivalent to using length penalty none.")
if coverage_penalty is None or coverage_penalty == "none":
if beta != 0:
warnings.warn("Non-default `beta` with no coverage penalty. "
"`beta` has no effect.")
else:
# using some coverage penalty
if beta == 0.:
warnings.warn("Non-default coverage penalty with beta==0 "
"is equivalent to using coverage penalty none.")
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/decode_strategy.py
================================================
import torch
class DecodeStrategy(object):
"""Base class for generation strategies.
Args:
pad (int): Magic integer in output vocab.
bos (int): Magic integer in output vocab.
eos (int): Magic integer in output vocab.
batch_size (int): Current batch size.
parallel_paths (int): Decoding strategies like beam search
use parallel paths. Each batch is repeated ``parallel_paths``
times in relevant state tensors.
min_length (int): Shortest acceptable generation, not counting
begin-of-sentence or end-of-sentence.
max_length (int): Longest acceptable sequence, not counting
begin-of-sentence (presumably there has been no EOS
yet if max_length is used as a cutoff).
block_ngram_repeat (int): Block beams where
``block_ngram_repeat``-grams repeat.
exclusion_tokens (set[int]): If a gram contains any of these
tokens, it may repeat.
return_attention (bool): Whether to work with attention too. If this
is true, it is assumed that the decoder is attentional.
Attributes:
pad (int): See above.
bos (int): See above.
eos (int): See above.
predictions (list[list[LongTensor]]): For each batch, holds a
list of beam prediction sequences.
scores (list[list[FloatTensor]]): For each batch, holds a
list of scores.
attention (list[list[FloatTensor or list[]]]): For each
batch, holds a list of attention sequence tensors
(or empty lists) having shape ``(step, inp_seq_len)`` where
``inp_seq_len`` is the length of the sample (not the max
length of all inp seqs).
alive_seq (LongTensor): Shape ``(B x parallel_paths, step)``.
This sequence grows in the ``step`` axis on each call to
:func:`advance()`.
is_finished (ByteTensor or NoneType): Shape
``(B, parallel_paths)``. Initialized to ``None``.
alive_attn (FloatTensor or NoneType): If tensor, shape is
``(step, B x parallel_paths, inp_seq_len)``, where ``inp_seq_len``
is the (max) length of the input sequence.
min_length (int): See above.
max_length (int): See above.
block_ngram_repeat (int): See above.
exclusion_tokens (set[int]): See above.
return_attention (bool): See above.
done (bool): Whether decoding has finished for every batch entry.
"""
def __init__(self, pad, bos, eos, batch_size, parallel_paths,
min_length, block_ngram_repeat, exclusion_tokens,
return_attention, max_length):
# magic indices
self.pad = pad
self.bos = bos
self.eos = eos
self.batch_size = batch_size
self.parallel_paths = parallel_paths
# result caching
self.predictions = [[] for _ in range(batch_size)]
self.scores = [[] for _ in range(batch_size)]
self.attention = [[] for _ in range(batch_size)]
self.alive_attn = None
self.min_length = min_length
self.max_length = max_length
self.block_ngram_repeat = block_ngram_repeat
self.exclusion_tokens = exclusion_tokens
self.return_attention = return_attention
self.done = False
def initialize(self, memory_bank, src_lengths, src_map=None, device=None):
"""DecodeStrategy subclasses should override :func:`initialize()`.
`initialize` should be called before any other action. It prepares
the state needed for decoding.
"""
if device is None:
device = torch.device('cpu')
self.alive_seq = torch.full(
[self.batch_size * self.parallel_paths, 1], self.bos,
dtype=torch.long, device=device)
self.is_finished = torch.zeros(
[self.batch_size, self.parallel_paths],
dtype=torch.uint8, device=device)
return None, memory_bank, src_lengths, src_map
def __len__(self):
return self.alive_seq.shape[1]
def ensure_min_length(self, log_probs):
if len(self) <= self.min_length:
log_probs[:, self.eos] = -1e20
def ensure_max_length(self):
# add one to account for BOS. Don't account for EOS because hitting
# this implies it hasn't been found.
if len(self) == self.max_length + 1:
self.is_finished.fill_(1)
def block_ngram_repeats(self, log_probs):
cur_len = len(self)
if self.block_ngram_repeat > 0 and cur_len > 1:
for path_idx in range(self.alive_seq.shape[0]):
# skip BOS
hyp = self.alive_seq[path_idx, 1:]
ngrams = set()
fail = False
gram = []
for i in range(cur_len - 1):
# Last n tokens, n = block_ngram_repeat
gram = (gram + [hyp[i].item()])[-self.block_ngram_repeat:]
# skip the blocking if any token in gram is excluded
if set(gram) & self.exclusion_tokens:
continue
if tuple(gram) in ngrams:
fail = True
ngrams.add(tuple(gram))
if fail:
log_probs[path_idx] = -10e20
def advance(self, log_probs, attn):
"""DecodeStrategy subclasses should override :func:`advance()`.
Advance is used to update ``self.alive_seq``, ``self.is_finished``,
and, when appropriate, ``self.alive_attn``.
"""
raise NotImplementedError()
def update_finished(self):
"""DecodeStrategy subclasses should override :func:`update_finished()`.
``update_finished`` is used to update ``self.predictions``,
``self.scores``, and other "output" attributes.
"""
raise NotImplementedError()
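# Illustrative sketch, not part of the module: the check performed by
# ``block_ngram_repeats`` above can be reduced to "has this hypothesis already
# produced the same n-gram twice, ignoring n-grams that contain an excluded
# token?". ``has_repeated_ngram`` is a hypothetical helper working on plain
# lists of token ids; unlike the method above it only reports the repeat and
# does not touch any log-probabilities.

```python
def has_repeated_ngram(tokens, n, exclusion_tokens=frozenset()):
    """Return True if any n-gram occurs more than once in ``tokens``.

    N-grams containing a token from ``exclusion_tokens`` are allowed to
    repeat and are therefore skipped.
    """
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if exclusion_tokens & set(gram):
            continue
        if gram in seen:
            return True
        seen.add(gram)
    return False


if __name__ == "__main__":
    print(has_repeated_ngram([1, 2, 3, 1, 2], n=2))
    print(has_repeated_ngram([1, 2, 3, 4], n=2))
    print(has_repeated_ngram([1, 2, 3, 1, 2], n=2, exclusion_tokens={1}))
```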
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/greedy_search.py
================================================
import torch
from onmt.translate.decode_strategy import DecodeStrategy
def sample_with_temperature(logits, sampling_temp, keep_topk):
"""Select next tokens randomly from the top k possible next tokens.
Samples from a categorical distribution over the ``keep_topk`` words,
parameterized by the scaled logits ``logits / sampling_temp``.
Args:
logits (FloatTensor): Shaped ``(batch_size, vocab_size)``.
These can be logits (``(-inf, inf)``) or log-probs (``(-inf, 0]``).
(The distribution actually uses the log-probabilities
``logits - logits.logsumexp(-1)``, which equals the logits if
they are log-probabilities summing to 1.)
sampling_temp (float): Used to scale down logits. The higher the
value, the more likely it is that a non-max word will be
sampled.
keep_topk (int): This many words could potentially be chosen. The
other logits are set to have probability 0.
Returns:
(LongTensor, FloatTensor):
* topk_ids: Shaped ``(batch_size, 1)``. These are
the sampled word indices in the output vocab.
* topk_scores: Shaped ``(batch_size, 1)``. These
are essentially ``(logits / sampling_temp)[topk_ids]``.
"""
if sampling_temp == 0.0 or keep_topk == 1:
# For temp=0.0, take the argmax to avoid divide-by-zero errors.
# keep_topk=1 is also equivalent to argmax.
topk_scores, topk_ids = logits.topk(1, dim=-1)
if sampling_temp > 0:
topk_scores /= sampling_temp
else:
logits = torch.div(logits, sampling_temp)
if keep_topk > 0:
top_values, top_indices = torch.topk(logits, keep_topk, dim=1)
kth_best = top_values[:, -1].view([-1, 1])
kth_best = kth_best.repeat([1, logits.shape[1]]).float()
# Set all logits that are not in the top-k to -10000.
# This puts the probabilities close to 0.
ignore = torch.lt(logits, kth_best)
logits = logits.masked_fill(ignore, -10000)
dist = torch.distributions.Multinomial(
logits=logits, total_count=1)
topk_ids = torch.argmax(dist.sample(), dim=1, keepdim=True)
topk_scores = logits.gather(dim=1, index=topk_ids)
return topk_ids, topk_scores
class GreedySearch(DecodeStrategy):
"""Select next tokens randomly from the top k possible next tokens.
The ``scores`` attribute's lists are the score, after applying temperature,
of the final prediction (either EOS or the final token in the event
that ``max_length`` is reached)
Args:
pad (int): See base.
bos (int): See base.
eos (int): See base.
batch_size (int): See base.
min_length (int): See base.
block_ngram_repeat (int): See base.
exclusion_tokens (set[int]): See base.
return_attention (bool): See base.
max_length (int): See base.
sampling_temp (float): See
:func:`~onmt.translate.greedy_search.sample_with_temperature()`.
keep_topk (int): See
:func:`~onmt.translate.greedy_search.sample_with_temperature()`.
"""
def __init__(self, pad, bos, eos, batch_size, min_length,
block_ngram_repeat, exclusion_tokens, return_attention,
max_length, sampling_temp, keep_topk):
assert block_ngram_repeat == 0
super(GreedySearch, self).__init__(
pad, bos, eos, batch_size, 1, min_length, block_ngram_repeat,
exclusion_tokens, return_attention, max_length)
self.sampling_temp = sampling_temp
self.keep_topk = keep_topk
self.topk_scores = None
def initialize(self, memory_bank, src_lengths, src_map=None, device=None):
"""Initialize for decoding."""
fn_map_state = None
if isinstance(memory_bank, tuple):
mb_device = memory_bank[0].device
else:
mb_device = memory_bank.device
if device is None:
device = mb_device
self.memory_lengths = src_lengths
super(GreedySearch, self).initialize(
memory_bank, src_lengths, src_map, device)
self.select_indices = torch.arange(
self.batch_size, dtype=torch.long, device=device)
self.original_batch_idx = torch.arange(
self.batch_size, dtype=torch.long, device=device)
return fn_map_state, memory_bank, self.memory_lengths, src_map
@property
def current_predictions(self):
return self.alive_seq[:, -1]
@property
def batch_offset(self):
return self.select_indices
def advance(self, log_probs, attn):
"""Select next tokens randomly from the top k possible next tokens.
Args:
log_probs (FloatTensor): Shaped ``(batch_size, vocab_size)``.
These can be logits (``(-inf, inf)``) or log-probs
(``(-inf, 0]``). (The distribution actually uses the
log-probabilities ``logits - logits.logsumexp(-1)``,
which equals the logits if they are log-probabilities summing
to 1.)
attn (FloatTensor): Shaped ``(1, B, inp_seq_len)``.
"""
self.ensure_min_length(log_probs)
self.block_ngram_repeats(log_probs)
topk_ids, self.topk_scores = sample_with_temperature(
log_probs, self.sampling_temp, self.keep_topk)
self.is_finished = topk_ids.eq(self.eos)
self.alive_seq = torch.cat([self.alive_seq, topk_ids], -1)
if self.return_attention:
if self.alive_attn is None:
self.alive_attn = attn
else:
self.alive_attn = torch.cat([self.alive_attn, attn], 0)
self.ensure_max_length()
def update_finished(self):
"""Finalize scores and predictions."""
# shape: (sum(self.is_finished), 1)
finished_batches = self.is_finished.view(-1).nonzero()
for b in finished_batches.view(-1):
b_orig = self.original_batch_idx[b]
self.scores[b_orig].append(self.topk_scores[b, 0])
self.predictions[b_orig].append(self.alive_seq[b, 1:])
self.attention[b_orig].append(
self.alive_attn[:, b, :self.memory_lengths[b]]
if self.alive_attn is not None else [])
self.done = self.is_finished.all()
if self.done:
return
is_alive = ~self.is_finished.view(-1)
self.alive_seq = self.alive_seq[is_alive]
if self.alive_attn is not None:
self.alive_attn = self.alive_attn[:, is_alive]
self.select_indices = is_alive.nonzero().view(-1)
self.original_batch_idx = self.original_batch_idx[is_alive]
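# Illustrative sketch, not part of the module: the idea behind
# ``sample_with_temperature`` (scale the logits by a temperature, keep only the
# top-k candidates, then sample) can be written for a single example with the
# standard library. ``sample_top_k`` is a hypothetical name. As a simplifying
# assumption, candidates outside the top-k are dropped outright here, whereas
# the function above masks them with a large negative logit.

```python
import math
import random


def sample_top_k(logits, temperature, keep_topk, rng=random):
    """Sample one index from ``logits`` (a list of floats).

    ``temperature == 0`` or ``keep_topk == 1`` falls back to argmax.
    ``keep_topk <= 0`` means "keep every candidate".
    """
    if temperature == 0.0 or keep_topk == 1:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    if keep_topk > 0:
        k = min(keep_topk, len(scaled))
        kth_best = sorted(scaled, reverse=True)[k - 1]
        candidates = [i for i, x in enumerate(scaled) if x >= kth_best]
    else:
        candidates = list(range(len(scaled)))
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_top_k([0.1, 2.0, 0.3], 1.0, 2, rng) for _ in range(10)])
```

# A higher temperature flattens the weights, so lower-scoring candidates inside
# the top-k are drawn more often.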
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/penalties.py
================================================
from __future__ import division
import torch
class PenaltyBuilder(object):
"""Returns the Length and Coverage Penalty function for Beam Search.
Args:
cov_pen (str): option name of the coverage penalty
length_pen (str): option name of the length penalty
Attributes:
has_cov_pen (bool): Whether a coverage penalty is set, i.e. it is
not ``None``/``"none"`` (in which case applying it is a no-op).
Note that the converse isn't true: setting beta to 0 also forces
the coverage penalty to be a no-op.
has_len_pen (bool): Whether a length penalty is set, i.e. it is
not ``None``/``"none"`` (in which case applying it is a no-op).
Note that the converse isn't true: with the ``"wu"`` penalty,
setting alpha to 0 also forces the length penalty to be a no-op.
coverage_penalty (callable[[FloatTensor, float], FloatTensor]):
Calculates the coverage penalty.
length_penalty (callable[[int, float], float]): Calculates
the length penalty.
"""
def __init__(self, cov_pen, length_pen):
self.has_cov_pen = not self._pen_is_none(cov_pen)
self.coverage_penalty = self._coverage_penalty(cov_pen)
self.has_len_pen = not self._pen_is_none(length_pen)
self.length_penalty = self._length_penalty(length_pen)
@staticmethod
def _pen_is_none(pen):
return pen == "none" or pen is None
def _coverage_penalty(self, cov_pen):
if cov_pen == "wu":
return self.coverage_wu
elif cov_pen == "summary":
return self.coverage_summary
elif self._pen_is_none(cov_pen):
return self.coverage_none
else:
raise NotImplementedError("No '{:s}' coverage penalty.".format(
cov_pen))
def _length_penalty(self, length_pen):
if length_pen == "wu":
return self.length_wu
elif length_pen == "avg":
return self.length_average
elif self._pen_is_none(length_pen):
return self.length_none
else:
raise NotImplementedError("No '{:s}' length penalty.".format(
length_pen))
# Below are all the different penalty terms implemented so far.
# Subtract coverage penalty from topk log probs.
# Divide topk log probs by length penalty.
def coverage_wu(self, cov, beta=0.):
"""GNMT coverage re-ranking score.
See "Google's Neural Machine Translation System" :cite:`wu2016google`.
``cov`` is expected to be sized ``(*, seq_len)``, where ``*`` is
probably ``batch_size x beam_size`` but could be several
dimensions like ``(batch_size, beam_size)``. If ``cov`` is attention,
then the ``seq_len`` axis probably sums to (almost) 1.
"""
penalty = -torch.min(cov, cov.clone().fill_(1.0)).log().sum(-1)
return beta * penalty
def coverage_summary(self, cov, beta=0.):
"""Our summary penalty."""
penalty = torch.max(cov, cov.clone().fill_(1.0)).sum(-1)
penalty -= cov.size(-1)
return beta * penalty
def coverage_none(self, cov, beta=0.):
"""Returns zero as penalty"""
none = torch.zeros((1,), device=cov.device,
dtype=torch.float)
if cov.dim() == 3:
none = none.unsqueeze(0)
return none
def length_wu(self, cur_len, alpha=0.):
"""GNMT length re-ranking score.
See "Google's Neural Machine Translation System" :cite:`wu2016google`.
"""
return ((5 + cur_len) / 6.0) ** alpha
def length_average(self, cur_len, alpha=0.):
"""Returns the current sequence length."""
return cur_len
def length_none(self, cur_len, alpha=0.):
"""Returns 1.0, so dividing by it leaves the scores unmodified."""
return 1.0
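# Illustrative sketch, not part of the module: a worked example of the two
# "wu" penalties defined above, using plain floats and lists instead of
# tensors. ``rescore`` is a hypothetical helper that mirrors how beam search
# divides a log-probability by the length penalty; the list-based
# ``coverage_wu`` assumes every coverage value is strictly positive.

```python
import math


def length_wu(cur_len, alpha=0.0):
    """GNMT length penalty: ((5 + len) / 6) ** alpha."""
    return ((5 + cur_len) / 6.0) ** alpha


def coverage_wu(cov, beta=0.0):
    """GNMT coverage penalty for one hypothesis.

    ``cov`` holds the accumulated attention per source position.
    Positions with coverage >= 1 contribute nothing.
    """
    penalty = -sum(math.log(min(c, 1.0)) for c in cov)
    return beta * penalty


def rescore(log_prob, cur_len, alpha):
    """Length-normalized score: log-probability divided by the penalty."""
    return log_prob / length_wu(cur_len, alpha)


if __name__ == "__main__":
    # A longer hypothesis with a lower raw log-probability can win
    # once the length penalty is applied.
    print(rescore(-3.0, 1, alpha=1.0))   # short hypothesis
    print(rescore(-4.0, 7, alpha=1.0))   # long hypothesis
    # A half-attended source position is penalized, a fully attended one is not.
    print(coverage_wu([1.0, 0.5], beta=1.0))
```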
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/process_zh.py
================================================
from pyhanlp import HanLP
from snownlp import SnowNLP
import pkuseg
# Loading the pkuseg model is expensive, so the segmenter is built once on
# first use instead of on every call.
_SEGMENTER = None
# Chinese word segmentation
def zh_segmentator(line):
    global _SEGMENTER
    if _SEGMENTER is None:
        _SEGMENTER = pkuseg.pkuseg()
    return " ".join(_SEGMENTER.cut(line))
# Simplified Chinese -> Traditional Chinese (standard)
def zh_traditional_standard(line):
    return HanLP.convertToTraditionalChinese(line)
# Simplified Chinese -> Traditional Chinese (Hong Kong)
def zh_traditional_hk(line):
    return HanLP.s2hk(line)
# Simplified Chinese -> Traditional Chinese (Taiwan)
def zh_traditional_tw(line):
    return HanLP.s2tw(line)
# Traditional Chinese -> Simplified Chinese (v1, HanLP)
def zh_simplify(line):
    return HanLP.convertToSimplifiedChinese(line)
# Traditional Chinese -> Simplified Chinese (v2, SnowNLP)
def zh_simplify_v2(line):
    return SnowNLP(line).han
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/translation.py
================================================
""" Translation main class """
from __future__ import unicode_literals, print_function
import torch
from onmt.inputters.text_dataset import TextMultiField
from onmt.utils.alignment import build_align_pharaoh
class TranslationBuilder(object):
"""
Build a word-based translation from the batch output
of translator and the underlying dictionaries.
Replacement based on "Addressing the Rare Word
Problem in Neural Machine Translation" :cite:`Luong2015b`
Args:
data (onmt.inputters.Dataset): Data.
fields (List[Tuple[str, torchtext.data.Field]]): data fields
n_best (int): number of translations produced
replace_unk (bool): replace unknown words using attention
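has_tgt (bool): will the batch have gold targets
phrase_table (str): path to a phrase table file (``|||``-separated
lines) consulted when replacing unknown words; ``""`` disables it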
"""
def __init__(self, data, fields, n_best=1, replace_unk=False,
has_tgt=False, phrase_table=""):
self.data = data
self.fields = fields
self._has_text_src = isinstance(
dict(self.fields)["src"], TextMultiField)
self.n_best = n_best
self.replace_unk = replace_unk
self.phrase_table = phrase_table
self.has_tgt = has_tgt
def _build_target_tokens(self, src, src_vocab, src_raw, pred, attn):
tgt_field = dict(self.fields)["tgt"].base_field
vocab = tgt_field.vocab
tokens = []
for tok in pred:
if tok < len(vocab):
tokens.append(vocab.itos[tok])
else:
tokens.append(src_vocab.itos[tok - len(vocab)])
if tokens[-1] == tgt_field.eos_token:
tokens = tokens[:-1]
break
if self.replace_unk and attn is not None and src is not None:
for i in range(len(tokens)):
if tokens[i] == tgt_field.unk_token:
_, max_index = attn[i][:len(src_raw)].max(0)
tokens[i] = src_raw[max_index.item()]
if self.phrase_table != "":
with open(self.phrase_table, "r") as f:
for line in f:
if line.startswith(src_raw[max_index.item()]):
tokens[i] = line.split('|||')[1].strip()
return tokens
def from_batch(self, translation_batch):
batch = translation_batch["batch"]
assert(len(translation_batch["gold_score"]) ==
len(translation_batch["predictions"]))
batch_size = batch.batch_size
preds, pred_score, attn, align, gold_score, indices = list(zip(
*sorted(zip(translation_batch["predictions"],
translation_batch["scores"],
translation_batch["attention"],
translation_batch["alignment"],
translation_batch["gold_score"],
batch.indices.data),
key=lambda x: x[-1])))
if not any(align):  # when align is an empty nested list
align = [None] * batch_size
# Sorting
inds, perm = torch.sort(batch.indices)
if self._has_text_src:
src = batch.src[0][:, :, 0].index_select(1, perm)
else:
src = None
tgt = batch.tgt[:, :, 0].index_select(1, perm) \
if self.has_tgt else None
translations = []
for b in range(batch_size):
if self._has_text_src:
src_vocab = self.data.src_vocabs[inds[b]] \
if self.data.src_vocabs else None
src_raw = self.data.examples[inds[b]].src[0]
else:
src_vocab = None
src_raw = None
pred_sents = [self._build_target_tokens(
src[:, b] if src is not None else None,
src_vocab, src_raw,
preds[b][n], attn[b][n])
for n in range(self.n_best)]
gold_sent = None
if tgt is not None:
gold_sent = self._build_target_tokens(
src[:, b] if src is not None else None,
src_vocab, src_raw,
tgt[1:, b] if tgt is not None else None, None)
translation = Translation(
src[:, b] if src is not None else None,
src_raw, pred_sents, attn[b], pred_score[b],
gold_sent, gold_score[b], align[b]
)
translations.append(translation)
return translations
class Translation(object):
"""Container for a translated sentence.
Attributes:
src (LongTensor): Source word IDs.
src_raw (List[str]): Raw source words.
pred_sents (List[List[str]]): Words from the n-best translations.
pred_scores (List[List[float]]): Log-probs of n-best translations.
attns (List[FloatTensor]) : Attention distribution for each
translation.
gold_sent (List[str]): Words from gold translation.
gold_score (List[float]): Log-prob of gold translation.
word_aligns (List[FloatTensor]): Word alignment distribution for
each translation.
"""
__slots__ = ["src", "src_raw", "pred_sents", "attns", "pred_scores",
"gold_sent", "gold_score", "word_aligns"]
def __init__(self, src, src_raw, pred_sents,
attn, pred_scores, tgt_sent, gold_score, word_aligns):
self.src = src
self.src_raw = src_raw
self.pred_sents = pred_sents
self.attns = attn
self.pred_scores = pred_scores
self.gold_sent = tgt_sent
self.gold_score = gold_score
self.word_aligns = word_aligns
def log(self, sent_number):
"""
Log translation.
"""
msg = ['\nSENT {}: {}\n'.format(sent_number, self.src_raw)]
best_pred = self.pred_sents[0]
best_score = self.pred_scores[0]
pred_sent = ' '.join(best_pred)
msg.append('PRED {}: {}\n'.format(sent_number, pred_sent))
msg.append("PRED SCORE: {:.4f}\n".format(best_score))
if self.word_aligns is not None:
pred_align = self.word_aligns[0]
pred_align_pharaoh = build_align_pharaoh(pred_align)
pred_align_sent = ' '.join(pred_align_pharaoh)
msg.append("ALIGN: {}\n".format(pred_align_sent))
if self.gold_sent is not None:
tgt_sent = ' '.join(self.gold_sent)
msg.append('GOLD {}: {}\n'.format(sent_number, tgt_sent))
msg.append(("GOLD SCORE: {:.4f}\n".format(self.gold_score)))
if len(self.pred_sents) > 1:
msg.append('\nBEST HYP:\n')
for score, sent in zip(self.pred_scores, self.pred_sents):
msg.append("[{:.4f}] {}\n".format(score, sent))
return "".join(msg)
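# Illustration (a sketch under stated assumptions, not the library's code): the
# replace_unk branch of _build_target_tokens swaps each unknown target token
# for the source token that received the highest attention weight. The version
# below uses plain Python lists in place of tensors; replace_unk_tokens and the
# "<unk>" default are hypothetical names chosen for the example.

```python
def replace_unk_tokens(tokens, attn, src_raw, unk_token="<unk>"):
    """Replace each unknown token with the most-attended source token.

    tokens:  predicted target tokens (list of str)
    attn:    one attention row per target token, each a list of floats
             over source positions
    src_raw: raw source tokens (list of str)
    """
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok == unk_token:
            # Only consider positions that map to real source tokens.
            row = attn[i][:len(src_raw)]
            max_index = max(range(len(row)), key=row.__getitem__)
            out[i] = src_raw[max_index]
    return out


src_raw = ["das", "Haus", "ist", "klein"]
tokens = ["the", "<unk>", "is", "small"]
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.1, 0.7, 0.1],
    [0.0, 0.1, 0.1, 0.8],
]
result = replace_unk_tokens(tokens, attn, src_raw)
```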
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/translation_server.py
================================================
#!/usr/bin/env python
"""REST Translation server."""
from __future__ import print_function
import codecs
import sys
import os
import time
import json
import threading
import re
import traceback
import importlib
import torch
import onmt.opts
from onmt.utils.logging import init_logger
from onmt.utils.misc import set_random_seed
from onmt.utils.misc import check_model_config
from onmt.utils.alignment import to_word_align
from onmt.utils.parse import ArgumentParser
from onmt.translate.translator import build_translator
def critical(func):
"""Decorator for critical section (mutually exclusive code)"""
def wrapper(server_model, *args, **kwargs):
if sys.version_info[0] == 3:
if not server_model.running_lock.acquire(True, 120):
raise ServerModelError("Model %d running lock timeout"
% server_model.model_id)
else:
# semaphore doesn't have a timeout arg in Python 2.7
server_model.running_lock.acquire(True)
try:
o = func(server_model, *args, **kwargs)
except (Exception, RuntimeError):
server_model.running_lock.release()
raise
server_model.running_lock.release()
return o
return wrapper
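# Illustration (a sketch, not a drop-in replacement): on Python 3 only, the
# same critical-section pattern can be written with try/finally, which
# releases the lock on both the normal and the error path. LockTimeoutError
# and DummyModel are hypothetical names; LockTimeoutError stands in for
# ServerModelError so the example is self-contained.

```python
import functools
import threading


class LockTimeoutError(Exception):
    """Hypothetical error type standing in for ServerModelError."""


def critical(func):
    """Run `func` while holding the object's `running_lock`."""
    @functools.wraps(func)
    def wrapper(server_model, *args, **kwargs):
        # Wait up to 120 seconds for the lock, as in the original decorator.
        if not server_model.running_lock.acquire(True, 120):
            raise LockTimeoutError("Model %d running lock timeout"
                                   % server_model.model_id)
        try:
            return func(server_model, *args, **kwargs)
        finally:
            # Runs whether func returned or raised.
            server_model.running_lock.release()
    return wrapper


class DummyModel:
    def __init__(self):
        self.model_id = 0
        self.running_lock = threading.Semaphore(value=1)

    @critical
    def work(self, x):
        return x * 2

    @critical
    def fail(self):
        raise RuntimeError("boom")
```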
class Timer:
def __init__(self, start=False):
self.stime = -1
self.prev = -1
self.times = {}
if start:
self.start()
def start(self):
self.stime = time.time()
self.prev = self.stime
self.times = {}
def tick(self, name=None, tot=False):
t = time.time()
if not tot:
elapsed = t - self.prev
else:
elapsed = t - self.stime
self.prev = t
if name is not None:
self.times[name] = elapsed
return elapsed
class ServerModelError(Exception):
pass
class TranslationServer(object):
def __init__(self):
self.models = {}
self.next_id = 0
def start(self, config_file):
"""Read the config file and pre-/load the models."""
self.config_file = config_file
with open(self.config_file) as f:
self.confs = json.load(f)
self.models_root = self.confs.get('models_root', './available_models')
for i, conf in enumerate(self.confs["models"]):
if "models" not in conf:
if "model" in conf:
# backwards compatibility for confs
conf["models"] = [conf["model"]]
else:
raise ValueError("""Incorrect config file: missing 'models'
parameter for model #%d""" % i)
check_model_config(conf, self.models_root)
kwargs = {'timeout': conf.get('timeout', None),
'load': conf.get('load', None),
'preprocess_opt': conf.get('preprocess', None),
'tokenizer_opt': conf.get('tokenizer', None),
'postprocess_opt': conf.get('postprocess', None),
'on_timeout': conf.get('on_timeout', None),
'model_root': conf.get('model_root', self.models_root)
}
kwargs = {k: v for (k, v) in kwargs.items() if v is not None}
model_id = conf.get("id", None)
opt = conf["opt"]
opt["models"] = conf["models"]
self.preload_model(opt, model_id=model_id, **kwargs)
def clone_model(self, model_id, opt, timeout=-1):
"""Clone a model `model_id`.
Different options may be passed. If `opt` is None, it will use the
same set of options
"""
if model_id in self.models:
if opt is None:
opt = self.models[model_id].user_opt
opt["models"] = self.models[model_id].opt.models
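# Pass timeout by keyword: load_model's second positional parameter
# is model_id, not timeout.
return self.load_model(opt, timeout=timeout)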
else:
raise ServerModelError("No such model '%s'" % str(model_id))
def load_model(self, opt, model_id=None, **model_kwargs):
"""Load a model given a set of options
"""
model_id = self.preload_model(opt, model_id=model_id, **model_kwargs)
load_time = self.models[model_id].load_time
return model_id, load_time
def preload_model(self, opt, model_id=None, **model_kwargs):
"""Preload the model: update the internal data structures.
It will effectively load the model if `load` is set
"""
if model_id is not None:
if model_id in self.models.keys():
raise ValueError("Model ID %d already exists" % model_id)
else:
model_id = self.next_id
while model_id in self.models.keys():
model_id += 1
self.next_id = model_id + 1
print("Pre-loading model %d" % model_id)
model = ServerModel(opt, model_id, **model_kwargs)
self.models[model_id] = model
return model_id
def run(self, inputs):
"""Translate `inputs`
We keep the same format as the Lua version i.e.
``[{"id": model_id, "src": "sequence to translate"},{ ...}]``
We use inputs[0]["id"] as the model id
"""
model_id = inputs[0].get("id", 0)
if model_id in self.models and self.models[model_id] is not None:
return self.models[model_id].run(inputs)
else:
print("Error No such model '%s'" % str(model_id))
raise ServerModelError("No such model '%s'" % str(model_id))
def unload_model(self, model_id):
"""Manually unload a model.
It will free the memory and cancel the timer
"""
if model_id in self.models and self.models[model_id] is not None:
self.models[model_id].unload()
else:
raise ServerModelError("No such model '%s'" % str(model_id))
def list_models(self):
"""Return the list of available models
"""
models = []
for _, model in self.models.items():
models += [model.to_dict()]
return models
class ServerModel(object):
"""Wrap a model with server functionality.
Args:
opt (dict): Options for the Translator
model_id (int): Model ID
preprocess_opt (list): Dotted paths of the preprocessing functions,
or None (extend for CJK)
tokenizer_opt (dict): Options for the tokenizer or None
postprocess_opt (list): Dotted paths of the postprocessing functions,
or None (extend for CJK)
load (bool): whether to load the model during :func:`__init__()`
timeout (int): Seconds before running :func:`do_timeout()`
Negative values mean no timeout.
on_timeout (str): Options are ["to_cpu", "unload"]. Set what to do on
timeout (see :func:`do_timeout()`.)
model_root (str): Path to the model directory
it must contain the model and tokenizer file
"""
def __init__(self, opt, model_id, preprocess_opt=None, tokenizer_opt=None,
postprocess_opt=None, load=False, timeout=-1,
on_timeout="to_cpu", model_root="./"):
self.model_root = model_root
self.opt = self.parse_opt(opt)
self.model_id = model_id
self.preprocess_opt = preprocess_opt
self.tokenizer_opt = tokenizer_opt
self.postprocess_opt = postprocess_opt
self.timeout = timeout
self.on_timeout = on_timeout
self.unload_timer = None
self.user_opt = opt
self.tokenizer = None
if len(self.opt.log_file) > 0:
log_file = os.path.join(model_root, self.opt.log_file)
else:
log_file = None
self.logger = init_logger(log_file=log_file,
log_file_level=self.opt.log_file_level)
self.loading_lock = threading.Event()
self.loading_lock.set()
self.running_lock = threading.Semaphore(value=1)
set_random_seed(self.opt.seed, self.opt.cuda)
if load:
self.load()
def parse_opt(self, opt):
"""Parse the option set passed by the user using `onmt.opts`
Args:
opt (dict): Options passed by the user
Returns:
opt (argparse.Namespace): full set of options for the Translator
"""
prec_argv = sys.argv
sys.argv = sys.argv[:1]
parser = ArgumentParser()
onmt.opts.translate_opts(parser)
models = opt['models']
if not isinstance(models, (list, tuple)):
models = [models]
opt['models'] = [os.path.join(self.model_root, model)
for model in models]
opt['src'] = "dummy_src"
for (k, v) in opt.items():
if k == 'models':
sys.argv += ['-model']
sys.argv += [str(model) for model in v]
elif type(v) == bool:
sys.argv += ['-%s' % k]
else:
sys.argv += ['-%s' % k, str(v)]
opt = parser.parse_args()
ArgumentParser.validate_translate_opts(opt)
opt.cuda = opt.gpu > -1
sys.argv = prec_argv
return opt
@property
def loaded(self):
return hasattr(self, 'translator')
def load(self):
self.loading_lock.clear()
timer = Timer()
self.logger.info("Loading model %d" % self.model_id)
timer.start()
try:
self.translator = build_translator(self.opt,
report_score=False,
out_file=codecs.open(
os.devnull, "w", "utf-8"))
except RuntimeError as e:
raise ServerModelError("Runtime Error: %s" % str(e))
timer.tick("model_loading")
if self.preprocess_opt is not None:
self.logger.info("Loading preprocessor")
self.preprocessor = []
for function_path in self.preprocess_opt:
function = get_function_by_path(function_path)
self.preprocessor.append(function)
if self.tokenizer_opt is not None:
self.logger.info("Loading tokenizer")
if "type" not in self.tokenizer_opt:
raise ValueError(
"Missing mandatory tokenizer option 'type'")
if self.tokenizer_opt['type'] == 'sentencepiece':
if "model" not in self.tokenizer_opt:
raise ValueError(
"Missing mandatory tokenizer option 'model'")
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
model_path = os.path.join(self.model_root,
self.tokenizer_opt['model'])
sp.Load(model_path)
self.tokenizer = sp
elif self.tokenizer_opt['type'] == 'pyonmttok':
if "params" not in self.tokenizer_opt:
raise ValueError(
"Missing mandatory tokenizer option 'params'")
import pyonmttok
# "mode" is optional: .get() avoids a KeyError when it is absent
mode = self.tokenizer_opt.get("mode")
# load can be called multiple times: modify copy
tokenizer_params = dict(self.tokenizer_opt["params"])
for key, value in self.tokenizer_opt["params"].items():
if key.endswith("path"):
tokenizer_params[key] = os.path.join(
self.model_root, value)
tokenizer = pyonmttok.Tokenizer(mode,
**tokenizer_params)
self.tokenizer = tokenizer
else:
raise ValueError("Invalid value for tokenizer type")
if self.postprocess_opt is not None:
self.logger.info("Loading postprocessor")
self.postprocessor = []
for function_path in self.postprocess_opt:
function = get_function_by_path(function_path)
self.postprocessor.append(function)
self.load_time = timer.tick()
self.reset_unload_timer()
self.loading_lock.set()
@critical
def run(self, inputs):
"""Translate `inputs` using this model
Args:
inputs (List[dict[str, str]]): [{"src": "..."},{"src": ...}]
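Returns:
results (list): translations, `n_best` per input
scores (list): score of each translation
n_best (int): number of translations per input
times (dict): elapsed time of each stage
aligns (list): alignments (None where no alignment is reported)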
"""
self.stop_unload_timer()
timer = Timer()
timer.start()
self.logger.info("Running translation using %d" % self.model_id)
if not self.loading_lock.is_set():
self.logger.info(
"Model #%d is being loaded by another thread, waiting"
% self.model_id)
if not self.loading_lock.wait(timeout=30):
raise ServerModelError("Model %d loading timeout"
% self.model_id)
else:
if not self.loaded:
self.load()
timer.tick(name="load")
elif self.opt.cuda:
self.to_gpu()
timer.tick(name="to_gpu")
texts = []
head_spaces = []
tail_spaces = []
sslength = []
for i, inp in enumerate(inputs):
src = inp['src']
if src.strip() == "":
head_spaces.append(src)
texts.append("")
tail_spaces.append("")
else:
whitespaces_before, whitespaces_after = "", ""
match_before = re.search(r'^\s+', src)
match_after = re.search(r'\s+$', src)
if match_before is not None:
whitespaces_before = match_before.group(0)
if match_after is not None:
whitespaces_after = match_after.group(0)
head_spaces.append(whitespaces_before)
preprocessed_src = self.maybe_preprocess(src.strip())
tok = self.maybe_tokenize(preprocessed_src)
texts.append(tok)
sslength.append(len(tok.split()))
tail_spaces.append(whitespaces_after)
empty_indices = [i for i, x in enumerate(texts) if x == ""]
texts_to_translate = [x for x in texts if x != ""]
scores = []
predictions = []
if len(texts_to_translate) > 0:
try:
scores, predictions = self.translator.translate(
texts_to_translate,
batch_size=len(texts_to_translate)
if self.opt.batch_size == 0
else self.opt.batch_size)
except (RuntimeError, Exception) as e:
err = "Error: %s" % str(e)
self.logger.error(err)
self.logger.error("repr(text_to_translate): "
+ repr(texts_to_translate))
self.logger.error("model: #%s" % self.model_id)
self.logger.error("model opt: " + str(self.opt.__dict__))
self.logger.error(traceback.format_exc())
raise ServerModelError(err)
timer.tick(name="translation")
self.logger.info("""Using model #%d\t%d inputs
\ttranslation time: %f""" % (self.model_id, len(texts),
timer.times['translation']))
self.reset_unload_timer()
# NOTE: translator returns lists of `n_best` list
def flatten_list(_list): return sum(_list, [])
tiled_texts = [t for t in texts_to_translate
for _ in range(self.opt.n_best)]
results = flatten_list(predictions)
scores = [score_tensor.item()
for score_tensor in flatten_list(scores)]
results = [self.maybe_detokenize_with_align(result, src)
for result, src in zip(results, tiled_texts)]
aligns = [align for _, align in results]
results = [self.maybe_postprocess(seq) for seq, _ in results]
# build back results with empty texts
for i in empty_indices:
j = i * self.opt.n_best
results = results[:j] + [""] * self.opt.n_best + results[j:]
aligns = aligns[:j] + [None] * self.opt.n_best + aligns[j:]
scores = scores[:j] + [0] * self.opt.n_best + scores[j:]
head_spaces = [h for h in head_spaces for i in range(self.opt.n_best)]
tail_spaces = [h for h in tail_spaces for i in range(self.opt.n_best)]
results = ["".join(items)
for items in zip(head_spaces, results, tail_spaces)]
self.logger.info("Translation Results: %d", len(results))
return results, scores, self.opt.n_best, timer.times, aligns
def do_timeout(self):
"""Timeout function that frees GPU memory.
Moves the model to CPU or unloads it, depending on the value of
:attr:`self.on_timeout`.
"""
if self.on_timeout == "unload":
self.logger.info("Timeout: unloading model %d" % self.model_id)
self.unload()
if self.on_timeout == "to_cpu":
self.logger.info("Timeout: sending model %d to CPU"
% self.model_id)
self.to_cpu()
@critical
def unload(self):
self.logger.info("Unloading model %d" % self.model_id)
del self.translator
if self.opt.cuda:
torch.cuda.empty_cache()
self.unload_timer = None
def stop_unload_timer(self):
if self.unload_timer is not None:
self.unload_timer.cancel()
def reset_unload_timer(self):
if self.timeout < 0:
return
self.stop_unload_timer()
self.unload_timer = threading.Timer(self.timeout, self.do_timeout)
self.unload_timer.start()
def to_dict(self):
hide_opt = ["models", "src"]
d = {"model_id": self.model_id,
"opt": {k: self.user_opt[k] for k in self.user_opt.keys()
if k not in hide_opt},
"models": self.user_opt["models"],
"loaded": self.loaded,
"timeout": self.timeout,
}
if self.tokenizer_opt is not None:
d["tokenizer"] = self.tokenizer_opt
return d
@critical
def to_cpu(self):
"""Move the model to CPU and clear CUDA cache."""
self.translator.model.cpu()
if self.opt.cuda:
torch.cuda.empty_cache()
def to_gpu(self):
"""Move the model to GPU."""
torch.cuda.set_device(self.opt.gpu)
self.translator.model.cuda()
def maybe_preprocess(self, sequence):
"""Preprocess the sequence (or not)
"""
if self.preprocess_opt is not None:
return self.preprocess(sequence)
return sequence
def preprocess(self, sequence):
"""Preprocess a single sequence.
Args:
sequence (str): The sequence to preprocess.
Returns:
sequence (str): The preprocessed sequence.
"""
if self.preprocessor is None:
raise ValueError("No preprocessor loaded")
for function in self.preprocessor:
sequence = function(sequence)
return sequence
def maybe_tokenize(self, sequence):
"""Tokenize the sequence (or not).
Same args/returns as `tokenize`
"""
if self.tokenizer_opt is not None:
return self.tokenize(sequence)
return sequence
def tokenize(self, sequence):
"""Tokenize a single sequence.
Args:
sequence (str): The sequence to tokenize.
Returns:
tok (str): The tokenized sequence.
"""
if self.tokenizer is None:
raise ValueError("No tokenizer loaded")
if self.tokenizer_opt["type"] == "sentencepiece":
tok = self.tokenizer.EncodeAsPieces(sequence)
tok = " ".join(tok)
elif self.tokenizer_opt["type"] == "pyonmttok":
tok, _ = self.tokenizer.tokenize(sequence)
tok = " ".join(tok)
return tok
@property
def tokenizer_marker(self):
marker = None
tokenizer_type = self.tokenizer_opt.get('type', None)
if tokenizer_type == "pyonmttok":
params = self.tokenizer_opt.get('params', None)
if params is not None:
if params.get("joiner_annotate", None) is not None:
marker = 'joiner'
elif params.get("spacer_annotate", None) is not None:
marker = 'spacer'
elif tokenizer_type == "sentencepiece":
marker = 'spacer'
return marker
def maybe_detokenize_with_align(self, sequence, src):
"""De-tokenize (or not) the sequence (with alignment).
Args:
sequence (str): The sequence to detokenize, possibly with
alignment separated by ` ||| `.
src (str): The tokenized source sequence.
Returns:
sequence (str): The detokenized sequence.
align (str): The alignment corresponding to the detokenized src/tgt,
sorted, or None if there is no alignment in the output.
"""
align = None
if self.opt.report_align:
# output contain alignment
sequence, align = sequence.split(' ||| ')
align = self.maybe_convert_align(src, sequence, align)
sequence = self.maybe_detokenize(sequence)
return (sequence, align)
def maybe_detokenize(self, sequence):
"""De-tokenize the sequence (or not)
Same args/returns as :func:`tokenize()`
"""
if self.tokenizer_opt is not None and ''.join(sequence.split()) != '':
return self.detokenize(sequence)
return sequence
def detokenize(self, sequence):
"""Detokenize a single sequence
Same args/returns as :func:`tokenize()`
"""
if self.tokenizer is None:
raise ValueError("No tokenizer loaded")
if self.tokenizer_opt["type"] == "sentencepiece":
detok = self.tokenizer.DecodePieces(sequence.split())
elif self.tokenizer_opt["type"] == "pyonmttok":
detok = self.tokenizer.detokenize(sequence.split())
return detok
def maybe_convert_align(self, src, tgt, align):
"""Convert alignment to match detokenized src/tgt (or not).
Args:
src (str): The tokenized source sequence.
tgt (str): The tokenized target sequence.
align (str): The alignment corresponding to the src/tgt pair.
Returns:
align (str): The alignment corresponding to the detokenized src/tgt.
"""
if self.tokenizer_marker is not None and ''.join(tgt.split()) != '':
return to_word_align(src, tgt, align, mode=self.tokenizer_marker)
return align
def maybe_postprocess(self, sequence):
"""Postprocess the sequence (or not)
"""
if self.postprocess_opt is not None:
return self.postprocess(sequence)
return sequence
def postprocess(self, sequence):
"""Postprocess a single sequence.
Args:
sequence (str): The sequence to process.
Returns:
sequence (str): The postprocessed sequence.
"""
if self.postprocessor is None:
raise ValueError("No postprocessor loaded")
for function in self.postprocessor:
sequence = function(sequence)
return sequence
def get_function_by_path(path, args=[], kwargs={}):
module_name = ".".join(path.split(".")[:-1])
function_name = path.split(".")[-1]
try:
module = importlib.import_module(module_name)
except (ImportError, ValueError) as e:
print("Cannot import module '%s'" % module_name)
raise e
function = getattr(module, function_name)
return function
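# Illustration (a sketch, assuming only the standard library): the
# preprocess/postprocess options are lists of dotted paths that are resolved
# to functions and then applied in order, as ServerModel.preprocess does. With
# the file layout shown in this dump, an entry would presumably look like
# "onmt.translate.process_zh.zh_segmentator". The example below uses stdlib
# functions instead so that it runs anywhere; resolve and apply_pipeline are
# hypothetical helper names.

```python
import importlib


def resolve(path):
    """Turn 'package.module.function' into the function object."""
    module_name, _, function_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


def apply_pipeline(function_paths, sequence):
    """Apply each resolved str -> str function to the sequence in order."""
    for function in map(resolve, function_paths):
        sequence = function(sequence)
    return sequence


# html.unescape runs first, then urllib.parse.unquote.
result = apply_pipeline(["html.unescape", "urllib.parse.unquote"],
                        "a%20b &amp; c")
```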
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/translate/translator.py
================================================
#!/usr/bin/env python
""" Translator Class and builder """
from __future__ import print_function
import codecs
import os
import math
import time
from itertools import count, zip_longest
import torch
import onmt.model_builder
import onmt.inputters as inputters
import onmt.decoders.ensemble
from onmt.translate.beam_search import BeamSearch
from onmt.translate.greedy_search import GreedySearch
from onmt.utils.misc import tile, set_random_seed, report_matrix
from onmt.utils.alignment import extract_alignment, build_align_pharaoh
from onmt.modules.copy_generator import collapse_copy_scores
def build_translator(opt, report_score=True, logger=None, out_file=None):
if out_file is None:
out_file = codecs.open(opt.output, 'w+', 'utf-8')
load_test_model = onmt.decoders.ensemble.load_test_model \
if len(opt.models) > 1 else onmt.model_builder.load_test_model
fields, model, model_opt = load_test_model(opt)
scorer = onmt.translate.GNMTGlobalScorer.from_opt(opt)
translator = Translator.from_opt(
model,
fields,
opt,
model_opt,
global_scorer=scorer,
out_file=out_file,
report_align=opt.report_align,
report_score=report_score,
logger=logger
)
return translator
def max_tok_len(new, count, sofar):
"""
In token batching scheme, the number of sequences is limited
such that the total number of src/tgt tokens (including padding)
in a batch <= batch_size
"""
# Maintains the longest src and tgt length in the current batch
global max_src_in_batch # this is a hack
# Reset current longest length at a new batch (count=1)
if count == 1:
max_src_in_batch = 0
# max_tgt_in_batch = 0
# Src: [ w1 ... wN ]
max_src_in_batch = max(max_src_in_batch, len(new.src[0]) + 2)
# Tgt: [w1 ... wM ]
src_elements = count * max_src_in_batch
return src_elements
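# Illustration (a simplified sketch, not the iterator's actual algorithm):
# max_tok_len above reports the padded size of a batch as
# count * (longest source length + 2). The hypothetical helper below shows
# the resulting effect, grouping sequence lengths so that this padded size
# stays within a token budget. It avoids the global variable by keeping the
# running maximum in a local.

```python
def batch_by_tokens(lengths, max_tokens, overhead=2):
    """Group lengths so that count * (longest + overhead) <= max_tokens.

    A sequence that exceeds the budget on its own still gets a batch
    to itself.
    """
    batches, current, longest = [], [], 0
    for n in lengths:
        new_longest = max(longest, n + overhead)
        # Would adding this sequence push the padded size over the budget?
        if current and (len(current) + 1) * new_longest > max_tokens:
            batches.append(current)
            current, new_longest = [], n + overhead
        current.append(n)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


batches = batch_by_tokens([3, 4, 10, 2, 2], max_tokens=20)
```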
class Translator(object):
"""Translate a batch of sentences with a saved model.
Args:
model (onmt.modules.NMTModel): NMT model to use for translation
fields (dict[str, torchtext.data.Field]): A dict
mapping each side to its list of name-Field pairs.
src_reader (onmt.inputters.DataReaderBase): Source reader.
tgt_reader (onmt.inputters.TextDataReader): Target reader.
gpu (int): GPU device. Set to negative for no GPU.
n_best (int): How many beams to wait for.
min_length (int): See
:class:`onmt.translate.decode_strategy.DecodeStrategy`.
max_length (int): See
:class:`onmt.translate.decode_strategy.DecodeStrategy`.
beam_size (int): Number of beams.
random_sampling_topk (int): See
:class:`onmt.translate.greedy_search.GreedySearch`.
random_sampling_temp (int): See
:class:`onmt.translate.greedy_search.GreedySearch`.
stepwise_penalty (bool): Whether coverage penalty is applied every step
or not.
dump_beam (bool): Debugging option.
block_ngram_repeat (int): See
:class:`onmt.translate.decode_strategy.DecodeStrategy`.
ignore_when_blocking (set or frozenset): See
:class:`onmt.translate.decode_strategy.DecodeStrategy`.
replace_unk (bool): Replace unknown token.
data_type (str): Source data type.
verbose (bool): Print/log every translation.
report_bleu (bool): Print/log Bleu metric.
report_rouge (bool): Print/log Rouge metric.
report_time (bool): Print/log total time/frequency.
copy_attn (bool): Use copy attention.
global_scorer (onmt.translate.GNMTGlobalScorer): Translation
scoring/reranking object.
out_file (TextIO or codecs.StreamReaderWriter): Output file.
report_score (bool) : Whether to report scores
logger (logging.Logger or NoneType): Logger.
"""
def __init__(
self,
model,
fields,
src_reader,
tgt_reader,
gpu=-1,
n_best=1,
min_length=0,
max_length=100,
ratio=0.,
beam_size=30,
random_sampling_topk=1,
random_sampling_temp=1,
stepwise_penalty=None,
dump_beam=False,
block_ngram_repeat=0,
ignore_when_blocking=frozenset(),
replace_unk=False,
phrase_table="",
data_type="text",
verbose=False,
report_bleu=False,
report_rouge=False,
report_time=False,
copy_attn=False,
global_scorer=None,
out_file=None,
report_align=False,
report_score=True,
logger=None,
seed=-1):
self.model = model
self.fields = fields
tgt_field = dict(self.fields)["tgt"].base_field
self._tgt_vocab = tgt_field.vocab
self._tgt_eos_idx = self._tgt_vocab.stoi[tgt_field.eos_token]
self._tgt_pad_idx = self._tgt_vocab.stoi[tgt_field.pad_token]
self._tgt_bos_idx = self._tgt_vocab.stoi[tgt_field.init_token]
self._tgt_unk_idx = self._tgt_vocab.stoi[tgt_field.unk_token]
self._tgt_vocab_len = len(self._tgt_vocab)
self._gpu = gpu
self._use_cuda = gpu > -1
self._dev = torch.device("cuda", self._gpu) \
if self._use_cuda else torch.device("cpu")
self.n_best = n_best
self.max_length = max_length
self.beam_size = beam_size
self.random_sampling_temp = random_sampling_temp
self.sample_from_topk = random_sampling_topk
self.min_length = min_length
self.ratio = ratio
self.stepwise_penalty = stepwise_penalty
self.dump_beam = dump_beam
self.block_ngram_repeat = block_ngram_repeat
self.ignore_when_blocking = ignore_when_blocking
self._exclusion_idxs = {
self._tgt_vocab.stoi[t] for t in self.ignore_when_blocking}
self.src_reader = src_reader
self.tgt_reader = tgt_reader
self.replace_unk = replace_unk
if self.replace_unk and not self.model.decoder.attentional:
raise ValueError(
"replace_unk requires an attentional decoder.")
self.phrase_table = phrase_table
self.data_type = data_type
self.verbose = verbose
self.report_bleu = report_bleu
self.report_rouge = report_rouge
self.report_time = report_time
self.copy_attn = copy_attn
self.global_scorer = global_scorer
if self.global_scorer.has_cov_pen and \
not self.model.decoder.attentional:
raise ValueError(
"Coverage penalty requires an attentional decoder.")
self.out_file = out_file
self.report_align = report_align
self.report_score = report_score
self.logger = logger
self.use_filter_pred = False
self._filter_pred = None
# for debugging
self.beam_trace = self.dump_beam != ""
self.beam_accum = None
if self.beam_trace:
self.beam_accum = {
"predicted_ids": [],
"beam_parent_ids": [],
"scores": [],
"log_probs": []}
set_random_seed(seed, self._use_cuda)
@classmethod
def from_opt(
cls,
model,
fields,
opt,
model_opt,
global_scorer=None,
out_file=None,
report_align=False,
report_score=True,
logger=None):
"""Alternate constructor.
Args:
model (onmt.modules.NMTModel): See :func:`__init__()`.
fields (dict[str, torchtext.data.Field]): See
:func:`__init__()`.
opt (argparse.Namespace): Command line options
model_opt (argparse.Namespace): Command line options saved with
the model checkpoint.
global_scorer (onmt.translate.GNMTGlobalScorer): See
:func:`__init__()`.
out_file (TextIO or codecs.StreamReaderWriter): See
:func:`__init__()`.
report_align (bool) : See :func:`__init__()`.
report_score (bool) : See :func:`__init__()`.
logger (logging.Logger or NoneType): See :func:`__init__()`.
"""
src_reader = inputters.str2reader[opt.data_type].from_opt(opt)
tgt_reader = inputters.str2reader["text"].from_opt(opt)
return cls(
model,
fields,
src_reader,
tgt_reader,
gpu=opt.gpu,
n_best=opt.n_best,
min_length=opt.min_length,
max_length=opt.max_length,
ratio=opt.ratio,
beam_size=opt.beam_size,
random_sampling_topk=opt.random_sampling_topk,
random_sampling_temp=opt.random_sampling_temp,
stepwise_penalty=opt.stepwise_penalty,
dump_beam=opt.dump_beam,
block_ngram_repeat=opt.block_ngram_repeat,
ignore_when_blocking=set(opt.ignore_when_blocking),
replace_unk=opt.replace_unk,
phrase_table=opt.phrase_table,
data_type=opt.data_type,
verbose=opt.verbose,
report_bleu=opt.report_bleu,
report_rouge=opt.report_rouge,
report_time=opt.report_time,
copy_attn=model_opt.copy_attn,
global_scorer=global_scorer,
out_file=out_file,
report_align=report_align,
report_score=report_score,
logger=logger,
seed=opt.seed)
def _log(self, msg):
if self.logger:
self.logger.info(msg)
else:
print(msg)
def _gold_score(self, batch, memory_bank, src_lengths, src_vocabs,
use_src_map, enc_states, batch_size, src):
if "tgt" in batch.__dict__:
gs = self._score_target(
batch, memory_bank, src_lengths, src_vocabs,
batch.src_map if use_src_map else None)
self.model.decoder.init_state(src, memory_bank, enc_states)
else:
gs = [0] * batch_size
return gs
def translate(
self,
src,
tgt=None,
src_dir=None,
batch_size=None,
batch_type="sents",
attn_debug=False,
align_debug=False,
phrase_table=""):
"""Translate content of ``src`` and get gold scores from ``tgt``.
Args:
src: See :func:`self.src_reader.read()`.
tgt: See :func:`self.tgt_reader.read()`.
src_dir: See :func:`self.src_reader.read()` (only relevant
for certain types of data).
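batch_size (int): number of examples per mini-batch
batch_type (str): "sents" or "tokens"; with "tokens", batch_size
is counted in padded source tokens (see :func:`max_tok_len()`)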
attn_debug (bool): enables the attention logging
align_debug (bool): enables the word alignment logging
Returns:
(`list`, `list`)
* all_scores is a list of `batch_size` lists of `n_best` scores
* all_predictions is a list of `batch_size` lists
of `n_best` predictions
"""
if batch_size is None:
raise ValueError("batch_size must be set")
src_data = {"reader": self.src_reader, "data": src, "dir": src_dir}
tgt_data = {"reader": self.tgt_reader, "data": tgt, "dir": None}
_readers, _data, _dir = inputters.Dataset.config(
[('src', src_data), ('tgt', tgt_data)])
data = inputters.Dataset(
self.fields, readers=_readers, data=_data, dirs=_dir,
sort_key=inputters.str2sortkey[self.data_type],
filter_pred=self._filter_pred
)
data_iter = inputters.OrderedIterator(
dataset=data,
device=self._dev,
batch_size=batch_size,
batch_size_fn=max_tok_len if batch_type == "tokens" else None,
train=False,
sort=False,
sort_within_batch=True,
shuffle=False
)
xlation_builder = onmt.translate.TranslationBuilder(
data, self.fields, self.n_best, self.replace_unk, tgt,
self.phrase_table
)
# Statistics
counter = count(1)
pred_score_total, pred_words_total = 0, 0
gold_score_total, gold_words_total = 0, 0
all_scores = []
all_predictions = []
start_time = time.time()
for batch in data_iter:
batch_data = self.translate_batch(
batch, data.src_vocabs, attn_debug
)
translations = xlation_builder.from_batch(batch_data)
for trans in translations:
all_scores += [trans.pred_scores[:self.n_best]]
pred_score_total += trans.pred_scores[0]
pred_words_total += len(trans.pred_sents[0])
if tgt is not None:
gold_score_total += trans.gold_score
gold_words_total += len(trans.gold_sent) + 1
n_best_preds = [" ".join(pred)
for pred in trans.pred_sents[:self.n_best]]
if self.report_align:
align_pharaohs = [build_align_pharaoh(align) for align
in trans.word_aligns[:self.n_best]]
n_best_preds_align = [" ".join(align) for align
in align_pharaohs]
n_best_preds = [pred + " ||| " + align
for pred, align in zip(
n_best_preds, n_best_preds_align)]
all_predictions += [n_best_preds]
self.out_file.write('\n'.join(n_best_preds) + '\n')
self.out_file.flush()
if self.verbose:
sent_number = next(counter)
output = trans.log(sent_number)
if self.logger:
self.logger.info(output)
else:
os.write(1, output.encode('utf-8'))
if attn_debug:
preds = trans.pred_sents[0]
preds.append('')
attns = trans.attns[0].tolist()
if self.data_type == 'text':
srcs = trans.src_raw
else:
srcs = [str(item) for item in range(len(attns[0]))]
output = report_matrix(srcs, preds, attns)
if self.logger:
self.logger.info(output)
else:
os.write(1, output.encode('utf-8'))
if align_debug:
if trans.gold_sent is not None:
tgts = trans.gold_sent
else:
tgts = trans.pred_sents[0]
align = trans.word_aligns[0].tolist()
if self.data_type == 'text':
srcs = trans.src_raw
else:
srcs = [str(item) for item in range(len(align[0]))]
output = report_matrix(srcs, tgts, align)
if self.logger:
self.logger.info(output)
else:
os.write(1, output.encode('utf-8'))
end_time = time.time()
if self.report_score:
msg = self._report_score('PRED', pred_score_total,
pred_words_total)
self._log(msg)
if tgt is not None:
msg = self._report_score('GOLD', gold_score_total,
gold_words_total)
self._log(msg)
if self.report_bleu:
msg = self._report_bleu(tgt)
self._log(msg)
if self.report_rouge:
msg = self._report_rouge(tgt)
self._log(msg)
if self.report_time:
total_time = end_time - start_time
self._log("Total translation time (s): %f" % total_time)
self._log("Average translation time (s): %f" % (
total_time / len(all_predictions)))
self._log("Tokens per second: %f" % (
pred_words_total / total_time))
if self.dump_beam:
import json
# beam_accum lives on this Translator; close the file after dumping
with codecs.open(self.dump_beam, 'w', 'utf-8') as beam_file:
json.dump(self.beam_accum, beam_file)
return all_scores, all_predictions
def _align_pad_prediction(self, predictions, bos, pad):
"""
Pad the predictions in a batch and prepend BOS.
Args:
predictions (List[List[Tensor]]): `(batch, n_best,)`, for each src
sequence contains n_best tgt predictions, each of which ends with
the eos id.
bos (int): bos index to be used.
pad (int): pad index to be used.
Returns:
batched_nbest_predict (torch.LongTensor): `(batch, n_best, tgt_l)`
"""
dtype, device = predictions[0][0].dtype, predictions[0][0].device
flatten_tgt = [best.tolist() for bests in predictions
for best in bests]
padded_tgt = torch.tensor(
list(zip_longest(*flatten_tgt, fillvalue=pad)),
dtype=dtype, device=device).T
bos_tensor = torch.full([padded_tgt.size(0), 1], bos,
dtype=dtype, device=device)
full_tgt = torch.cat((bos_tensor, padded_tgt), dim=-1)
batched_nbest_predict = full_tgt.view(
len(predictions), -1, full_tgt.size(-1)) # (batch, n_best, tgt_l)
return batched_nbest_predict
def _align_forward(self, batch, predictions):
"""
For a batch of inputs and their predictions, return the alignment of
each prediction as a nested list of Tensors of size ``(batch, n_best,)``.
"""
# (0) add BOS and padding to tgt prediction
if hasattr(batch, 'tgt'):
batch_tgt_idxs = batch.tgt.transpose(1, 2).transpose(0, 2)
else:
batch_tgt_idxs = self._align_pad_prediction(
predictions, bos=self._tgt_bos_idx, pad=self._tgt_pad_idx)
tgt_mask = (batch_tgt_idxs.eq(self._tgt_pad_idx) |
batch_tgt_idxs.eq(self._tgt_eos_idx) |
batch_tgt_idxs.eq(self._tgt_bos_idx))
n_best = batch_tgt_idxs.size(1)
# (1) Encoder forward.
src, enc_states, memory_bank, src_lengths = self._run_encoder(batch)
# (2) Repeat src objects `n_best` times.
# We use batch_size x n_best, get ``(src_len, batch * n_best, nfeat)``
src = tile(src, n_best, dim=1)
enc_states = tile(enc_states, n_best, dim=1)
if isinstance(memory_bank, tuple):
memory_bank = tuple(tile(x, n_best, dim=1) for x in memory_bank)
else:
memory_bank = tile(memory_bank, n_best, dim=1)
src_lengths = tile(src_lengths, n_best) # ``(batch * n_best,)``
# (3) Init decoder with n_best src,
self.model.decoder.init_state(src, memory_bank, enc_states)
# reshape tgt to ``(len, batch * n_best, nfeat)``
tgt = batch_tgt_idxs.view(-1, batch_tgt_idxs.size(-1)).T.unsqueeze(-1)
dec_in = tgt[:-1] # exclude last target from inputs
_, attns = self.model.decoder(
dec_in, memory_bank, memory_lengths=src_lengths, with_align=True)
alignment_attn = attns["align"] # ``(B, tgt_len-1, src_len)``
# masked_select
align_tgt_mask = tgt_mask.view(-1, tgt_mask.size(-1))
prediction_mask = align_tgt_mask[:, 1:] # exclude bos to match pred
# get aligned src id for each prediction's valid tgt tokens
alignment = extract_alignment(
alignment_attn, prediction_mask, src_lengths, n_best)
return alignment
def translate_batch(self, batch, src_vocabs, attn_debug):
"""Translate a batch of sentences."""
with torch.no_grad():
if self.beam_size == 1:
decode_strategy = GreedySearch(
pad=self._tgt_pad_idx,
bos=self._tgt_bos_idx,
eos=self._tgt_eos_idx,
batch_size=batch.batch_size,
min_length=self.min_length, max_length=self.max_length,
block_ngram_repeat=self.block_ngram_repeat,
exclusion_tokens=self._exclusion_idxs,
return_attention=attn_debug or self.replace_unk,
sampling_temp=self.random_sampling_temp,
keep_topk=self.sample_from_topk)
else:
# TODO: support these blacklisted features
assert not self.dump_beam
decode_strategy = BeamSearch(
self.beam_size,
batch_size=batch.batch_size,
pad=self._tgt_pad_idx,
bos=self._tgt_bos_idx,
eos=self._tgt_eos_idx,
n_best=self.n_best,
global_scorer=self.global_scorer,
min_length=self.min_length, max_length=self.max_length,
return_attention=attn_debug or self.replace_unk,
block_ngram_repeat=self.block_ngram_repeat,
exclusion_tokens=self._exclusion_idxs,
stepwise_penalty=self.stepwise_penalty,
ratio=self.ratio)
return self._translate_batch_with_strategy(batch, src_vocabs,
decode_strategy)
def _run_encoder(self, batch):
src, src_lengths = batch.src if isinstance(batch.src, tuple) \
else (batch.src, None)
enc_states, memory_bank, src_lengths = self.model.encoder(
src, src_lengths)
if src_lengths is None:
assert not isinstance(memory_bank, tuple), \
'Ensemble decoding only supported for text data'
src_lengths = torch.Tensor(batch.batch_size) \
.type_as(memory_bank) \
.long() \
.fill_(memory_bank.size(0))
return src, enc_states, memory_bank, src_lengths
def _decode_and_generate(
self,
decoder_in,
memory_bank,
batch,
src_vocabs,
memory_lengths,
src_map=None,
step=None,
batch_offset=None):
if self.copy_attn:
# Turn any copied words into UNKs.
decoder_in = decoder_in.masked_fill(
decoder_in.gt(self._tgt_vocab_len - 1), self._tgt_unk_idx
)
# Decoder forward, takes [tgt_len, batch, nfeats] as input
# and [src_len, batch, hidden] as memory_bank
# in case of inference tgt_len = 1, batch = beam times batch_size
# in case of Gold Scoring tgt_len = actual length, batch = 1 batch
dec_out, dec_attn = self.model.decoder(
decoder_in, memory_bank, memory_lengths=memory_lengths, step=step
)
# Generator forward.
if not self.copy_attn:
if "std" in dec_attn:
attn = dec_attn["std"]
else:
attn = None
log_probs = self.model.generator(dec_out.squeeze(0))
# returns [(batch_size x beam_size) , vocab ] when 1 step
# or [ tgt_len, batch_size, vocab ] when full sentence
else:
attn = dec_attn["copy"]
scores = self.model.generator(dec_out.view(-1, dec_out.size(2)),
attn.view(-1, attn.size(2)),
src_map)
# here we have scores [tgt_lenxbatch, vocab] or [beamxbatch, vocab]
if batch_offset is None:
scores = scores.view(-1, batch.batch_size, scores.size(-1))
scores = scores.transpose(0, 1).contiguous()
else:
scores = scores.view(-1, self.beam_size, scores.size(-1))
scores = collapse_copy_scores(
scores,
batch,
self._tgt_vocab,
src_vocabs,
batch_dim=0,
batch_offset=batch_offset
)
scores = scores.view(decoder_in.size(0), -1, scores.size(-1))
log_probs = scores.squeeze(0).log()
# returns [(batch_size x beam_size) , vocab ] when 1 step
# or [ tgt_len, batch_size, vocab ] when full sentence
return log_probs, attn
def _translate_batch_with_strategy(
self,
batch,
src_vocabs,
decode_strategy):
"""Translate a batch of sentences step by step using cache.
Args:
batch: a batch of sentences, yielded by the data iterator.
src_vocabs (list): list of torchtext.data.Vocab if can_copy.
decode_strategy (DecodeStrategy): the decode strategy used to
generate the translation step by step.
Returns:
results (dict): The translation results.
"""
# (0) Prep the components of the search.
use_src_map = self.copy_attn
parallel_paths = decode_strategy.parallel_paths # beam_size
batch_size = batch.batch_size
# (1) Run the encoder on the src.
src, enc_states, memory_bank, src_lengths = self._run_encoder(batch)
self.model.decoder.init_state(src, memory_bank, enc_states)
results = {
"predictions": None,
"scores": None,
"attention": None,
"batch": batch,
"gold_score": self._gold_score(
batch, memory_bank, src_lengths, src_vocabs, use_src_map,
enc_states, batch_size, src)}
# (2) prep decode_strategy. Possibly repeat src objects.
src_map = batch.src_map if use_src_map else None
fn_map_state, memory_bank, memory_lengths, src_map = \
decode_strategy.initialize(memory_bank, src_lengths, src_map)
if fn_map_state is not None:
self.model.decoder.map_state(fn_map_state)
# (3) Begin decoding step by step:
for step in range(decode_strategy.max_length):
decoder_input = decode_strategy.current_predictions.view(1, -1, 1)
log_probs, attn = self._decode_and_generate(
decoder_input,
memory_bank,
batch,
src_vocabs,
memory_lengths=memory_lengths,
src_map=src_map,
step=step,
batch_offset=decode_strategy.batch_offset)
decode_strategy.advance(log_probs, attn)
any_finished = decode_strategy.is_finished.any()
if any_finished:
decode_strategy.update_finished()
if decode_strategy.done:
break
select_indices = decode_strategy.select_indices
if any_finished:
# Reorder states.
if isinstance(memory_bank, tuple):
memory_bank = tuple(x.index_select(1, select_indices)
for x in memory_bank)
else:
memory_bank = memory_bank.index_select(1, select_indices)
memory_lengths = memory_lengths.index_select(0, select_indices)
if src_map is not None:
src_map = src_map.index_select(1, select_indices)
if parallel_paths > 1 or any_finished:
self.model.decoder.map_state(
lambda state, dim: state.index_select(dim, select_indices))
results["scores"] = decode_strategy.scores
results["predictions"] = decode_strategy.predictions
results["attention"] = decode_strategy.attention
if self.report_align:
results["alignment"] = self._align_forward(
batch, decode_strategy.predictions)
else:
results["alignment"] = [[] for _ in range(batch_size)]
return results
def _score_target(self, batch, memory_bank, src_lengths,
src_vocabs, src_map):
tgt = batch.tgt
tgt_in = tgt[:-1]
log_probs, attn = self._decode_and_generate(
tgt_in, memory_bank, batch, src_vocabs,
memory_lengths=src_lengths, src_map=src_map)
log_probs[:, :, self._tgt_pad_idx] = 0
gold = tgt[1:]
gold_scores = log_probs.gather(2, gold)
gold_scores = gold_scores.sum(dim=0).view(-1)
return gold_scores
def _report_score(self, name, score_total, words_total):
if words_total == 0:
msg = "%s No words predicted" % (name,)
else:
msg = ("%s AVG SCORE: %.4f, %s PPL: %.4f" % (
name, score_total / words_total,
name, math.exp(-score_total / words_total)))
return msg
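As a worked illustration of the formula used by `_report_score`, the standalone sketch below computes the average per-token log-probability and the perplexity it implies. The helper name `avg_score_and_ppl` and the sample totals are assumptions made for illustration; they are not part of the library.

```python
import math


def avg_score_and_ppl(score_total, words_total):
    """Return (average log-prob per token, perplexity), or None if no tokens."""
    if words_total == 0:
        return None
    avg = score_total / words_total
    # Perplexity is the exponential of the negative average log-probability.
    return avg, math.exp(-avg)


# Hypothetical totals: a summed log-probability of -20.0 over 10 tokens.
avg, ppl = avg_score_and_ppl(-20.0, 10)
print("AVG SCORE: %.4f, PPL: %.4f" % (avg, ppl))
```

A less negative average score means a lower perplexity, which is why the two numbers are reported together.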
def _report_bleu(self, tgt_path):
import subprocess
base_dir = os.path.abspath(__file__ + "/../../..")
# Rollback pointer to the beginning.
self.out_file.seek(0)
print()
res = subprocess.check_output(
"perl %s/tools/multi-bleu.perl %s" % (base_dir, tgt_path),
stdin=self.out_file, shell=True
).decode("utf-8")
msg = ">> " + res.strip()
return msg
def _report_rouge(self, tgt_path):
import subprocess
path = os.path.split(os.path.realpath(__file__))[0]
msg = subprocess.check_output(
"python %s/tools/test_rouge.py -r %s -c STDIN" % (path, tgt_path),
shell=True, stdin=self.out_file
).decode("utf-8").strip()
return msg
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/__init__.py
================================================
"""Module defining various utilities."""
from onmt.utils.misc import split_corpus, aeq, use_gpu, set_random_seed
from onmt.utils.alignment import make_batch_align_matrix
from onmt.utils.report_manager import ReportMgr, build_report_manager
from onmt.utils.statistics import Statistics
from onmt.utils.optimizers import MultipleOptimizer, \
Optimizer, AdaFactor
from onmt.utils.earlystopping import EarlyStopping, scorers_from_opts
__all__ = ["split_corpus", "aeq", "use_gpu", "set_random_seed", "ReportMgr",
"build_report_manager", "Statistics",
"MultipleOptimizer", "Optimizer", "AdaFactor", "EarlyStopping",
"scorers_from_opts", "make_batch_align_matrix"]
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/alignment.py
================================================
# -*- coding: utf-8 -*-
import torch
from itertools import accumulate
def make_batch_align_matrix(index_tensor, size=None, normalize=False):
"""
Convert a sparse index_tensor into a batch of alignment matrices,
with each row normalized to sum to 1 if normalize is set.
Args:
index_tensor (LongTensor): ``(N, 3)`` of [batch_id, tgt_id, src_id]
size (List[int]): Size of the sparse tensor.
normalize (bool): if normalize the 2nd dim of resulting tensor.
"""
n_fill, device = index_tensor.size(0), index_tensor.device
value_tensor = torch.ones([n_fill], dtype=torch.float)
dense_tensor = torch.sparse_coo_tensor(
index_tensor.t(), value_tensor, size=size, device=device).to_dense()
if normalize:
row_sum = dense_tensor.sum(-1, keepdim=True) # sum by row(tgt)
# threshold on 1 to avoid div by 0
torch.nn.functional.threshold(row_sum, 1, 1, inplace=True)
dense_tensor.div_(row_sum)
return dense_tensor
def extract_alignment(align_matrix, tgt_mask, src_lens, n_best):
"""
Extract a batched align_matrix into per-translation alignment matrices,
using tgt_mask to filter out invalid tgt positions such as EOS/PAD.
BOS is already excluded from tgt_mask in order to match the prediction.
Args:
align_matrix (Tensor): ``(B, tgt_len, src_len)``,
attention head normalized by Softmax(dim=-1)
tgt_mask (BoolTensor): ``(B, tgt_len)``, True for EOS, PAD.
src_lens (LongTensor): ``(B,)``, containing valid src length
n_best (int): a value indicating number of parallel translation.
* B: denote flattened batch as B = batch_size * n_best.
Returns:
alignments (List[List[FloatTensor]]): ``(batch_size, n_best,)``,
containing valid alignment matrix for each translation.
"""
batch_size_n_best = align_matrix.size(0)
assert batch_size_n_best % n_best == 0
alignments = [[] for _ in range(batch_size_n_best // n_best)]
# treat alignment matrix one by one as each have different lengths
for i, (am_b, tgt_mask_b, src_len) in enumerate(
zip(align_matrix, tgt_mask, src_lens)):
valid_tgt = ~tgt_mask_b
valid_tgt_len = valid_tgt.sum()
# get valid alignment (sub-matrix from full padded alignment matrix)
am_valid_tgt = am_b.masked_select(valid_tgt.unsqueeze(-1)) \
.view(valid_tgt_len, -1)
valid_alignment = am_valid_tgt[:, :src_len] # only keep valid src
alignments[i // n_best].append(valid_alignment)
return alignments
def build_align_pharaoh(valid_alignment):
"""Convert a valid alignment matrix to i-j (src-tgt) Pharaoh format, 0-indexed."""
align_pairs = []
tgt_align_src_id = valid_alignment.argmax(dim=-1)
for tgt_id, src_id in enumerate(tgt_align_src_id.tolist()):
align_pairs.append(str(src_id) + "-" + str(tgt_id))
align_pairs.sort(key=lambda x: int(x.split('-')[-1])) # sort by tgt_id
align_pairs.sort(key=lambda x: int(x.split('-')[0])) # sort by src_id
return align_pairs
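The conversion done by `build_align_pharaoh` can be sketched without tensors. The example below is an assumption-laden stand-in: it uses nested Python lists instead of a `Tensor`, and the name `build_align_pharaoh_lists` is hypothetical. Each target token is linked to the source position with the highest attention weight, and the pairs are emitted as `src-tgt`, ordered by source index and then by target index.

```python
def build_align_pharaoh_lists(alignment):
    """alignment: one row per target token, each row holding the
    attention weights over the source tokens."""
    pairs = []
    for tgt_id, row in enumerate(alignment):
        # argmax over the source positions of this target token
        src_id = max(range(len(row)), key=row.__getitem__)
        pairs.append((src_id, tgt_id))
    pairs.sort()  # by src_id first, then by tgt_id
    return ["%d-%d" % pair for pair in pairs]


attn = [
    [0.7, 0.2, 0.1],  # target 0 attends mostly to source 0
    [0.1, 0.1, 0.8],  # target 1 attends mostly to source 2
    [0.2, 0.6, 0.2],  # target 2 attends mostly to source 1
]
print(" ".join(build_align_pharaoh_lists(attn)))
```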
def to_word_align(src, tgt, subword_align, mode):
"""Convert subword alignment to word alignment.
Args:
src (string): tokenized sentence in source language.
tgt (string): tokenized sentence in target language.
subword_align (string): Pharaoh-format alignment corresponding to
the subword-tokenized src-tgt pair.
mode (string): tokenization mode used by src and tgt,
choose from ["joiner", "spacer"].
Returns:
word_align (string): converted alignments corresponding to the
detokenized src-tgt pair.
"""
src, tgt = src.strip().split(), tgt.strip().split()
subword_align = {(int(a), int(b)) for a, b in (x.split("-")
for x in subword_align.split())}
if mode == 'joiner':
src_map = subword_map_by_joiner(src, marker='■')
tgt_map = subword_map_by_joiner(tgt, marker='■')
elif mode == 'spacer':
src_map = subword_map_by_spacer(src, marker='▁')
tgt_map = subword_map_by_spacer(tgt, marker='▁')
else:
raise ValueError("Invalid value for argument mode!")
word_align = list({"{}-{}".format(src_map[a], tgt_map[b])
for a, b in subword_align})
word_align.sort(key=lambda x: int(x.split('-')[-1])) # sort by tgt_id
word_align.sort(key=lambda x: int(x.split('-')[0])) # sort by src_id
return " ".join(word_align)
def subword_map_by_joiner(subwords, marker='■'):
"""Return the word id of each subword token (annotated by joiner)."""
flags = [0] * len(subwords)
for i, tok in enumerate(subwords):
if tok.endswith(marker):
flags[i] = 1
if tok.startswith(marker):
assert i >= 1 and flags[i-1] != 1, \
"Sentence `{}` not correct!".format(" ".join(subwords))
flags[i-1] = 1
marker_acc = list(accumulate([0] + flags[:-1]))
word_group = [(i - marker_sofar) for i, marker_sofar
in enumerate(marker_acc)]
return word_group
def subword_map_by_spacer(subwords, marker='▁'):
"""Return the word id of each subword token (annotated by spacer)."""
word_group = list(accumulate([int(marker in x) for x in subwords]))
if word_group[0] == 1: # when dummy prefix is set
word_group = [item - 1 for item in word_group]
return word_group
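To make the subword-to-word conversion concrete, here is a minimal standalone sketch of the spacer case. It restates the mapping logic in plain Python under the assumption that `▁` marks the start of each word; the names `spacer_word_ids` and `to_word_pairs` and the sample sentences are illustrative only.

```python
from itertools import accumulate


def spacer_word_ids(subwords, marker="▁"):
    """Map each subword to the index of the word it belongs to."""
    ids = list(accumulate(int(marker in tok) for tok in subwords))
    if ids and ids[0] == 1:  # first token opens word 0, so shift to 0-based
        ids = [i - 1 for i in ids]
    return ids


def to_word_pairs(src_tokens, tgt_tokens, subword_pairs):
    """Project (src_subword, tgt_subword) pairs onto word indices."""
    src_map = spacer_word_ids(src_tokens)
    tgt_map = spacer_word_ids(tgt_tokens)
    # A set removes duplicates created when several subwords share a word.
    return sorted({(src_map[s], tgt_map[t]) for s, t in subword_pairs})


src = "▁un believ able ▁story".split()
tgt = "▁histoire ▁in croy able".split()
subword_pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(spacer_word_ids(src))
print(to_word_pairs(src, tgt, subword_pairs))
```

Three subword links collapse into a single word link here because all three source pieces belong to the same word.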
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/cnn_factory.py
================================================
"""
Implementation of "Convolutional Sequence to Sequence Learning"
"""
import torch
import torch.nn as nn
import torch.nn.init as init
import onmt.modules
SCALE_WEIGHT = 0.5 ** 0.5
def shape_transform(x):
""" Transform the size of the tensors to fit the conv input. """
return torch.unsqueeze(torch.transpose(x, 1, 2), 3)
class GatedConv(nn.Module):
""" Gated convolution for CNN class """
def __init__(self, input_size, width=3, dropout=0.2, nopad=False):
super(GatedConv, self).__init__()
self.conv = onmt.modules.WeightNormConv2d(
input_size, 2 * input_size, kernel_size=(width, 1), stride=(1, 1),
padding=(width // 2 * (1 - nopad), 0))
init.xavier_uniform_(self.conv.weight, gain=(4 * (1 - dropout))**0.5)
self.dropout = nn.Dropout(dropout)
def forward(self, x_var):
x_var = self.dropout(x_var)
x_var = self.conv(x_var)
out, gate = x_var.split(x_var.size(1) // 2, 1)
out = out * torch.sigmoid(gate)
return out
class StackedCNN(nn.Module):
""" Stacked CNN class """
def __init__(self, num_layers, input_size, cnn_kernel_width=3,
dropout=0.2):
super(StackedCNN, self).__init__()
self.dropout = dropout
self.num_layers = num_layers
self.layers = nn.ModuleList()
for _ in range(num_layers):
self.layers.append(
GatedConv(input_size, cnn_kernel_width, dropout))
def forward(self, x):
for conv in self.layers:
x = x + conv(x)
x *= SCALE_WEIGHT
return x
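The gating and the scaled residual connection used by `GatedConv` and `StackedCNN` can be illustrated on plain numbers. The sketch below is a simplification under stated assumptions: it works on a flat Python list standing in for the doubled-channel convolution output, and the helper names `glu` and `residual_step` are hypothetical.

```python
import math

SCALE_WEIGHT = 0.5 ** 0.5


def glu(values):
    """Split the vector in half and gate the first half with sigmoid(second half)."""
    half = len(values) // 2
    out, gate = values[:half], values[half:]
    return [o / (1.0 + math.exp(-g)) for o, g in zip(out, gate)]


def residual_step(x, conv_out):
    """Add the gated output to the input, then rescale by sqrt(0.5)."""
    gated = glu(conv_out)
    return [(xi + gi) * SCALE_WEIGHT for xi, gi in zip(x, gated)]


x = [1.0, -1.0]
conv_out = [2.0, 4.0, 0.0, 0.0]  # gates of 0 give sigmoid(0) = 0.5
print(residual_step(x, conv_out))
```

Multiplying by `SCALE_WEIGHT` after each residual sum is meant to keep the variance of the activations roughly constant across stacked layers.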
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/distributed.py
================================================
""" PyTorch Distributed utils
This piece of code was heavily inspired by the equivalent of Fairseq-py
https://github.com/pytorch/fairseq
"""
from __future__ import print_function
import math
import pickle
import torch.distributed
from onmt.utils.logging import logger
def is_master(opt, device_id):
return opt.gpu_ranks[device_id] == 0
def multi_init(opt, device_id):
dist_init_method = 'tcp://{master_ip}:{master_port}'.format(
master_ip=opt.master_ip,
master_port=opt.master_port)
dist_world_size = opt.world_size
torch.distributed.init_process_group(
backend=opt.gpu_backend, init_method=dist_init_method,
world_size=dist_world_size, rank=opt.gpu_ranks[device_id])
gpu_rank = torch.distributed.get_rank()
if not is_master(opt, device_id):
logger.disabled = True
return gpu_rank
def all_reduce_and_rescale_tensors(tensors, rescale_denom,
buffer_size=10485760):
"""All-reduce and rescale tensors in chunks of the specified size.
Args:
tensors: list of Tensors to all-reduce
rescale_denom: denominator for rescaling summed Tensors
buffer_size: all-reduce chunk size in bytes
"""
# buffer size in bytes, determine equiv. # of elements based on data type
buffer_t = tensors[0].new(
math.ceil(buffer_size / tensors[0].element_size())).zero_()
buffer = []
def all_reduce_buffer():
# copy tensors into buffer_t
offset = 0
for t in buffer:
numel = t.numel()
buffer_t[offset:offset+numel].copy_(t.view(-1))
offset += numel
# all-reduce and rescale
torch.distributed.all_reduce(buffer_t[:offset])
buffer_t.div_(rescale_denom)
# copy all-reduced buffer back into tensors
offset = 0
for t in buffer:
numel = t.numel()
t.view(-1).copy_(buffer_t[offset:offset+numel])
offset += numel
filled = 0
for t in tensors:
sz = t.numel() * t.element_size()
if sz > buffer_size:
# tensor is bigger than buffer, all-reduce and rescale directly
torch.distributed.all_reduce(t)
t.div_(rescale_denom)
elif filled + sz > buffer_size:
# buffer is full, all-reduce and replace buffer with grad
all_reduce_buffer()
buffer = [t]
filled = sz
else:
# add tensor to buffer
buffer.append(t)
filled += sz
if len(buffer) > 0:
all_reduce_buffer()
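The bucketing decision in `all_reduce_and_rescale_tensors` can be traced without any distributed setup. The following sketch is an illustration only: it replaces tensors with their byte sizes and records which tensors would be reduced directly and which would be packed into the shared buffer. The function `plan_buckets` is a hypothetical helper, not part of the module.

```python
def plan_buckets(sizes, buffer_size):
    """Return a list of ("direct" | "buffered", [tensor indices]) steps."""
    plan = []
    bucket, filled = [], 0
    for i, sz in enumerate(sizes):
        if sz > buffer_size:
            # Too large for the buffer: reduced on its own.
            plan.append(("direct", [i]))
        elif filled + sz > buffer_size:
            # Buffer would overflow: flush it, then start a new bucket.
            plan.append(("buffered", bucket))
            bucket, filled = [i], sz
        else:
            bucket.append(i)
            filled += sz
    if bucket:
        plan.append(("buffered", bucket))
    return plan


# Hypothetical byte sizes with a 10-byte buffer.
print(plan_buckets([4, 4, 12, 6, 3], buffer_size=10))
```

Packing small tensors together reduces the number of collective calls, which is usually the dominant cost for many small gradients.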
def all_gather_list(data, max_size=4096):
"""Gathers arbitrary data from all nodes into a list."""
world_size = torch.distributed.get_world_size()
if not hasattr(all_gather_list, '_in_buffer') or \
max_size != all_gather_list._in_buffer.numel():
all_gather_list._in_buffer = torch.cuda.ByteTensor(max_size)
all_gather_list._out_buffers = [
torch.cuda.ByteTensor(max_size)
for i in range(world_size)
]
in_buffer = all_gather_list._in_buffer
out_buffers = all_gather_list._out_buffers
enc = pickle.dumps(data)
enc_size = len(enc)
if enc_size + 2 > max_size:
raise ValueError(
'encoded data exceeds max_size: {}'.format(enc_size + 2))
assert max_size < 255*256
in_buffer[0] = enc_size // 255 # this encoding works for max_size < 65k
in_buffer[1] = enc_size % 255
in_buffer[2:enc_size+2] = torch.ByteTensor(list(enc))
torch.distributed.all_gather(out_buffers, in_buffer.cuda())
results = []
for i in range(world_size):
out_buffer = out_buffers[i]
size = (255 * out_buffer[0].item()) + out_buffer[1].item()
bytes_list = bytes(out_buffer[2:size+2].tolist())
result = pickle.loads(bytes_list)
results.append(result)
return results
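The two-byte length header used by `all_gather_list` can be exercised on its own. The sketch below is a CPU-only illustration that swaps the CUDA byte tensors for a `bytearray`; the helper names `pack` and `unpack` are assumptions. The pickled payload length is stored as a base-255 pair in the first two bytes, followed by the payload, in a fixed-size buffer.

```python
import pickle


def pack(data, max_size=4096):
    """Serialize data into a fixed-size buffer with a 2-byte length header."""
    enc = pickle.dumps(data)
    enc_size = len(enc)
    if enc_size + 2 > max_size:
        raise ValueError(
            "encoded data exceeds max_size: {}".format(enc_size + 2))
    assert max_size < 255 * 256  # both header digits must fit in one byte
    buf = bytearray(max_size)
    buf[0] = enc_size // 255
    buf[1] = enc_size % 255
    buf[2:enc_size + 2] = enc
    return buf


def unpack(buf):
    """Read the length header and unpickle exactly that many bytes."""
    size = 255 * buf[0] + buf[1]
    return pickle.loads(bytes(buf[2:size + 2]))


stats = {"loss": 1.25, "n_words": 300}
print(unpack(pack(stats)))
```

Every worker sends a buffer of the same size, so the header is what tells the receiver where the real payload ends.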
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/earlystopping.py
================================================
from enum import Enum
from onmt.utils.logging import logger
class PatienceEnum(Enum):
IMPROVING = 0
DECREASING = 1
STOPPED = 2
class Scorer(object):
def __init__(self, best_score, name):
self.best_score = best_score
self.name = name
def is_improving(self, stats):
raise NotImplementedError()
def is_decreasing(self, stats):
raise NotImplementedError()
def update(self, stats):
self.best_score = self._caller(stats)
def __call__(self, stats, **kwargs):
return self._caller(stats)
def _caller(self, stats):
raise NotImplementedError()
class PPLScorer(Scorer):
def __init__(self):
super(PPLScorer, self).__init__(float("inf"), "ppl")
def is_improving(self, stats):
return stats.ppl() < self.best_score
def is_decreasing(self, stats):
return stats.ppl() > self.best_score
def _caller(self, stats):
return stats.ppl()
class AccuracyScorer(Scorer):
def __init__(self):
super(AccuracyScorer, self).__init__(float("-inf"), "acc")
def is_improving(self, stats):
return stats.accuracy() > self.best_score
def is_decreasing(self, stats):
return stats.accuracy() < self.best_score
def _caller(self, stats):
return stats.accuracy()
DEFAULT_SCORERS = [PPLScorer(), AccuracyScorer()]
SCORER_BUILDER = {
"ppl": PPLScorer,
"accuracy": AccuracyScorer
}
def scorers_from_opts(opt):
if opt.early_stopping_criteria is None:
return DEFAULT_SCORERS
else:
scorers = []
for criterion in set(opt.early_stopping_criteria):
assert criterion in SCORER_BUILDER.keys(), \
"Criterion {} not found".format(criterion)
scorers.append(SCORER_BUILDER[criterion]())
return scorers
class EarlyStopping(object):
def __init__(self, tolerance, scorers=DEFAULT_SCORERS):
"""
Callable class to keep track of early stopping.
Args:
tolerance (int): number of validation steps without improvement
scorers (list): scorers used to validate performance on dev
"""
self.tolerance = tolerance
self.stalled_tolerance = self.tolerance
self.current_tolerance = self.tolerance
self.early_stopping_scorers = scorers
self.status = PatienceEnum.IMPROVING
self.current_step_best = 0
def __call__(self, valid_stats, step):
"""
Update the internal state of the early stopping mechanism and decide
whether to continue or stop training.
Checks whether the scores from all pre-chosen scorers improved. If
every metric improves, the status is switched to improving and the
tolerance is reset. If every metric deteriorates, the status is
switched to decreasing and the tolerance is decreased; if the
tolerance reaches 0, the status is changed to stopped.
Finally, if some metrics improved and others did not, the step is
considered stalled; after ``tolerance`` stalled steps, the status is
switched to stopped.
:param valid_stats: Statistics of the dev set
:param step: training step at which validation was run
"""
if self.status == PatienceEnum.STOPPED:
# Don't do anything
return
if all([scorer.is_improving(valid_stats) for scorer
in self.early_stopping_scorers]):
self._update_increasing(valid_stats, step)
elif all([scorer.is_decreasing(valid_stats) for scorer
in self.early_stopping_scorers]):
self._update_decreasing()
else:
self._update_stalled()
def _update_stalled(self):
self.stalled_tolerance -= 1
logger.info(
"Stalled patience: {}/{}".format(self.stalled_tolerance,
self.tolerance))
if self.stalled_tolerance == 0:
logger.info(
"Training finished after stalled validations. Early Stop!"
)
self._log_best_step()
self._decreasing_or_stopped_status_update(self.stalled_tolerance)
def _update_increasing(self, valid_stats, step):
self.current_step_best = step
for scorer in self.early_stopping_scorers:
logger.info(
"Model is improving {}: {:g} --> {:g}.".format(
scorer.name, scorer.best_score, scorer(valid_stats))
)
# Update best score of each criteria
scorer.update(valid_stats)
# Reset tolerance
self.current_tolerance = self.tolerance
self.stalled_tolerance = self.tolerance
# Update current status
self.status = PatienceEnum.IMPROVING
def _update_decreasing(self):
# Decrease tolerance
self.current_tolerance -= 1
# Log
logger.info(
"Decreasing patience: {}/{}".format(self.current_tolerance,
self.tolerance)
)
# Log
if self.current_tolerance == 0:
logger.info("Training finished after not improving. Early Stop!")
self._log_best_step()
self._decreasing_or_stopped_status_update(self.current_tolerance)
def _log_best_step(self):
logger.info("Best model found at step {}".format(
self.current_step_best))
def _decreasing_or_stopped_status_update(self, tolerance):
self.status = PatienceEnum.DECREASING \
if tolerance > 0 \
else PatienceEnum.STOPPED
def is_improving(self):
return self.status == PatienceEnum.IMPROVING
def has_stopped(self):
return self.status == PatienceEnum.STOPPED
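The patience mechanism can be illustrated with a much smaller tracker. The sketch below is a deliberate simplification, not the class above: it follows a single metric (validation perplexity, lower is better) and merges the "decreasing" and "stalled" counters into one. The class name `PatienceTracker` and the sample values are assumptions.

```python
class PatienceTracker:
    """Stop after `tolerance` consecutive validations without a new best."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.remaining = tolerance
        self.best = float("inf")
        self.best_step = 0
        self.stopped = False

    def update(self, ppl, step):
        if self.stopped:
            return
        if ppl < self.best:
            # New best score: remember it and reset the patience budget.
            self.best, self.best_step = ppl, step
            self.remaining = self.tolerance
        else:
            self.remaining -= 1
            self.stopped = self.remaining == 0


tracker = PatienceTracker(tolerance=3)
for step, ppl in enumerate([10.0, 8.0, 9.0, 8.5, 8.2], start=1):
    tracker.update(ppl, step)
print(tracker.stopped, tracker.best, tracker.best_step)
```

In this run the best perplexity is reached at the second validation, and the three following validations exhaust the patience budget.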
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/logging.py
================================================
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import logging
from logging.handlers import RotatingFileHandler
logger = logging.getLogger()
def init_logger(log_file=None, log_file_level=logging.NOTSET):
log_format = logging.Formatter("[%(asctime)s %(levelname)s] %(message)s")
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setFormatter(log_format)
logger.handlers = [console_handler]
if log_file and log_file != '':
file_handler = RotatingFileHandler(
log_file, maxBytes=1000, backupCount=10)
file_handler.setLevel(log_file_level)
file_handler.setFormatter(log_format)
logger.addHandler(file_handler)
return logger
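A usage sketch for this style of logger setup follows, using only the standard library. It is an illustration under assumptions: the function `make_logger`, the logger name `demo`, and the larger `max_bytes` default are hypothetical choices made here, and the log file is written to a temporary directory.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_logger(name, log_file=None, max_bytes=1000000, backup_count=3):
    """Build a logger with a console handler and an optional rotating file."""
    fmt = logging.Formatter("[%(asctime)s %(levelname)s] %(message)s")
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.propagate = False  # avoid duplicate lines through the root logger
    console = logging.StreamHandler()
    console.setFormatter(fmt)
    log.handlers = [console]
    if log_file:
        file_handler = RotatingFileHandler(
            log_file, maxBytes=max_bytes, backupCount=backup_count)
        file_handler.setFormatter(fmt)
        log.addHandler(file_handler)
    return log


log_path = os.path.join(tempfile.mkdtemp(), "train.log")
log = make_logger("demo", log_path)
log.info("step %d: ppl %.2f", 100, 12.34)
for handler in log.handlers:
    handler.flush()
```

Replacing `log.handlers` instead of appending to it keeps repeated calls from stacking duplicate console handlers.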
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/loss.py
================================================
"""
This includes: LossComputeBase and the standard NMTLossCompute, and
sharded loss compute stuff.
"""
from __future__ import division
import torch
import torch.nn as nn
import torch.nn.functional as F
import onmt
from onmt.modules.sparse_losses import SparsemaxLoss
from onmt.modules.sparse_activations import LogSparsemax
def build_loss_compute(model, tgt_field, opt, train=True):
"""
Returns a LossCompute subclass which wraps around an nn.Module subclass
(such as nn.NLLLoss) which defines the loss criterion. The LossCompute
object allows this loss to be computed in shards and passes the relevant
data to a Statistics object which handles training/validation logging.
Currently, the NMTLossCompute class handles all loss computation except
for when using a copy mechanism.
"""
device = torch.device("cuda" if onmt.utils.misc.use_gpu(opt) else "cpu")
padding_idx = tgt_field.vocab.stoi[tgt_field.pad_token]
unk_idx = tgt_field.vocab.stoi[tgt_field.unk_token]
if opt.lambda_coverage != 0:
assert opt.coverage_attn, "--coverage_attn needs to be set in " \
"order to use --lambda_coverage != 0"
if opt.copy_attn:
criterion = onmt.modules.CopyGeneratorLoss(
len(tgt_field.vocab), opt.copy_attn_force,
unk_index=unk_idx, ignore_index=padding_idx
)
elif opt.label_smoothing > 0 and train:
criterion = LabelSmoothingLoss(
opt.label_smoothing, len(tgt_field.vocab), ignore_index=padding_idx
)
elif isinstance(model.generator[-1], LogSparsemax):
criterion = SparsemaxLoss(ignore_index=padding_idx, reduction='sum')
else:
criterion = nn.NLLLoss(ignore_index=padding_idx, reduction='sum')
# if the loss function operates on vectors of raw logits instead of
# probabilities, only the first part of the generator needs to be
# passed to the NMTLossCompute. At the moment, the only supported
# loss function of this kind is the sparsemax loss.
use_raw_logits = isinstance(criterion, SparsemaxLoss)
loss_gen = model.generator[0] if use_raw_logits else model.generator
if opt.copy_attn:
compute = onmt.modules.CopyGeneratorLossCompute(
criterion, loss_gen, tgt_field.vocab, opt.copy_loss_by_seqlength,
lambda_coverage=opt.lambda_coverage
)
else:
compute = NMTLossCompute(
criterion, loss_gen, lambda_coverage=opt.lambda_coverage,
lambda_align=opt.lambda_align)
compute.to(device)
return compute
class LossComputeBase(nn.Module):
"""
Class for managing efficient loss computation. Handles
sharding next step predictions and accumulating multiple
loss computations
Users can implement their own loss computation strategy by
subclassing this class. Subclasses need to implement the
_compute_loss() and _make_shard_state() methods.
Args:
criterion (:obj:`nn.Module`) :
loss criterion; its ``ignore_index`` is used as the padding index.
generator (:obj:`nn.Module`) :
module that maps the output of the decoder to a
distribution over the target vocabulary.
"""
def __init__(self, criterion, generator):
super(LossComputeBase, self).__init__()
self.criterion = criterion
self.generator = generator
@property
def padding_idx(self):
return self.criterion.ignore_index
def _make_shard_state(self, batch, output, range_, attns=None):
"""
Make shard state dictionary for shards() to return iterable
shards for efficient loss computation. Subclass must define
this method to match its own _compute_loss() interface.
Args:
batch: the current batch.
output: the predict output from the model.
range_: the range of target positions to compute the loss over,
either the whole batch or a truncated window of it.
attns: the attns dictionary returned from the model.
"""
raise NotImplementedError
def _compute_loss(self, batch, output, target, **kwargs):
"""
Compute the loss. Subclass must define this method.
Args:
batch: the current batch.
output: the predict output from the model.
target: the gold target to compare the output with.
**kwargs(optional): additional info for computing loss.
"""
raise NotImplementedError
def __call__(self,
batch,
output,
attns,
normalization=1.0,
shard_size=0,
trunc_start=0,
trunc_size=None):
"""Compute the forward loss, possibly in shards in which case this
method also runs the backward pass and returns ``None`` as the loss
value.
Also supports truncated BPTT for long sequences by taking a
range in the decoder output sequence to back propagate in.
Range is from `(trunc_start, trunc_start + trunc_size)`.
Note sharding is an exact efficiency trick to relieve memory
required for the generation buffers. Truncation is an
approximate efficiency trick to relieve the memory required
in the RNN buffers.
Args:
batch (batch) : batch of labeled examples
output (:obj:`FloatTensor`) :
output of decoder model `[tgt_len x batch x hidden]`
attns (dict) : dictionary of attention distributions
`[tgt_len x batch x src_len]`
normalization: Optional normalization factor.
shard_size (int) : maximum number of examples in a shard
trunc_start (int) : starting position of truncation window
trunc_size (int) : length of truncation window
Returns:
A tuple with the loss and a :obj:`onmt.utils.Statistics` instance.
"""
if trunc_size is None:
trunc_size = batch.tgt.size(0) - trunc_start
trunc_range = (trunc_start, trunc_start + trunc_size)
shard_state = self._make_shard_state(batch, output, trunc_range, attns)
if shard_size == 0:
loss, stats = self._compute_loss(batch, **shard_state)
return loss / float(normalization), stats
batch_stats = onmt.utils.Statistics()
for shard in shards(shard_state, shard_size):
loss, stats = self._compute_loss(batch, **shard)
loss.div(float(normalization)).backward()
batch_stats.update(stats)
return None, batch_stats
def _stats(self, loss, scores, target):
"""
Args:
loss (:obj:`FloatTensor`): the loss computed by the loss criterion.
scores (:obj:`FloatTensor`): a score for each possible output
target (:obj:`LongTensor`): true targets
Returns:
:obj:`onmt.utils.Statistics` : statistics for this batch.
"""
pred = scores.max(1)[1]
non_padding = target.ne(self.padding_idx)
num_correct = pred.eq(target).masked_select(non_padding).sum().item()
num_non_padding = non_padding.sum().item()
return onmt.utils.Statistics(loss.item(), num_non_padding, num_correct)
def _bottle(self, _v):
return _v.view(-1, _v.size(2))
def _unbottle(self, _v, batch_size):
return _v.view(-1, batch_size, _v.size(1))
class LabelSmoothingLoss(nn.Module):
"""
With label smoothing,
KL-divergence between q_{smoothed ground truth prob.}(w)
and p_{prob. computed by model}(w) is minimized.
"""
def __init__(self, label_smoothing, tgt_vocab_size, ignore_index=-100):
assert 0.0 < label_smoothing <= 1.0
self.ignore_index = ignore_index
super(LabelSmoothingLoss, self).__init__()
smoothing_value = label_smoothing / (tgt_vocab_size - 2)
one_hot = torch.full((tgt_vocab_size,), smoothing_value)
one_hot[self.ignore_index] = 0
self.register_buffer('one_hot', one_hot.unsqueeze(0))
self.confidence = 1.0 - label_smoothing
def forward(self, output, target):
"""
output (FloatTensor): batch_size x n_classes
target (LongTensor): batch_size
"""
model_prob = self.one_hot.repeat(target.size(0), 1)
model_prob.scatter_(1, target.unsqueeze(1), self.confidence)
model_prob.masked_fill_((target == self.ignore_index).unsqueeze(1), 0)
return F.kl_div(output, model_prob, reduction='sum')
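The smoothed target distribution built by `LabelSmoothingLoss` can be checked by hand on a tiny vocabulary. The following is a minimal pure-Python sketch, not the class itself: the helper names `smoothed_targets` and `kl_div_sum` are hypothetical, and lists stand in for tensors. It assumes the same convention as the code above, where the smoothing mass is shared by every class except the gold class and the padding class.

```python
import math


def smoothed_targets(target, vocab_size, label_smoothing, ignore_index):
    """Build the smoothed target distribution for a single token."""
    # A padding target contributes nothing to the loss.
    if target == ignore_index:
        return [0.0] * vocab_size
    # Share the smoothing mass among all classes except the gold class
    # and the padding class, hence the "- 2".
    smoothing_value = label_smoothing / (vocab_size - 2)
    dist = [smoothing_value] * vocab_size
    dist[ignore_index] = 0.0
    dist[target] = 1.0 - label_smoothing
    return dist


def kl_div_sum(log_probs, target_dist):
    """Sum of q * (log q - log p), skipping entries where q is zero."""
    return sum(q * (math.log(q) - lp)
               for lp, q in zip(log_probs, target_dist) if q > 0)


vocab_size, pad = 5, 1
q = smoothed_targets(target=3, vocab_size=vocab_size,
                     label_smoothing=0.1, ignore_index=pad)
# A model that predicts the uniform distribution, given as log-probabilities.
uniform_log_probs = [math.log(1.0 / vocab_size)] * vocab_size
loss = kl_div_sum(uniform_log_probs, q)
print(q, loss)
```

The distribution still sums to one for a non-padding target, which is why the divisor is `vocab_size - 2` rather than `vocab_size`.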
class NMTLossCompute(LossComputeBase):
"""
Standard NMT Loss Computation.
"""
def __init__(self, criterion, generator, normalization="sents",
lambda_coverage=0.0, lambda_align=0.0):
super(NMTLossCompute, self).__init__(criterion, generator)
self.lambda_coverage = lambda_coverage
self.lambda_align = lambda_align
def _make_shard_state(self, batch, output, range_, attns=None):
shard_state = {
"output": output,
"target": batch.tgt[range_[0] + 1: range_[1], :, 0],
}
if self.lambda_coverage != 0.0:
assert attns is not None
coverage = attns.get("coverage", None)
std = attns.get("std", None)
assert std is not None, "lambda_coverage != 0.0 requires " \
"attention mechanism"
assert coverage is not None, "lambda_coverage != 0.0 requires " \
"coverage attention"
shard_state.update({
"std_attn": attns.get("std"),
"coverage_attn": coverage
})
if self.lambda_align != 0.0:
assert attns is not None
# attn_align should be in (batch_size, pad_tgt_size, pad_src_size)
attn_align = attns.get("align", None)
# align_idx should be a Tensor of size([N, 3]), where N is the total
# number of aligned src-tgt pairs in the current batch, each given as
# ['sent_N°_in_batch', 'tgt_id+1', 'src_id'] (check AlignField)
align_idx = batch.align
assert attn_align is not None, "lambda_align != 0.0 requires " \
"alignment attention head"
assert align_idx is not None, "lambda_align != 0.0 requires " \
"guided alignment to be provided"
pad_tgt_size, batch_size, _ = batch.tgt.size()
pad_src_size = batch.src[0].size(0)
align_matrix_size = [batch_size, pad_tgt_size, pad_src_size]
ref_align = onmt.utils.make_batch_align_matrix(
align_idx, align_matrix_size, normalize=True)
# NOTE: tgt-src reference alignment restricted to the range_ of the
# shard (consistent with batch.tgt)
shard_state.update({
"align_head": attn_align,
"ref_align": ref_align[:, range_[0] + 1: range_[1], :]
})
return shard_state
def _compute_loss(self, batch, output, target, std_attn=None,
coverage_attn=None, align_head=None, ref_align=None):
bottled_output = self._bottle(output)
scores = self.generator(bottled_output)
gtruth = target.view(-1)
loss = self.criterion(scores, gtruth)
if self.lambda_coverage != 0.0:
coverage_loss = self._compute_coverage_loss(
std_attn=std_attn, coverage_attn=coverage_attn)
loss += coverage_loss
if self.lambda_align != 0.0:
if align_head.dtype != loss.dtype: # Fix FP16
align_head = align_head.to(loss.dtype)
if ref_align.dtype != loss.dtype:
ref_align = ref_align.to(loss.dtype)
align_loss = self._compute_alignement_loss(
align_head=align_head, ref_align=ref_align)
loss += align_loss
stats = self._stats(loss.clone(), scores, gtruth)
return loss, stats
def _compute_coverage_loss(self, std_attn, coverage_attn):
covloss = torch.min(std_attn, coverage_attn).sum()
covloss *= self.lambda_coverage
return covloss
def _compute_alignement_loss(self, align_head, ref_align):
"""Compute the loss between two partial alignment matrices."""
# align_head contains values in [0, 1) representing attention
# probabilities; zeros result from the context attention src_pad_mask,
# so the corresponding positions in ref_align should also be 0.
# Therefore, clamping align_head to >= 1e-18 should not introduce bias.
align_loss = -align_head.clamp(min=1e-18).log().mul(ref_align).sum()
align_loss *= self.lambda_align
return align_loss
def filter_shard_state(state, shard_size=None):
for k, v in state.items():
if shard_size is None:
yield k, v
elif v is not None:
v_split = []
if isinstance(v, torch.Tensor):
for v_chunk in torch.split(v, shard_size):
v_chunk = v_chunk.data.clone()
v_chunk.requires_grad = v.requires_grad
v_split.append(v_chunk)
yield k, (v, v_split)
def shards(state, shard_size, eval_only=False):
"""
Args:
state: A dictionary which corresponds to the output of
*LossCompute._make_shard_state(). The values for
those keys are Tensor-like or None.
shard_size: The maximum size of the shards yielded by the model.
eval_only: If True, only yield the state, nothing else.
Otherwise, yield shards.
Yields:
Each yielded shard is a dict.
Side effect:
After the last shard, this function does back-propagation.
"""
if eval_only:
yield filter_shard_state(state)
else:
# non_none: the subdict of the state dictionary where the values
# are not None.
non_none = dict(filter_shard_state(state, shard_size))
# Now, the iteration:
# state is a dictionary of sequences of tensor-like but we
# want a sequence of dictionaries of tensors.
# First, unzip the dictionary into a sequence of keys and a
# sequence of tensor-like sequences.
keys, values = zip(*((k, [v_chunk for v_chunk in v_split])
for k, (_, v_split) in non_none.items()))
# Now, yield a dictionary for each shard. The keys are always
# the same. values is a sequence of length #keys where each
# element is a sequence of length #shards. We want to iterate
# over the shards, not over the keys: therefore, the values need
# to be re-zipped by shard and then each shard can be paired
# with the keys.
for shard_tensors in zip(*values):
yield dict(zip(keys, shard_tensors))
# Assumed backprop'd
variables = []
for k, (v, v_split) in non_none.items():
if isinstance(v, torch.Tensor) and state[k].requires_grad:
variables.extend(zip(torch.split(state[k], shard_size),
[v_chunk.grad for v_chunk in v_split]))
inputs, grads = zip(*variables)
torch.autograd.backward(inputs, grads)
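The re-zipping step described in the comments of `shards()` (from a dictionary of chunk sequences to a sequence of per-shard dictionaries) can be illustrated without tensors. This is a simplified sketch under stated assumptions: `split` and `toy_shards` are hypothetical helpers, plain lists replace tensors, and the gradient bookkeeping and the final `torch.autograd.backward` call are deliberately left out.

```python
def split(seq, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def toy_shards(state, shard_size):
    """Yield one dict per shard from a dict of equal-length sequences."""
    # Drop None values, then chunk every remaining value.
    chunked = {k: split(v, shard_size)
               for k, v in state.items() if v is not None}
    keys, values = zip(*chunked.items())
    # `values` is indexed [key][shard]; zip(*values) flips it to
    # [shard][key], so each shard can be paired with the keys again.
    for shard_values in zip(*values):
        yield dict(zip(keys, shard_values))


state = {
    "output": [10, 11, 12, 13, 14],
    "target": [0, 1, 2, 3, 4],
    "coverage": None,  # optional entries may be None and are skipped
}
result = list(toy_shards(state, shard_size=2))
for shard in result:
    print(shard)
```

Because every value is chunked with the same `shard_size`, the chunks line up across keys, and the last shard is simply shorter when the length is not a multiple of the shard size.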
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/misc.py
================================================
# -*- coding: utf-8 -*-
import torch
import random
import inspect
from itertools import islice, repeat
import os
def split_corpus(path, shard_size, default=None):
"""Yield a `list` containing `shard_size` lines of `path`,
or repeatedly generate `default` if `path` is None.
"""
if path is not None:
return _split_corpus(path, shard_size)
else:
return repeat(default)
def _split_corpus(path, shard_size):
"""Yield a `list` containing `shard_size` lines of `path`.
"""
with open(path, "rb") as f:
if shard_size <= 0:
yield f.readlines()
else:
while True:
shard = list(islice(f, shard_size))
if not shard:
break
yield shard
def aeq(*args):
"""
Assert all arguments have the same value
"""
arguments = (arg for arg in args)
first = next(arguments)
assert all(arg == first for arg in arguments), \
"Not all arguments have the same value: " + str(args)
def sequence_mask(lengths, max_len=None):
"""
Creates a boolean mask from sequence lengths.
"""
batch_size = lengths.numel()
max_len = max_len or lengths.max()
return (torch.arange(0, max_len, device=lengths.device)
.type_as(lengths)
.repeat(batch_size, 1)
.lt(lengths.unsqueeze(1)))
def tile(x, count, dim=0):
"""
Tiles x on dimension dim count times.
"""
perm = list(range(len(x.size())))
if dim != 0:
perm[0], perm[dim] = perm[dim], perm[0]
x = x.permute(perm).contiguous()
out_size = list(x.size())
out_size[0] *= count
batch = x.size(0)
x = x.view(batch, -1) \
.transpose(0, 1) \
.repeat(count, 1) \
.transpose(0, 1) \
.contiguous() \
.view(*out_size)
if dim != 0:
x = x.permute(perm).contiguous()
return x
def use_gpu(opt):
"""
Returns True if a GPU is used.
"""
return (hasattr(opt, 'gpu_ranks') and len(opt.gpu_ranks) > 0) or \
(hasattr(opt, 'gpu') and opt.gpu > -1)
def set_random_seed(seed, is_cuda):
"""Sets the random seed."""
if seed > 0:
torch.manual_seed(seed)
# this one is needed for torchtext random call (shuffled iterator)
# in multi gpu it ensures datasets are read in the same order
random.seed(seed)
# some cudnn methods can be random even after fixing the seed
# unless you tell it to be deterministic
torch.backends.cudnn.deterministic = True
if is_cuda and seed > 0:
# These ensure same initialization in multi gpu mode
torch.cuda.manual_seed(seed)
def generate_relative_positions_matrix(length, max_relative_positions,
cache=False):
"""Generate the clipped relative positions matrix
for a given length and maximum relative positions"""
if cache:
distance_mat = torch.arange(-length+1, 1, 1).unsqueeze(0)
else:
range_vec = torch.arange(length)
range_mat = range_vec.unsqueeze(-1).expand(-1, length).transpose(0, 1)
distance_mat = range_mat - range_mat.transpose(0, 1)
distance_mat_clipped = torch.clamp(distance_mat,
min=-max_relative_positions,
max=max_relative_positions)
# Shift values to be >= 0
final_mat = distance_mat_clipped + max_relative_positions
return final_mat
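To see what the clipped relative positions matrix looks like, here is a small pure-Python sketch of the non-cached branch. The function name `relative_positions` is hypothetical and nested lists stand in for the tensor; entry `[i][j]` is the offset `j - i`, clipped to `[-k, k]` and then shifted by `k` so that every index is non-negative.

```python
def relative_positions(length, max_relative_positions):
    """Clipped, shifted relative-position indices as a list of lists."""
    k = max_relative_positions
    return [[min(max(j - i, -k), k) + k for j in range(length)]
            for i in range(length)]


matrix = relative_positions(length=4, max_relative_positions=2)
for row in matrix:
    print(row)
```

With `max_relative_positions=2` the indices fall in `0..4`, so an embedding table of size `2 * max_relative_positions + 1` is enough to look them up.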
def relative_matmul(x, z, transpose):
"""Helper function for relative positions attention."""
batch_size = x.shape[0]
heads = x.shape[1]
length = x.shape[2]
x_t = x.permute(2, 0, 1, 3)
x_t_r = x_t.reshape(length, heads * batch_size, -1)
if transpose:
z_t = z.transpose(1, 2)
x_tz_matmul = torch.matmul(x_t_r, z_t)
else:
x_tz_matmul = torch.matmul(x_t_r, z)
x_tz_matmul_r = x_tz_matmul.reshape(length, batch_size, heads, -1)
x_tz_matmul_r_t = x_tz_matmul_r.permute(1, 2, 0, 3)
return x_tz_matmul_r_t
def fn_args(fun):
"""Returns the list of the function's argument names."""
return inspect.getfullargspec(fun).args
def report_matrix(row_label, column_label, matrix):
header_format = "{:>10.10} " + "{:>10.7} " * len(row_label)
row_format = "{:>10.10} " + "{:>10.7f} " * len(row_label)
output = header_format.format("", *row_label) + '\n'
for word, row in zip(column_label, matrix):
max_index = row.index(max(row))
row_format = row_format.replace(
"{:>10.7f} ", "{:*>10.7f} ", max_index + 1)
row_format = row_format.replace(
"{:*>10.7f} ", "{:>10.7f} ", max_index)
output += row_format.format(word, *row) + '\n'
row_format = "{:>10.10} " + "{:>10.7f} " * len(row_label)
return output
def check_model_config(model_config, root):
# we need to check the model path + any tokenizer path
for model in model_config["models"]:
model_path = os.path.join(root, model)
if not os.path.exists(model_path):
raise FileNotFoundError(
"{} from model {} does not exist".format(
model_path, model_config["id"]))
if "tokenizer" in model_config.keys():
if "params" in model_config["tokenizer"].keys():
for k, v in model_config["tokenizer"]["params"].items():
if k.endswith("path"):
tok_path = os.path.join(root, v)
if not os.path.exists(tok_path):
raise FileNotFoundError(
"{} from model {} does not exist".format(
tok_path, model_config["id"]))
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/optimizers.py
================================================
""" Optimizers class """
import torch
import torch.optim as optim
from torch.nn.utils import clip_grad_norm_
import operator
import functools
from copy import copy
from math import sqrt
import types
import importlib
from onmt.utils.misc import fn_args
def build_torch_optimizer(model, opt):
"""Builds the PyTorch optimizer.
We use the default parameters for Adam that are suggested by
the original paper https://arxiv.org/pdf/1412.6980.pdf
These values are also used by other established implementations,
e.g. https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer
https://keras.io/optimizers/
The paper "Attention is all you need"
(https://arxiv.org/pdf/1706.03762.pdf) uses slightly different values,
in particular beta2=0.98. However, beta2=0.999 is still arguably the
more established value, so we use that here as well.
Args:
model: The model to optimize.
opt: The dictionary of options.
Returns:
A ``torch.optim.Optimizer`` instance.
"""
params = [p for p in model.parameters() if p.requires_grad]
betas = [opt.adam_beta1, opt.adam_beta2]
if opt.optim == 'sgd':
optimizer = optim.SGD(params, lr=opt.learning_rate)
elif opt.optim == 'adagrad':
optimizer = optim.Adagrad(
params,
lr=opt.learning_rate,
initial_accumulator_value=opt.adagrad_accumulator_init)
elif opt.optim == 'adadelta':
optimizer = optim.Adadelta(params, lr=opt.learning_rate)
elif opt.optim == 'adafactor':
optimizer = AdaFactor(
params,
non_constant_decay=True,
enable_factorization=True,
weight_decay=0)
elif opt.optim == 'adam':
optimizer = optim.Adam(
params,
lr=opt.learning_rate,
betas=betas,
eps=1e-9)
elif opt.optim == 'sparseadam':
dense = []
sparse = []
for name, param in model.named_parameters():
if not param.requires_grad:
continue
# TODO: Find a better way to check for sparse gradients.
if 'embed' in name:
sparse.append(param)
else:
dense.append(param)
optimizer = MultipleOptimizer(
[optim.Adam(
dense,
lr=opt.learning_rate,
betas=betas,
eps=1e-8),
optim.SparseAdam(
sparse,
lr=opt.learning_rate,
betas=betas,
eps=1e-8)])
elif opt.optim == 'fusedadam':
# here we use a copy of FusedAdam() taken from an old Apex repo
optimizer = FusedAdam(
params,
lr=opt.learning_rate,
betas=betas)
else:
raise ValueError('Invalid optimizer type: ' + opt.optim)
if opt.model_dtype == 'fp16':
import apex
if opt.optim != 'fusedadam':
# In this case use the new AMP API from apex
loss_scale = "dynamic" if opt.loss_scale == 0 else opt.loss_scale
model, optimizer = apex.amp.initialize(
[model, model.generator],
optimizer,
opt_level=opt.apex_opt_level,
loss_scale=loss_scale,
keep_batchnorm_fp32=None)
else:
# In this case use the old FusedAdam with FP16_optimizer wrapper
static_loss_scale = opt.loss_scale
dynamic_loss_scale = opt.loss_scale == 0
optimizer = apex.optimizers.FP16_Optimizer(
optimizer,
static_loss_scale=static_loss_scale,
dynamic_loss_scale=dynamic_loss_scale)
return optimizer
def make_learning_rate_decay_fn(opt):
"""Returns the learning rate decay function from options."""
if opt.decay_method == 'noam':
return functools.partial(
noam_decay,
warmup_steps=opt.warmup_steps,
model_size=opt.rnn_size)
elif opt.decay_method == 'noamwd':
return functools.partial(
noamwd_decay,
warmup_steps=opt.warmup_steps,
model_size=opt.rnn_size,
rate=opt.learning_rate_decay,
decay_steps=opt.decay_steps,
start_step=opt.start_decay_steps)
elif opt.decay_method == 'rsqrt':
return functools.partial(
rsqrt_decay, warmup_steps=opt.warmup_steps)
elif opt.start_decay_steps is not None:
return functools.partial(
exponential_decay,
rate=opt.learning_rate_decay,
decay_steps=opt.decay_steps,
start_step=opt.start_decay_steps)
def noam_decay(step, warmup_steps, model_size):
"""Learning rate schedule described in
https://arxiv.org/pdf/1706.03762.pdf.
"""
return (
model_size ** (-0.5) *
min(step ** (-0.5), step * warmup_steps**(-1.5)))
def noamwd_decay(step, warmup_steps,
model_size, rate, decay_steps, start_step=0):
"""Learning rate schedule optimized for huge batches
"""
return (
model_size ** (-0.5) *
min(step ** (-0.5), step * warmup_steps**(-1.5)) *
rate ** (max(step - start_step + decay_steps, 0) // decay_steps))
def exponential_decay(step, rate, decay_steps, start_step=0):
"""A standard exponential decay, scaling the learning rate by :obj:`rate`
every :obj:`decay_steps` steps.
"""
return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)
def rsqrt_decay(step, warmup_steps):
"""Decay based on the reciprocal of the step square root."""
return 1.0 / sqrt(max(step, warmup_steps))
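The decay functions above return a scaling factor that is multiplied by the base learning rate. The sketch below re-states two of the formulas under hypothetical names (`noam_scale`, `exponential_scale`) and evaluates them at a few steps; the warmup length, model size, and decay settings are illustrative values, not defaults taken from the options.

```python
def noam_scale(step, warmup_steps, model_size):
    """Linear warmup followed by inverse square root decay."""
    return (model_size ** -0.5 *
            min(step ** -0.5, step * warmup_steps ** -1.5))


def exponential_scale(step, rate, decay_steps, start_step=0):
    """Multiply by `rate` once at `start_step`, then every `decay_steps`."""
    return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)


warmup, size = 4000, 512
noam = {s: noam_scale(s, warmup, size) for s in (1, 2000, 4000, 8000)}
expo = {s: exponential_scale(s, rate=0.5, decay_steps=10000,
                             start_step=50000)
        for s in (1, 40000, 50000, 60000)}
print(noam)
print(expo)
```

The Noam factor rises until `step == warmup_steps`, where both arguments of `min` coincide, and decays afterwards. The exponential factor stays at 1.0 until `start_step` and is then multiplied by `rate` once per `decay_steps`.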
class MultipleOptimizer(object):
""" Implement multiple optimizers needed for sparse adam """
def __init__(self, op):
"""Stores the list of wrapped optimizers."""
self.optimizers = op
@property
def param_groups(self):
param_groups = []
for optimizer in self.optimizers:
param_groups.extend(optimizer.param_groups)
return param_groups
def zero_grad(self):
"""Zeroes the gradients of every wrapped optimizer."""
for op in self.optimizers:
op.zero_grad()
def step(self):
"""Performs one update step on every wrapped optimizer."""
for op in self.optimizers:
op.step()
@property
def state(self):
"""Returns the merged per-parameter state of all wrapped optimizers."""
return {k: v for op in self.optimizers for k, v in op.state.items()}
def state_dict(self):
"""Returns the list of state dicts, one per wrapped optimizer."""
return [op.state_dict() for op in self.optimizers]
def load_state_dict(self, state_dicts):
"""Loads one state dict into each wrapped optimizer, in order."""
assert len(state_dicts) == len(self.optimizers)
for i in range(len(state_dicts)):
self.optimizers[i].load_state_dict(state_dicts[i])
class Optimizer(object):
"""
Controller class for optimization. Mostly a thin
wrapper for `optim`, but also useful for implementing
rate scheduling beyond what is currently available.
Also implements necessary methods for training RNNs such
as grad manipulations.
"""
def __init__(self,
optimizer,
learning_rate,
learning_rate_decay_fn=None,
max_grad_norm=None):
"""Initializes the controller.
Args:
optimizer: A ``torch.optim.Optimizer`` instance.
learning_rate: The initial learning rate.
learning_rate_decay_fn: An optional callable taking the current step
as argument and returning a learning rate scaling factor.
max_grad_norm: Clip gradients to this global norm.
"""
self._optimizer = optimizer
self._learning_rate = learning_rate
self._learning_rate_decay_fn = learning_rate_decay_fn
self._max_grad_norm = max_grad_norm or 0
self._training_step = 1
self._decay_step = 1
self._fp16 = None
@classmethod
def from_opt(cls, model, opt, checkpoint=None):
"""Builds the optimizer from options.
Args:
cls: The ``Optimizer`` class to instantiate.
model: The model to optimize.
opt: The dict of user options.
checkpoint: An optional checkpoint to load states from.
Returns:
An ``Optimizer`` instance.
"""
optim_opt = opt
optim_state_dict = None
if opt.train_from and checkpoint is not None:
optim = checkpoint['optim']
ckpt_opt = checkpoint['opt']
ckpt_state_dict = {}
if isinstance(optim, Optimizer): # Backward compatibility.
ckpt_state_dict['training_step'] = optim._step + 1
ckpt_state_dict['decay_step'] = optim._step + 1
ckpt_state_dict['optimizer'] = optim.optimizer.state_dict()
else:
ckpt_state_dict = optim
if opt.reset_optim == 'none':
# Load everything from the checkpoint.
optim_opt = ckpt_opt
optim_state_dict = ckpt_state_dict
elif opt.reset_optim == 'all':
# Build everything from scratch.
pass
elif opt.reset_optim == 'states':
# Reset optimizer, keep options.
optim_opt = ckpt_opt
optim_state_dict = ckpt_state_dict
del optim_state_dict['optimizer']
elif opt.reset_optim == 'keep_states':
# Reset options, keep optimizer.
optim_state_dict = ckpt_state_dict
optimizer = cls(
build_torch_optimizer(model, optim_opt),
optim_opt.learning_rate,
learning_rate_decay_fn=make_learning_rate_decay_fn(optim_opt),
max_grad_norm=optim_opt.max_grad_norm)
if opt.model_dtype == "fp16":
if opt.optim == "fusedadam":
optimizer._fp16 = "legacy"
else:
optimizer._fp16 = "amp"
if optim_state_dict:
optimizer.load_state_dict(optim_state_dict)
return optimizer
@property
def training_step(self):
"""The current training step."""
return self._training_step
def learning_rate(self):
"""Returns the current learning rate."""
if self._learning_rate_decay_fn is None:
return self._learning_rate
scale = self._learning_rate_decay_fn(self._decay_step)
return scale * self._learning_rate
def state_dict(self):
return {
'training_step': self._training_step,
'decay_step': self._decay_step,
'optimizer': self._optimizer.state_dict()
}
def load_state_dict(self, state_dict):
self._training_step = state_dict['training_step']
# State can be partially restored.
if 'decay_step' in state_dict:
self._decay_step = state_dict['decay_step']
if 'optimizer' in state_dict:
self._optimizer.load_state_dict(state_dict['optimizer'])
def zero_grad(self):
"""Zero the gradients of optimized parameters."""
self._optimizer.zero_grad()
def backward(self, loss):
"""Wrapper for the backward pass. Some optimizers require ownership of
the backward pass."""
if self._fp16 == "amp":
import apex
with apex.amp.scale_loss(loss, self._optimizer) as scaled_loss:
scaled_loss.backward()
elif self._fp16 == "legacy":
kwargs = {}
if "update_master_grads" in fn_args(self._optimizer.backward):
kwargs["update_master_grads"] = True
self._optimizer.backward(loss, **kwargs)
else:
loss.backward()
def step(self):
"""Update the model parameters based on current gradients.
Optionally, will employ gradient modification or update learning
rate.
"""
learning_rate = self.learning_rate()
if self._fp16 == "legacy":
if hasattr(self._optimizer, "update_master_grads"):
self._optimizer.update_master_grads()
if hasattr(self._optimizer, "clip_master_grads") and \
self._max_grad_norm > 0:
self._optimizer.clip_master_grads(self._max_grad_norm)
for group in self._optimizer.param_groups:
group['lr'] = learning_rate
if self._fp16 is None and self._max_grad_norm > 0:
clip_grad_norm_(group['params'], self._max_grad_norm)
self._optimizer.step()
self._decay_step += 1
self._training_step += 1
# Code below is an implementation of https://arxiv.org/pdf/1804.04235.pdf
# inspired by, but modified from, https://github.com/DeadAt0m/adafactor-pytorch
class AdaFactor(torch.optim.Optimizer):
def __init__(self, params, lr=None, beta1=0.9, beta2=0.999, eps1=1e-30,
eps2=1e-3, cliping_threshold=1, non_constant_decay=True,
enable_factorization=True, ams_grad=True, weight_decay=0):
enable_momentum = beta1 != 0
if non_constant_decay:
ams_grad = False
defaults = dict(lr=lr, beta1=beta1, beta2=beta2, eps1=eps1,
eps2=eps2, cliping_threshold=cliping_threshold,
weight_decay=weight_decay, ams_grad=ams_grad,
enable_factorization=enable_factorization,
enable_momentum=enable_momentum,
non_constant_decay=non_constant_decay)
super(AdaFactor, self).__init__(params, defaults)
def __setstate__(self, state):
super(AdaFactor, self).__setstate__(state)
def _experimental_reshape(self, shape):
temp_shape = shape[2:]
if len(temp_shape) == 1:
new_shape = (shape[0], shape[1]*shape[2])
else:
tmp_div = len(temp_shape) // 2 + len(temp_shape) % 2
new_shape = (shape[0]*functools.reduce(operator.mul,
temp_shape[tmp_div:], 1),
shape[1]*functools.reduce(operator.mul,
temp_shape[:tmp_div], 1))
return new_shape, copy(shape)
def _check_shape(self, shape):
'''
output1 - True - algorithm for matrix, False - vector;
output2 - need reshape
'''
if len(shape) > 2:
return True, True
elif len(shape) == 2 and (shape[0] == 1 or shape[1] == 1):
# a 2-D tensor with a singleton dimension is treated as a vector
return False, False
elif len(shape) == 2:
return True, False
else:
return False, False
def _rms(self, x):
return sqrt(torch.mean(x.pow(2)))
def step(self, closure=None):
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group['params']:
if p.grad is None:
continue
grad = p.grad.data
if grad.is_sparse:
raise RuntimeError('AdaFactor does not support sparse '
'gradients, use SparseAdam instead')
is_matrix, is_need_reshape = self._check_shape(grad.size())
new_shape = p.data.size()
if is_need_reshape and group['enable_factorization']:
new_shape, old_shape = \
self._experimental_reshape(p.data.size())
grad = grad.view(new_shape)
state = self.state[p]
if len(state) == 0:
state['step'] = 0
if group['enable_momentum']:
state['exp_avg'] = torch.zeros(new_shape,
dtype=torch.float32,
device=p.grad.device)
if is_matrix and group['enable_factorization']:
state['exp_avg_sq_R'] = \
torch.zeros((1, new_shape[1]),
dtype=torch.float32,
device=p.grad.device)
state['exp_avg_sq_C'] = \
torch.zeros((new_shape[0], 1),
dtype=torch.float32,
device=p.grad.device)
else:
state['exp_avg_sq'] = torch.zeros(new_shape,
dtype=torch.float32,
device=p.grad.device)
if group['ams_grad']:
state['exp_avg_sq_hat'] = \
torch.zeros(new_shape, dtype=torch.float32,
device=p.grad.device)
if group['enable_momentum']:
exp_avg = state['exp_avg']
if is_matrix and group['enable_factorization']:
exp_avg_sq_r = state['exp_avg_sq_R']
exp_avg_sq_c = state['exp_avg_sq_C']
else:
exp_avg_sq = state['exp_avg_sq']
if group['ams_grad']:
exp_avg_sq_hat = state['exp_avg_sq_hat']
state['step'] += 1
lr_t = group['lr']
lr_t *= max(group['eps2'], self._rms(p.data))
if group['enable_momentum']:
if group['non_constant_decay']:
beta1_t = group['beta1'] * \
(1 - group['beta1'] ** (state['step'] - 1)) \
/ (1 - group['beta1'] ** state['step'])
else:
beta1_t = group['beta1']
exp_avg.mul_(beta1_t).add_(1 - beta1_t, grad)
if group['non_constant_decay']:
beta2_t = group['beta2'] * \
(1 - group['beta2'] ** (state['step'] - 1)) / \
(1 - group['beta2'] ** state['step'])
else:
beta2_t = group['beta2']
if is_matrix and group['enable_factorization']:
exp_avg_sq_r.mul_(beta2_t). \
add_(1 - beta2_t, torch.sum(torch.mul(grad, grad).
add_(group['eps1']),
dim=0, keepdim=True))
exp_avg_sq_c.mul_(beta2_t). \
add_(1 - beta2_t, torch.sum(torch.mul(grad, grad).
add_(group['eps1']),
dim=1, keepdim=True))
v = torch.mul(exp_avg_sq_c,
exp_avg_sq_r).div_(torch.sum(exp_avg_sq_r))
else:
exp_avg_sq.mul_(beta2_t). \
addcmul_(1 - beta2_t, grad, grad). \
add_((1 - beta2_t)*group['eps1'])
v = exp_avg_sq
g = grad
if group['enable_momentum']:
g = torch.div(exp_avg, 1 - beta1_t ** state['step'])
if group['ams_grad']:
torch.max(exp_avg_sq_hat, v, out=exp_avg_sq_hat)
v = exp_avg_sq_hat
u = torch.div(g, (torch.div(v, 1 - beta2_t **
state['step'])).sqrt().add_(group['eps1']))
else:
u = torch.div(g, v.sqrt())
u.div_(max(1, self._rms(u) / group['cliping_threshold']))
p.data.add_(-lr_t * (u.view(old_shape) if is_need_reshape and
group['enable_factorization'] else u))
if group['weight_decay'] != 0:
p.data.add_(-group['weight_decay'] * lr_t, p.data)
return loss
class FusedAdam(torch.optim.Optimizer):
"""Implements Adam algorithm. Currently GPU-only.
Requires Apex to be installed via
``python setup.py install --cuda_ext --cpp_ext``.
It has been proposed in `Adam: A Method for Stochastic Optimization`_.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups.
lr (float, optional): learning rate. (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its square.
(default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve
numerical stability. (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional): whether to use the AMSGrad variant of this
algorithm from the paper `On the Convergence of Adam and Beyond`_
(default: False) NOT SUPPORTED in FusedAdam!
eps_inside_sqrt (boolean, optional): in the 'update parameters' step,
adds eps to the bias-corrected second moment estimate before
evaluating square root instead of adding it to the square root of
second moment estimate as in the original paper. (default: False)
.. _Adam: A Method for Stochastic Optimization:
https://arxiv.org/abs/1412.6980
.. _On the Convergence of Adam and Beyond:
https://openreview.net/forum?id=ryQu7f-RZ
"""
def __init__(self, params,
lr=1e-3, bias_correction=True,
betas=(0.9, 0.999), eps=1e-8, eps_inside_sqrt=False,
weight_decay=0., max_grad_norm=0., amsgrad=False):
global fused_adam_cuda
fused_adam_cuda = importlib.import_module("fused_adam_cuda")
if amsgrad:
raise RuntimeError('AMSGrad variant not supported.')
defaults = dict(lr=lr, bias_correction=bias_correction,
betas=betas, eps=eps, weight_decay=weight_decay,
max_grad_norm=max_grad_norm)
super(FusedAdam, self).__init__(params, defaults)
self.eps_mode = 0 if eps_inside_sqrt else 1
def step(self, closure=None, grads=None, output_params=None,
scale=1., grad_norms=None):
"""Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
grads (list of tensors, optional): weight gradient to use for the
optimizer update. If gradients have type torch.half, parameters
are expected to be in type torch.float. (default: None)
output_params (list of tensors, optional): A reduced precision copy
of the updated weights written out in addition to the regular
updated weights. Have to be of same type as gradients.
(default: None)
scale (float, optional): factor to divide gradient tensor values
by before applying to weights. (default: 1)
"""
loss = None
if closure is not None:
loss = closure()
if grads is None:
grads_group = [None]*len(self.param_groups)
# backward compatibility
# assuming a list/generator of parameter means single group
elif isinstance(grads, types.GeneratorType):
grads_group = [grads]
elif type(grads[0]) != list:
grads_group = [grads]
else:
grads_group = grads
if output_params is None:
output_params_group = [None]*len(self.param_groups)
elif isinstance(output_params, types.GeneratorType):
output_params_group = [output_params]
elif type(output_params[0]) != list:
output_params_group = [output_params]
else:
output_params_group = output_params
if grad_norms is None:
grad_norms = [None]*len(self.param_groups)
for group, grads_this_group, output_params_this_group, \
grad_norm in zip(self.param_groups, grads_group,
output_params_group, grad_norms):
if grads_this_group is None:
grads_this_group = [None]*len(group['params'])
if output_params_this_group is None:
output_params_this_group = [None]*len(group['params'])
# compute combined scale factor for this group
combined_scale = scale
if group['max_grad_norm'] > 0:
# norm is in fact norm*scale
clip = ((grad_norm / scale) + 1e-6) / group['max_grad_norm']
if clip > 1:
combined_scale = clip * scale
bias_correction = 1 if group['bias_correction'] else 0
for p, grad, output_param in zip(group['params'],
grads_this_group,
output_params_this_group):
# note: p.grad should not ever be set for correct operation of
# mixed precision optimizer that sometimes sends None gradients
if p.grad is None and grad is None:
continue
if grad is None:
grad = p.grad.data
if grad.is_sparse:
raise RuntimeError('FusedAdam does not support sparse \
gradients, please consider \
SparseAdam instead')
state = self.state[p]
# State initialization
if len(state) == 0:
state['step'] = 0
# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros_like(p.data)
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
beta1, beta2 = group['betas']
state['step'] += 1
out_p = torch.tensor([], dtype=torch.float) if output_param \
is None else output_param
fused_adam_cuda.adam(p.data,
out_p,
exp_avg,
exp_avg_sq,
grad,
group['lr'],
beta1,
beta2,
group['eps'],
combined_scale,
state['step'],
self.eps_mode,
bias_correction,
group['weight_decay'])
return loss
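
The clipping logic in `step` folds gradient clipping into the loss-scale divisor instead of rescaling the gradients separately. A minimal sketch of that arithmetic in plain Python follows; the function name `combined_scale` and the `None` guard are assumptions added for illustration and are not part of `FusedAdam`.

```python
def combined_scale(grad_norm, scale=1.0, max_grad_norm=0.0):
    """Return the divisor applied to gradients, with clipping folded in."""
    # Hypothetical guard: skip clipping when it is disabled or no norm is given.
    if max_grad_norm <= 0 or grad_norm is None:
        return scale
    # grad_norm is the norm of the *scaled* gradients, so unscale it first.
    clip = (grad_norm / scale + 1e-6) / max_grad_norm
    # Only shrink gradients whose unscaled norm exceeds max_grad_norm.
    return clip * scale if clip > 1 else scale


# Scaled norm 40 with loss scale 2 is an unscaled norm of 20, twice the limit.
clipped = combined_scale(40.0, scale=2.0, max_grad_norm=10.0)
# Scaled norm 10 with loss scale 2 is an unscaled norm of 5, under the limit.
unclipped = combined_scale(10.0, scale=2.0, max_grad_norm=10.0)
```

Dividing by the returned value both removes the loss scale and, when needed, brings the gradient norm down to roughly `max_grad_norm` in a single pass.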
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/parse.py
================================================
import configargparse as cfargparse
import os
import torch
import onmt.opts as opts
from onmt.utils.logging import logger
class ArgumentParser(cfargparse.ArgumentParser):
def __init__(
self,
config_file_parser_class=cfargparse.YAMLConfigFileParser,
formatter_class=cfargparse.ArgumentDefaultsHelpFormatter,
**kwargs):
super(ArgumentParser, self).__init__(
config_file_parser_class=config_file_parser_class,
formatter_class=formatter_class,
**kwargs)
@classmethod
def defaults(cls, *args):
"""Get default arguments added to a parser by all ``*args``."""
dummy_parser = cls()
for callback in args:
callback(dummy_parser)
defaults = dummy_parser.parse_known_args([])[0]
return defaults
@classmethod
def update_model_opts(cls, model_opt):
if model_opt.word_vec_size > 0:
model_opt.src_word_vec_size = model_opt.word_vec_size
model_opt.tgt_word_vec_size = model_opt.word_vec_size
if model_opt.layers > 0:
model_opt.enc_layers = model_opt.layers
model_opt.dec_layers = model_opt.layers
if model_opt.rnn_size > 0:
model_opt.enc_rnn_size = model_opt.rnn_size
model_opt.dec_rnn_size = model_opt.rnn_size
model_opt.brnn = model_opt.encoder_type == "brnn"
if model_opt.copy_attn_type is None:
model_opt.copy_attn_type = model_opt.global_attention
if model_opt.alignment_layer is None:
model_opt.alignment_layer = -2
model_opt.lambda_align = 0.0
model_opt.full_context_alignment = False
@classmethod
def validate_model_opts(cls, model_opt):
assert model_opt.model_type in ["text", "img", "audio", "vec"], \
"Unsupported model type %s" % model_opt.model_type
# this check is here because audio allows the encoder and decoder to
# be different sizes, but other model types do not yet
same_size = model_opt.enc_rnn_size == model_opt.dec_rnn_size
assert model_opt.model_type == 'audio' or same_size, \
"The encoder and decoder rnns must be the same size for now"
assert model_opt.rnn_type != "SRU" or model_opt.gpu_ranks, \
"Using SRU requires -gpu_ranks set."
if model_opt.share_embeddings:
if model_opt.model_type != "text":
raise AssertionError(
"--share_embeddings requires --model_type text.")
if model_opt.lambda_align > 0.0:
assert model_opt.decoder_type == 'transformer', \
"Only transformer is supported to joint learn alignment."
assert model_opt.alignment_layer < model_opt.dec_layers and \
model_opt.alignment_layer >= -model_opt.dec_layers, \
"N° alignment_layer should be smaller than number of layers."
logger.info("Joint learn alignment at layer [{}] "
"with {} heads in full_context '{}'.".format(
model_opt.alignment_layer,
model_opt.alignment_heads,
model_opt.full_context_alignment))
@classmethod
def ckpt_model_opts(cls, ckpt_opt):
# Load default opt values, then overwrite with the opts in
# the checkpoint. That way, if there are new options added,
# the defaults are used.
opt = cls.defaults(opts.model_opts)
opt.__dict__.update(ckpt_opt.__dict__)
return opt
@classmethod
def validate_train_opts(cls, opt):
if opt.epochs:
raise AssertionError(
"-epochs is deprecated please use -train_steps.")
if opt.truncated_decoder > 0 and max(opt.accum_count) > 1:
raise AssertionError("BPTT is not compatible with -accum > 1")
if opt.gpuid:
raise AssertionError(
"gpuid is deprecated see world_size and gpu_ranks")
if torch.cuda.is_available() and not opt.gpu_ranks:
logger.warning("You have a CUDA device, should run with -gpu_ranks")
if opt.world_size < len(opt.gpu_ranks):
raise AssertionError(
"the number of -gpu_ranks must be less than or equal "
"to -world_size.")
if opt.world_size == len(opt.gpu_ranks) and \
min(opt.gpu_ranks) > 0:
raise AssertionError(
"-gpu_ranks should have master(=0) rank "
"unless -world_size is greater than len(gpu_ranks).")
assert len(opt.data_ids) == len(opt.data_weights), \
"Please check -data_ids and -data_weights options!"
assert len(opt.dropout) == len(opt.dropout_steps), \
"Number of dropout values must match dropout_steps values"
assert len(opt.attention_dropout) == len(opt.dropout_steps), \
"Number of attention_dropout values must match dropout_steps values"
@classmethod
def validate_translate_opts(cls, opt):
if opt.beam_size != 1 and opt.random_sampling_topk != 1:
raise ValueError('Can either do beam search OR random sampling.')
@classmethod
def validate_preprocess_args(cls, opt):
assert opt.max_shard_size == 0, \
"-max_shard_size is deprecated. Please use \
-shard_size (number of examples) instead."
assert opt.shuffle == 0, \
"-shuffle is not implemented. Please shuffle \
your data before pre-processing."
assert len(opt.train_src) == len(opt.train_tgt), \
"Please provide same number of src and tgt train files!"
assert len(opt.train_src) == len(opt.train_ids), \
"Please provide proper -train_ids for your data!"
for file in opt.train_src + opt.train_tgt:
assert os.path.isfile(file), "Please check path of %s" % file
if len(opt.train_align) == 1 and opt.train_align[0] is None:
opt.train_align = [None] * len(opt.train_src)
else:
assert len(opt.train_align) == len(opt.train_src), \
"Please provide same number of word alignment train \
files as src/tgt!"
for file in opt.train_align:
assert os.path.isfile(file), "Please check path of %s" % file
assert not opt.valid_align or os.path.isfile(opt.valid_align), \
"Please check path of your valid alignment file!"
assert not opt.valid_src or os.path.isfile(opt.valid_src), \
"Please check path of your valid src file!"
assert not opt.valid_tgt or os.path.isfile(opt.valid_tgt), \
"Please check path of your valid tgt file!"
assert not opt.src_vocab or os.path.isfile(opt.src_vocab), \
"Please check path of your src vocab!"
assert not opt.tgt_vocab or os.path.isfile(opt.tgt_vocab), \
"Please check path of your tgt vocab!"
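
The `defaults` and `ckpt_model_opts` methods above rely on one pattern: parse an empty argument list to obtain every default, then overlay the options stored in a checkpoint. The sketch below reproduces that pattern with the standard `argparse` module; the option group `model_opts` and its three flags are hypothetical stand-ins for `onmt.opts.model_opts`.

```python
import argparse


def model_opts(parser):
    # Hypothetical option group; the real one lives in onmt.opts.
    parser.add_argument("--layers", type=int, default=2)
    parser.add_argument("--rnn_size", type=int, default=500)
    parser.add_argument("--alignment_layer", type=int, default=-3)


def defaults(*callbacks):
    """Collect the default value of every option the callbacks register."""
    parser = argparse.ArgumentParser()
    for callback in callbacks:
        callback(parser)
    # Parsing an empty list yields a namespace holding only defaults.
    return parser.parse_known_args([])[0]


def ckpt_opts(ckpt_opt):
    """Start from current defaults, then let checkpoint values win."""
    opt = defaults(model_opts)
    opt.__dict__.update(ckpt_opt.__dict__)
    return opt


# A checkpoint saved before --alignment_layer existed.
old = argparse.Namespace(layers=6, rnn_size=512)
merged = ckpt_opts(old)
```

Options that were added after the checkpoint was written keep their defaults, while everything the checkpoint recorded is preserved.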
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/report_manager.py
================================================
""" Report manager utility """
from __future__ import print_function
import time
from datetime import datetime
import onmt
from onmt.utils.logging import logger
def build_report_manager(opt, gpu_rank):
if opt.tensorboard and gpu_rank == 0:
from torch.utils.tensorboard import SummaryWriter
tensorboard_log_dir = opt.tensorboard_log_dir
if not opt.train_from:
tensorboard_log_dir += datetime.now().strftime("/%b-%d_%H-%M-%S")
writer = SummaryWriter(tensorboard_log_dir, comment="Unmt")
else:
writer = None
report_mgr = ReportMgr(opt.report_every, start_time=-1,
tensorboard_writer=writer)
return report_mgr
class ReportMgrBase(object):
"""
Report Manager Base class
Inherited classes should override:
* `_report_training`
* `_report_step`
"""
def __init__(self, report_every, start_time=-1.):
"""
Args:
report_every(int): Report status every this many sentences
start_time(float): manually set report start time. Negative values
means that you will need to set it later or use `start()`
"""
self.report_every = report_every
self.start_time = start_time
def start(self):
self.start_time = time.time()
def log(self, *args, **kwargs):
logger.info(*args, **kwargs)
def report_training(self, step, num_steps, learning_rate,
report_stats, multigpu=False):
"""
This is the user-defined batch-level training progress
report function.
Args:
step(int): current step count.
num_steps(int): total number of batches.
learning_rate(float): current learning rate.
report_stats(Statistics): old Statistics instance.
Returns:
report_stats(Statistics): updated Statistics instance.
"""
if self.start_time < 0:
raise ValueError("""ReportMgr needs to be started
(set 'start_time' or use 'start()')""")
if step % self.report_every == 0:
if multigpu:
report_stats = \
onmt.utils.Statistics.all_gather_stats(report_stats)
self._report_training(
step, num_steps, learning_rate, report_stats)
return onmt.utils.Statistics()
else:
return report_stats
def _report_training(self, *args, **kwargs):
""" To be overridden """
raise NotImplementedError()
def report_step(self, lr, step, train_stats=None, valid_stats=None):
"""
Report stats of a step
Args:
train_stats(Statistics): training stats
valid_stats(Statistics): validation stats
lr(float): current learning rate
"""
self._report_step(
lr, step, train_stats=train_stats, valid_stats=valid_stats)
def _report_step(self, *args, **kwargs):
raise NotImplementedError()
class ReportMgr(ReportMgrBase):
def __init__(self, report_every, start_time=-1., tensorboard_writer=None):
"""
A report manager that writes statistics on standard output as well as
(optionally) TensorBoard
Args:
report_every(int): Report status every this many sentences
tensorboard_writer(:obj:`tensorboard.SummaryWriter`):
The TensorBoard Summary writer to use or None
"""
super(ReportMgr, self).__init__(report_every, start_time)
self.tensorboard_writer = tensorboard_writer
def maybe_log_tensorboard(self, stats, prefix, learning_rate, step):
if self.tensorboard_writer is not None:
stats.log_tensorboard(
prefix, self.tensorboard_writer, learning_rate, step)
def _report_training(self, step, num_steps, learning_rate,
report_stats):
"""
See base class method `ReportMgrBase.report_training`.
"""
report_stats.output(step, num_steps,
learning_rate, self.start_time)
self.maybe_log_tensorboard(report_stats,
"progress",
learning_rate,
step)
report_stats = onmt.utils.Statistics()
return report_stats
def _report_step(self, lr, step, train_stats=None, valid_stats=None):
"""
See base class method `ReportMgrBase.report_step`.
"""
if train_stats is not None:
self.log('Train perplexity: %g' % train_stats.ppl())
self.log('Train accuracy: %g' % train_stats.accuracy())
self.maybe_log_tensorboard(train_stats,
"train",
lr,
step)
if valid_stats is not None:
self.log('Validation perplexity: %g' % valid_stats.ppl())
self.log('Validation accuracy: %g' % valid_stats.accuracy())
self.maybe_log_tensorboard(valid_stats,
"valid",
lr,
step)
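
The contract between the trainer and `report_training` is easy to miss: the caller must keep the returned object, because a fresh accumulator is handed back whenever a report is emitted. A minimal self-contained sketch of that contract is shown here; `Counter` and `ListReporter` are hypothetical stand-ins for `onmt.utils.Statistics` and `ReportMgr`.

```python
class Counter:
    """Hypothetical stand-in for onmt.utils.Statistics: counts words only."""

    def __init__(self):
        self.n_words = 0


class ListReporter:
    """Collects report lines in a list instead of logging them."""

    def __init__(self, report_every):
        self.report_every = report_every
        self.lines = []

    def report_training(self, step, stats):
        if step % self.report_every == 0:
            self.lines.append("step %d: %d words" % (step, stats.n_words))
            # Hand back a fresh accumulator so the next window starts at zero.
            return Counter()
        # Not a reporting step: keep accumulating into the same object.
        return stats


reporter = ListReporter(report_every=2)
stats = Counter()
for step in range(1, 5):
    stats.n_words += 10
    stats = reporter.report_training(step, stats)
```

Each report therefore covers only the steps since the previous report, not the whole run.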
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/rnn_factory.py
================================================
"""
RNN tools
"""
import torch.nn as nn
import onmt.models
def rnn_factory(rnn_type, **kwargs):
    """ rnn factory, Use pytorch version when available. """
    no_pack_padded_seq = False
    if rnn_type == "SRU":
        # SRU doesn't support PackedSequence.
        no_pack_padded_seq = True
        rnn = onmt.models.sru.SRU(**kwargs)
    else:
        rnn = getattr(nn, rnn_type)(**kwargs)
    return rnn, no_pack_padded_seq
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/onmt/utils/statistics.py
================================================
""" Statistics calculation utility """
from __future__ import division
import time
import math
import sys
from onmt.utils.logging import logger
class Statistics(object):
"""
Accumulator for loss statistics.
Currently calculates:
* accuracy
* perplexity
* elapsed time
"""
def __init__(self, loss=0, n_words=0, n_correct=0):
self.loss = loss
self.n_words = n_words
self.n_correct = n_correct
self.n_src_words = 0
self.start_time = time.time()
@staticmethod
def all_gather_stats(stat, max_size=4096):
"""
Gather a `Statistics` object across multiple processes/nodes
Args:
stat(:obj:Statistics): the statistics object to gather
across all processes/nodes
max_size(int): max buffer size to use
Returns:
`Statistics`, the updated stats object
"""
stats = Statistics.all_gather_stats_list([stat], max_size=max_size)
return stats[0]
@staticmethod
def all_gather_stats_list(stat_list, max_size=4096):
"""
Gather a `Statistics` list across all processes/nodes
Args:
stat_list(list([`Statistics`])): list of statistics objects to
gather across all processes/nodes
max_size(int): max buffer size to use
Returns:
our_stats(list([`Statistics`])): list of updated stats
"""
from torch.distributed import get_rank
from onmt.utils.distributed import all_gather_list
# Get a list of world_size lists with len(stat_list) Statistics objects
all_stats = all_gather_list(stat_list, max_size=max_size)
our_rank = get_rank()
our_stats = all_stats[our_rank]
for other_rank, stats in enumerate(all_stats):
if other_rank == our_rank:
continue
for i, stat in enumerate(stats):
our_stats[i].update(stat, update_n_src_words=True)
return our_stats
def update(self, stat, update_n_src_words=False):
"""
Update statistics by summing values with another `Statistics` object
Args:
stat: another statistic object
update_n_src_words(bool): whether to update (sum) `n_src_words`
or not
"""
self.loss += stat.loss
self.n_words += stat.n_words
self.n_correct += stat.n_correct
if update_n_src_words:
self.n_src_words += stat.n_src_words
def accuracy(self):
""" compute accuracy """
return 100 * (self.n_correct / self.n_words)
def xent(self):
""" compute cross entropy """
return self.loss / self.n_words
def ppl(self):
""" compute perplexity """
return math.exp(min(self.loss / self.n_words, 100))
def elapsed_time(self):
""" compute elapsed time """
return time.time() - self.start_time
def output(self, step, num_steps, learning_rate, start):
"""Write out statistics to stdout.
Args:
step (int): current step
num_steps (int): total number of steps
learning_rate (float): current learning rate
start (float): start time, as returned by time.time()
"""
t = self.elapsed_time()
step_fmt = "%2d" % step
if num_steps > 0:
step_fmt = "%s/%5d" % (step_fmt, num_steps)
logger.info(
("Step %s; acc: %6.2f; ppl: %5.2f; xent: %4.2f; " +
"lr: %7.5f; %3.0f/%3.0f tok/s; %6.0f sec")
% (step_fmt,
self.accuracy(),
self.ppl(),
self.xent(),
learning_rate,
self.n_src_words / (t + 1e-5),
self.n_words / (t + 1e-5),
time.time() - start))
sys.stdout.flush()
def log_tensorboard(self, prefix, writer, learning_rate, step):
""" display statistics to tensorboard """
t = self.elapsed_time()
writer.add_scalar(prefix + "/xent", self.xent(), step)
writer.add_scalar(prefix + "/ppl", self.ppl(), step)
writer.add_scalar(prefix + "/accuracy", self.accuracy(), step)
writer.add_scalar(prefix + "/tgtper", self.n_words / t, step)
writer.add_scalar(prefix + "/lr", learning_rate, step)
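
The three reported metrics are simple functions of the accumulated totals: accuracy is the share of correctly predicted target words, cross entropy is the summed loss per word, and perplexity is its exponential, capped at an exponent of 100 to avoid overflow. A worked sketch with made-up totals follows; the helper name `summarize` is an assumption for illustration.

```python
import math


def summarize(loss, n_words, n_correct):
    """Compute the metrics Statistics reports from raw totals."""
    xent = loss / n_words
    return {
        "accuracy": 100 * n_correct / n_words,
        "xent": xent,
        # Cap the exponent so a diverged loss cannot overflow math.exp.
        "ppl": math.exp(min(xent, 100)),
    }


# Illustrative totals: summed loss 8.0 over 4 target words, 3 of them correct.
summary = summarize(loss=8.0, n_words=4, n_correct=3)
```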
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/preprocess.py
================================================
#!/usr/bin/env python
from onmt.bin.preprocess import main
if __name__ == "__main__":
main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/process_ori_data.py
================================================
# Convert train_data.csv into parallel source/target files.
# Column 1 holds the Chinese tokens and column 2 the English tokens,
# both joined with underscores.
chinese_lists = []
english_lists = []
index = 0
with open('train_data.csv', 'r') as file:
    for line in file:
        if index == 0:
            # skip the header row
            index += 1
            continue
        line = line.strip().split(',')
        chinese = line[1].strip().split('_')
        english = line[2].strip().split('_')
        chinese_lists.append(' '.join(chinese))
        english_lists.append(' '.join(english))
        index += 1
assert len(chinese_lists) == len(english_lists)
# Roughly the first 85% of the lines go to training, the rest to validation.
split_num = int(0.85 * index)
with open('src-train.txt', 'w') as src_train, \
        open('tgt-train.txt', 'w') as tgt_train, \
        open('src-val.txt', 'w') as src_val, \
        open('tgt-val.txt', 'w') as tgt_val:
    for num in range(len(english_lists)):
        if num <= split_num:
            src_train.write(chinese_lists[num] + '\n')
            tgt_train.write(english_lists[num] + '\n')
        else:
            src_val.write(chinese_lists[num] + '\n')
            tgt_val.write(english_lists[num] + '\n')
# Build the test source file from column 2 of test_cs_a.csv.
with open('test_cs_a.csv', 'r') as file, \
        open('src-test.txt', 'w') as src_test:
    for line in file:
        line = line.strip().split(',')
        cont = ' '.join(line[2].split('_'))
        src_test.write(cont + '\n')
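
Splitting each line on `','` would misread a row if a field were ever quoted and contained a comma. The sketch below shows the same conversion as a testable function built on the standard `csv` module. It is a simplified variant under stated assumptions, not the script itself: the name `split_parallel` is hypothetical, and the split point is computed from the number of data rows rather than from the line counter the script uses.

```python
import csv


def split_parallel(lines, train_fraction=0.85):
    """Return (train_pairs, valid_pairs) of (source, target) sentences."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # Columns 1 and 2 hold underscore-joined tokens; turn them into spaces.
    pairs = [(row[1].strip().replace("_", " "),
              row[2].strip().replace("_", " "))
             for row in reader]
    cut = int(train_fraction * len(pairs))
    return pairs[:cut], pairs[cut:]


# Made-up rows in the layout the script expects: id, source, target.
rows = ["id,zh,en", "0,你_好,hello_there", "1,谢_谢,thank_you"]
train, valid = split_parallel(rows, train_fraction=0.5)
```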
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/requirements.opt.txt
================================================
cffi
torchvision
joblib
librosa
Pillow
git+git://github.com/pytorch/audio.git@d92de5b97fc6204db4b1e3ed20c03ac06f5d53f0
pyrouge
opencv-python
git+https://github.com/NVIDIA/apex
pretrainedmodels
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/server.py
================================================
#!/usr/bin/env python
from onmt.bin.server import main
if __name__ == "__main__":
main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/setup.py
================================================
#!/usr/bin/env python
from setuptools import setup, find_packages
from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
long_description = f.read()
setup(
name='OpenNMT-py',
description='A python implementation of OpenNMT',
long_description=long_description,
long_description_content_type='text/markdown',
version='1.0.0.rc2',
packages=find_packages(),
project_urls={
"Documentation": "http://opennmt.net/OpenNMT-py/",
"Forum": "http://forum.opennmt.net/",
"Gitter": "https://gitter.im/OpenNMT/OpenNMT-py",
"Source": "https://github.com/OpenNMT/OpenNMT-py/"
},
install_requires=[
"six",
"tqdm~=4.30.0",
"torch>=1.2",
"torchtext==0.4.0",
"future",
"configargparse",
"tensorboard>=1.14",
"flask",
"pyonmttok==1.*;platform_system=='Linux'",
],
entry_points={
"console_scripts": [
"onmt_server=onmt.bin.server:main",
"onmt_train=onmt.bin.train:main",
"onmt_translate=onmt.bin.translate:main",
"onmt_preprocess=onmt.bin.preprocess:main",
"onmt_average_models=onmt.bin.average_models:main"
],
}
)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/README.md
================================================
This directory contains scripts and tools adopted from other open source projects such as Apache Joshua and Moses Decoder.
TODO: credit the authors and resolve license issues (if any)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/apply_bpe.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich
# flake8: noqa
"""Use operations learned with learn_bpe.py to encode a new text.
The text will not be smaller, but use only a fixed vocabulary, with rare words
encoded as variable-length sequences of subword units.
Reference:
Rico Sennrich, Barry Haddow and Alexandra Birch (2015). Neural Machine Translation of Rare Words with Subword Units.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
"""
# This file is retrieved from https://github.com/rsennrich/subword-nmt
from __future__ import unicode_literals, division
import sys
import codecs
import io
import argparse
import json
import re
from collections import defaultdict
# hack for python2/3 compatibility
from io import open
argparse.open = open
class BPE(object):
def __init__(self, codes, separator='@@', vocab=None, glossaries=None):
# check version information
firstline = codes.readline()
if firstline.startswith('#version:'):
self.version = tuple([int(x) for x in re.sub(
r'(\.0+)*$', '', firstline.split()[-1]).split(".")])
else:
self.version = (0, 1)
codes.seek(0)
self.bpe_codes = [tuple(item.split()) for item in codes]
# some hacking to deal with duplicates (only consider first instance)
self.bpe_codes = dict(
[(code, i) for (i, code) in reversed(list(enumerate(self.bpe_codes)))])
self.bpe_codes_reverse = dict(
[(pair[0] + pair[1], pair) for pair, i in self.bpe_codes.items()])
self.separator = separator
self.vocab = vocab
self.glossaries = glossaries if glossaries else []
self.cache = {}
def segment(self, sentence):
"""segment single sentence (whitespace-tokenized string) with BPE encoding"""
output = []
for word in sentence.split():
new_word = [out for segment in self._isolate_glossaries(word)
for out in encode(segment,
self.bpe_codes,
self.bpe_codes_reverse,
self.vocab,
self.separator,
self.version,
self.cache,
self.glossaries)]
for item in new_word[:-1]:
output.append(item + self.separator)
output.append(new_word[-1])
return ' '.join(output)
def _isolate_glossaries(self, word):
word_segments = [word]
for gloss in self.glossaries:
word_segments = [out_segments for segment in word_segments
for out_segments in isolate_glossary(segment, gloss)]
return word_segments
def create_parser():
parser = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description="apply BPE-based word segmentation")
parser.add_argument(
'--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
metavar='PATH',
help="Input file (default: standard input).")
parser.add_argument(
'--codes', '-c', type=argparse.FileType('r'), metavar='PATH',
required=True,
help="File with BPE codes (created by learn_bpe.py).")
parser.add_argument(
'--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
metavar='PATH',
help="Output file (default: standard output)")
parser.add_argument(
'--separator', '-s', type=str, default='@@', metavar='STR',
help="Separator between non-final subword units (default: '%(default)s')")
parser.add_argument(
'--vocabulary', type=argparse.FileType('r'), default=None,
metavar="PATH",
help="Vocabulary file (built with get_vocab.py). If provided, this script reverts any merge operations that produce an OOV.")
parser.add_argument(
'--vocabulary-threshold', type=int, default=None,
metavar="INT",
help="Vocabulary threshold. If vocabulary is provided, any word with frequency < threshold will be treated as OOV")
parser.add_argument(
'--glossaries', type=str, nargs='+', default=None,
metavar="STR",
help="Glossaries. The strings provided in glossaries will not be affected " +
"by the BPE (i.e. they will neither be broken into subwords, nor concatenated with other subwords)")
return parser
def get_pairs(word):
"""Return set of symbol pairs in a word.
word is represented as tuple of symbols (symbols being variable-length strings)
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
def encode(orig, bpe_codes, bpe_codes_reverse, vocab, separator, version, cache, glossaries=None):
"""Encode word based on list of BPE merge operations, which are applied consecutively
"""
if orig in cache:
return cache[orig]
if glossaries and orig in glossaries:
cache[orig] = (orig,)
return (orig,)
if version == (0, 1):
word = tuple(orig) + ('</w>',)
elif version == (0, 2): # more consistent handling of word-final segments
word = tuple(orig[:-1]) + (orig[-1] + '</w>',)
else:
raise NotImplementedError
pairs = get_pairs(word)
if not pairs:
return orig
while True:
bigram = min(pairs, key=lambda pair: bpe_codes.get(pair, float('inf')))
if bigram not in bpe_codes:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
new_word.append(first + second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
# don't print end-of-word symbols
if word[-1] == '</w>':
word = word[:-1]
elif word[-1].endswith('</w>'):
word = word[:-1] + (word[-1].replace('</w>', ''),)
if vocab:
word = check_vocab_and_split(word, bpe_codes_reverse, vocab, separator)
cache[orig] = word
return word
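
The merge loop in `encode` can be hard to follow once caching, versions and vocabulary filtering are mixed in. The following is a stripped-down sketch of the core idea only: repeatedly merge the adjacent symbol pair with the best (lowest) rank until no learned pair remains. It assumes the version 0.2 convention of attaching the end-of-word marker `</w>` to the last character; `apply_merges`, `segment` and the toy merge list are hypothetical.

```python
def apply_merges(word, merges):
    """Greedily apply ranked BPE merges to one non-empty word."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    # Attach the end-of-word marker to the final character.
    symbols = tuple(word[:-1]) + (word[-1] + "</w>",)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = tuple(merged)
    # Drop the end-of-word marker before returning.
    return symbols[:-1] + (symbols[-1].replace("</w>", ""),)


def segment(word, merges, separator="@@"):
    """Join subword units, marking every non-final unit with the separator."""
    pieces = apply_merges(word, merges)
    return " ".join([p + separator for p in pieces[:-1]] + [pieces[-1]])


# Toy merge table: first merge "l"+"o", then "lo"+"w</w>".
toy_merges = [("l", "o"), ("lo", "w</w>")]
low = segment("low", toy_merges)
lot = segment("lot", toy_merges)
```

With this table, `low` is merged back into a single unit, while `lot` stops after the first merge and is emitted as two units joined by the separator.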
def recursive_split(segment, bpe_codes, vocab, separator, final=False):
"""Recursively split segment into smaller units (by reversing BPE merges)
until all units are either in-vocabulary, or cannot be split further."""
try:
if final:
left, right = bpe_codes[segment + '</w>']
right = right[:-4]
else:
left, right = bpe_codes[segment]
except:
#sys.stderr.write('cannot split {0} further.\n'.format(segment))
yield segment
return
if left + separator in vocab:
yield left
else:
for item in recursive_split(left, bpe_codes, vocab, separator, False):
yield item
if (final and right in vocab) or (not final and right + separator in vocab):
yield right
else:
for item in recursive_split(right, bpe_codes, vocab, separator, final):
yield item
def check_vocab_and_split(orig, bpe_codes, vocab, separator):
"""Check for each segment in word if it is in-vocabulary,
and segment OOV segments into smaller units by reversing the BPE merge operations"""
out = []
for segment in orig[:-1]:
if segment + separator in vocab:
out.append(segment)
else:
#sys.stderr.write('OOV: {0}\n'.format(segment))
for item in recursive_split(segment, bpe_codes, vocab, separator, False):
out.append(item)
segment = orig[-1]
if segment in vocab:
out.append(segment)
else:
#sys.stderr.write('OOV: {0}\n'.format(segment))
for item in recursive_split(segment, bpe_codes, vocab, separator, True):
out.append(item)
return out
def read_vocabulary(vocab_file, threshold):
"""read vocabulary file produced by get_vocab.py, and filter according to frequency threshold.
"""
vocabulary = set()
for line in vocab_file:
word, freq = line.split()
freq = int(freq)
if threshold is None or freq >= threshold:
vocabulary.add(word)
return vocabulary
def isolate_glossary(word, glossary):
"""
Isolate a glossary present inside a word.
Returns a list of subwords in which all occurrences of 'glossary' are isolated.
For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is:
['1934', 'USA', 'B', 'USA']
"""
if word == glossary or glossary not in word:
return [word]
else:
splits = word.split(glossary)
segments = [segment.strip() for split in splits[:-1]
for segment in [split, glossary] if segment != '']
return segments + [splits[-1].strip()] if splits[-1] != '' else segments
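
The behaviour described in the `isolate_glossary` docstring can also be expressed with a capturing group in `re.split`, which keeps the delimiter in the result. This is an alternative formulation offered for illustration, not the module's code; the name `isolate_with_regex` is hypothetical and, unlike the original, it does not strip whitespace from the pieces.

```python
import re


def isolate_with_regex(word, glossary):
    """Split word so that every occurrence of glossary is its own item."""
    # A capturing group makes re.split keep the matched glossary term.
    parts = re.split("(%s)" % re.escape(glossary), word)
    # Empty strings appear when the term sits at the start or end.
    return [part for part in parts if part != ""]


isolated = isolate_with_regex("1934USABUSA", "USA")
```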
if __name__ == '__main__':
# python 2/3 compatibility
if sys.version_info < (3, 0):
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
else:
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
sys.stdout = io.TextIOWrapper(
sys.stdout.buffer, encoding='utf-8', write_through=True, line_buffering=True)
parser = create_parser()
args = parser.parse_args()
# read/write files as UTF-8
args.codes = codecs.open(args.codes.name, encoding='utf-8')
if args.input.name != '<stdin>':
args.input = codecs.open(args.input.name, encoding='utf-8')
if args.output.name != '<stdout>':
args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
if args.vocabulary:
args.vocabulary = codecs.open(args.vocabulary.name, encoding='utf-8')
if args.vocabulary:
vocabulary = read_vocabulary(
args.vocabulary, args.vocabulary_threshold)
else:
vocabulary = None
bpe = BPE(args.codes, args.separator, vocabulary, args.glossaries)
for line in args.input:
args.output.write(bpe.segment(line).strip())
args.output.write('\n')
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/average_models.py
================================================
#!/usr/bin/env python
from onmt.bin.average_models import main
if __name__ == "__main__":
main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/bpe_pipeline.sh
================================================
#!/usr/bin/env bash
# Author : Thamme Gowda
# Created : Nov 06, 2017
ONMT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
#======= EXPERIMENT SETUP ======
# Activate python environment if needed
source ~/.bashrc
# source activate py3
# update these variables
NAME="run1"
OUT="onmt-runs/$NAME"
DATA="$ONMT/onmt-runs/data"
TRAIN_SRC=$DATA/*train.src
TRAIN_TGT=$DATA/*train.tgt
VALID_SRC=$DATA/*dev.src
VALID_TGT=$DATA/*dev.tgt
TEST_SRC=$DATA/*test.src
TEST_TGT=$DATA/*test.tgt
BPE="" # default
BPE="src" # src, tgt, src+tgt
# applicable only when BPE="src" or "src+tgt"
BPE_SRC_OPS=10000
# applicable only when BPE="tgt" or "src+tgt"
BPE_TGT_OPS=10000
GPUARG="" # default
GPUARG="0"
#====== EXPERIMENT BEGIN ======
# Check if input exists
for f in $TRAIN_SRC $TRAIN_TGT $VALID_SRC $VALID_TGT $TEST_SRC $TEST_TGT; do
if [[ ! -f "$f" ]]; then
echo "Input file $f does not exist. Please fix the paths"
exit 1
fi
done
function lines_check {
# Read from stdin so wc prints only the count, not the file name
l1=$(wc -l < "$1")
l2=$(wc -l < "$2")
if [[ $l1 != $l2 ]]; then
echo "ERROR: Record counts do not match between: $1 and $2"
exit 2
fi
}
lines_check $TRAIN_SRC $TRAIN_TGT
lines_check $VALID_SRC $VALID_TGT
lines_check $TEST_SRC $TEST_TGT
echo "Output dir = $OUT"
[ -d $OUT ] || mkdir -p $OUT
[ -d $OUT/data ] || mkdir -p $OUT/data
[ -d $OUT/models ] || mkdir $OUT/models
[ -d $OUT/test ] || mkdir -p $OUT/test
echo "Step 1a: Preprocess inputs"
if [[ "$BPE" == *"src"* ]]; then
echo "BPE on source"
# Here we could use more monolingual data
$ONMT/tools/learn_bpe.py -s $BPE_SRC_OPS < $TRAIN_SRC > $OUT/data/bpe-codes.src
$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.src < $TRAIN_SRC > $OUT/data/train.src
$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.src < $VALID_SRC > $OUT/data/valid.src
$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.src < $TEST_SRC > $OUT/data/test.src
else
ln -sf $TRAIN_SRC $OUT/data/train.src
ln -sf $VALID_SRC $OUT/data/valid.src
ln -sf $TEST_SRC $OUT/data/test.src
fi
if [[ "$BPE" == *"tgt"* ]]; then
echo "BPE on target"
# Here we could use more monolingual data
$ONMT/tools/learn_bpe.py -s $BPE_TGT_OPS < $TRAIN_TGT > $OUT/data/bpe-codes.tgt
$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.tgt < $TRAIN_TGT > $OUT/data/train.tgt
$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.tgt < $VALID_TGT > $OUT/data/valid.tgt
#$ONMT/tools/apply_bpe.py -c $OUT/data/bpe-codes.tgt < $TEST_TGT > $OUT/data/test.tgt
# We don't touch the test references: no BPE on them!
ln -sf $TEST_TGT $OUT/data/test.tgt
else
ln -sf $TRAIN_TGT $OUT/data/train.tgt
ln -sf $VALID_TGT $OUT/data/valid.tgt
ln -sf $TEST_TGT $OUT/data/test.tgt
fi
#: < maxv) {maxv=score; max=$0}} END{ print max}'`
echo "Chosen Model = $model"
if [[ -z "$model" ]]; then
echo "Model not found. Looked in $OUT/models/"
exit 1
fi
GPU_OPTS=""
if [ ! -z $GPUARG ]; then
GPU_OPTS="-gpu $GPUARG"
fi
echo "Step 3a: Translate Test"
python $ONMT/translate.py -model $model \
-src $OUT/data/test.src \
-output $OUT/test/test.out \
-replace_unk -verbose $GPU_OPTS > $OUT/test/test.log
echo "Step 3b: Translate Dev"
python $ONMT/translate.py -model $model \
-src $OUT/data/valid.src \
-output $OUT/test/valid.out \
-replace_unk -verbose $GPU_OPTS > $OUT/test/valid.log
if [[ "$BPE" == *"tgt"* ]]; then
echo "BPE decoding/detokenising target to match with references"
mv $OUT/test/test.out{,.bpe}
mv $OUT/test/valid.out{,.bpe}
cat $OUT/test/valid.out.bpe | sed -E 's/(@@ )|(@@ ?$)//g' > $OUT/test/valid.out
cat $OUT/test/test.out.bpe | sed -E 's/(@@ )|(@@ ?$)//g' > $OUT/test/test.out
fi
echo "Step 4a: Evaluate Test"
$ONMT/tools/multi-bleu-detok.perl $OUT/data/test.tgt < $OUT/test/test.out > $OUT/test/test.tc.bleu
$ONMT/tools/multi-bleu-detok.perl -lc $OUT/data/test.tgt < $OUT/test/test.out > $OUT/test/test.lc.bleu
echo "Step 4b: Evaluate Dev"
$ONMT/tools/multi-bleu-detok.perl $OUT/data/valid.tgt < $OUT/test/valid.out > $OUT/test/valid.tc.bleu
$ONMT/tools/multi-bleu-detok.perl -lc $OUT/data/valid.tgt < $OUT/test/valid.out > $OUT/test/valid.lc.bleu
#===== EXPERIMENT END ======
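
The BPE decoding step above strips the `@@` joiners with `sed`. When post-processing happens outside the shell, the same substitution can be sketched in Python. `undo_bpe` is a hypothetical helper name, not part of OpenNMT-py, and the sketch assumes the `@@` separator used by the script.

```python
import re

# Same pattern as the sed expression in the script: remove "@@ " joiners
# and a dangling "@@" (optionally followed by one space) at end of line.
_BPE_JOINER = re.compile(r"(@@ )|(@@ ?$)")


def undo_bpe(line):
    """Merge BPE sub-word units back into whole tokens (illustrative helper)."""
    return _BPE_JOINER.sub("", line)


if __name__ == "__main__":
    print(undo_bpe("the new@@ est low@@ er bound"))
```

Applying it line by line to `test.out.bpe` should give the same text the script writes to `test.out`, under the assumptions stated above.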
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/create_vocabulary.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import sys
import os


def read_files_batch(file_list):
    """Reads the provided files in batches"""
    batch = []  # Keep batch for each file
    fd_list = []  # File descriptor list
    failed = False  # Flag used for quitting the program in case of error
    try:
        for filename in file_list:
            fd_list.append(open(filename))

        # zip() stops at the shortest file, so the files are read in lockstep
        for lines in zip(*fd_list):
            for line in lines:
                line = line.rstrip("\n").split(" ")
                batch.append(line)

            yield batch
            batch = []  # Reset batch

    except IOError:
        print("Error reading file " + filename + ".")
        failed = True  # Flag to exit the program

    finally:
        for fd in fd_list:
            fd.close()

        if failed:  # An error occurred, end execution
            sys.exit(-1)


def main():
    parser = argparse.ArgumentParser()
    # 'text' is the documented default, so the option is not marked required
    parser.add_argument('-file_type', default='text',
                        choices=['text', 'field'],
                        help="""Options for vocabulary creation.
                        The default is 'text', where the user passes
                        a corpus or a list of corpus files from which
                        they want to create a vocabulary.
                        If choosing the option 'field', we assume
                        the file passed is a torch file created during
                        the preprocessing stage of an already
                        preprocessed corpus. The vocabulary file created
                        will just be the vocabulary inside the field
                        corresponding to the argument 'side'.""")
    parser.add_argument("-file", type=str, nargs="+", required=True)
    parser.add_argument("-out_file", type=str, required=True)
    parser.add_argument("-side", choices=['src', 'tgt'], help="""Specifies
                        'src' or 'tgt' side for 'field' file_type.""")

    opt = parser.parse_args()

    vocabulary = {}
    if opt.file_type == 'text':
        print("Reading input file...")
        for batch in read_files_batch(opt.file):
            for sentence in batch:
                for w in sentence:
                    if w in vocabulary:
                        vocabulary[w] += 1
                    else:
                        vocabulary[w] = 1

        print("Writing vocabulary file...")
        with open(opt.out_file, "w") as f:
            # Most frequent words first
            for w, count in sorted(vocabulary.items(), key=lambda x: x[1],
                                   reverse=True):
                f.write("{0}\n".format(w))
    else:
        if opt.side not in ['src', 'tgt']:
            raise ValueError("If using -file_type='field', specify "
                             "'src' or 'tgt' argument for -side.")
        import torch
        try:
            from onmt.inputters.inputter import _old_style_vocab
        except ImportError:
            sys.path.insert(1, os.path.join(sys.path[0], '..'))
            from onmt.inputters.inputter import _old_style_vocab

        print("Reading input file...")
        if not len(opt.file) == 1:
            raise ValueError("If using -file_type='field', only pass one "
                             "argument for -file.")
        vocabs = torch.load(opt.file[0])
        voc = dict(vocabs)[opt.side]
        if _old_style_vocab(voc):
            word_list = voc.itos
        else:
            try:
                word_list = voc[0][1].base_field.vocab.itos
            except AttributeError:
                word_list = voc[0][1].vocab.itos

        print("Writing vocabulary file...")
        with open(opt.out_file, "wb") as f:
            for w in word_list:
                f.write(u"{0}\n".format(w).encode("utf-8"))


if __name__ == "__main__":
    main()
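
For the `text` mode, the counting and sorting done by `create_vocabulary.py` can be condensed with `collections.Counter`. The following is only a sketch: `build_vocabulary` is a hypothetical name, and it works on an in-memory list of lines instead of the files the script opens.

```python
from collections import Counter


def build_vocabulary(lines):
    """Return the words of `lines`, most frequent first (illustrative helper)."""
    counts = Counter()
    for line in lines:
        # Same tokenisation as the script: strip the newline, split on single spaces.
        counts.update(line.rstrip("\n").split(" "))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked]


if __name__ == "__main__":
    for word in build_vocabulary(["a b a\n", "b a c\n"]):
        print(word)
```

Writing one word per line, as the script does, yields a file with the most frequent word on the first line.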
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/detokenize.perl
================================================
#!/usr/bin/env perl
# Note: retrieved from https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/detokenize.pl
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
use warnings;
use strict;
# Sample De-Tokenizer
# written by Josh Schroeder, based on code by Philipp Koehn
# modified later by ByungGyu Ahn, bahn@cs.jhu.edu, Luke Orland
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
my $language = "en";
my $QUIET = 1;
my $HELP = 0;
while (@ARGV) {
$_ = shift;
/^-l$/ && ($language = shift, next);
/^-v$/ && ($QUIET = 0, next);
/^-h$/ && ($HELP = 1, next);
}
if ($HELP) {
print "Usage ./detokenize.perl (-l [en|de|...]) < tokenizedfile > detokenizedfile\n";
exit;
}
if (!$QUIET) {
print STDERR "Detokenizer Version 1.1\n";
print STDERR "Language: $language\n";
}
while(<STDIN>) {
if (/^<.+>$/ || /^\s*$/) {
#don't try to detokenize XML/HTML tag lines
print $_;
}
else {
print &detokenize($_);
}
}
sub detokenize {
my($text) = @_;
chomp($text);
$text = " $text ";
# convert curly quotes to ASCII e.g. ‘“”’
$text =~ s/\x{2018}/'/gs;
$text =~ s/\x{2019}/'/gs;
$text =~ s/\x{201c}/"/gs;
$text =~ s/\x{201d}/"/gs;
$text =~ s/\x{e2}\x{80}\x{98}/'/gs;
$text =~ s/\x{e2}\x{80}\x{99}/'/gs;
$text =~ s/\x{e2}\x{80}\x{9c}/"/gs;
$text =~ s/\x{e2}\x{80}\x{9d}/"/gs;
$text =~ s/ '\s+' / " /g;
$text =~ s/ ` / ' /g;
$text =~ s/ ' / ' /g;
$text =~ s/ `` / " /g;
$text =~ s/ '' / " /g;
# replace the pipe character, which is
# a special reserved character in Moses
$text =~ s/ -PIPE- / \| /g;
$text =~ s/ -LRB- / \( /g;
$text =~ s/ -RRB- / \) /g;
$text =~ s/ -LSB- / \[ /g;
$text =~ s/ -RSB- / \] /g;
$text =~ s/ -LCB- / \{ /g;
$text =~ s/ -RCB- / \} /g;
$text =~ s/ -lrb- / \( /g;
$text =~ s/ -rrb- / \) /g;
$text =~ s/ -lsb- / \[ /g;
$text =~ s/ -rsb- / \] /g;
$text =~ s/ -lcb- / \{ /g;
$text =~ s/ -rcb- / \} /g;
$text =~ s/ 'll /'ll /g;
$text =~ s/ 're /'re /g;
$text =~ s/ 've /'ve /g;
$text =~ s/ n't /n't /g;
$text =~ s/ 'LL /'LL /g;
$text =~ s/ 'RE /'RE /g;
$text =~ s/ 'VE /'VE /g;
$text =~ s/ N'T /N'T /g;
$text =~ s/ can not / cannot /g;
$text =~ s/ Can not / Cannot /g;
# just in case the contraction was not properly treated
$text =~ s/ ' ll /'ll /g;
$text =~ s/ ' re /'re /g;
$text =~ s/ ' ve /'ve /g;
$text =~ s/n ' t /n't /g;
$text =~ s/ ' LL /'LL /g;
$text =~ s/ ' RE /'RE /g;
$text =~ s/ ' VE /'VE /g;
$text =~ s/N ' T /N'T /g;
my $word;
my $i;
my @words = split(/ /,$text);
$text = "";
my %quoteCount = ("\'"=>0,"\""=>0);
my $prependSpace = " ";
for ($i=0;$i<(scalar(@words));$i++) {
if ($words[$i] =~ /^[\p{IsSc}]+$/) {
#perform shift on currency
if (($i<(scalar(@words)-1)) && ($words[$i+1] =~ /^[0-9]/)) {
$text = $text.$prependSpace.$words[$i];
$prependSpace = "";
} else {
$text=$text.$words[$i];
$prependSpace = " ";
}
} elsif ($words[$i] =~ /^[\(\[\{\¿\¡]+$/) {
#perform right shift on random punctuation items
$text = $text.$prependSpace.$words[$i];
$prependSpace = "";
} elsif ($words[$i] =~ /^[\,\.\?\!\:\;\\\%\}\]\)]+$/){
#perform left shift on punctuation items
$text=$text.$words[$i];
$prependSpace = " ";
} elsif (($language eq "en") && ($i>0) && ($words[$i] =~ /^[\'][\p{IsAlpha}]/) && ($words[$i-1] =~ /[\p{IsAlnum}]$/)) {
#left-shift the contraction for English
$text=$text.$words[$i];
$prependSpace = " ";
} elsif (($language eq "en") && ($i>0) && ($i<(scalar(@words)-1)) && ($words[$i] eq "&") && ($words[$i-1] =~ /^[A-Z]$/) && ($words[$i+1] =~ /^[A-Z]$/)) {
#some contraction with an ampersand e.g. "R&D"
$text .= $words[$i];
$prependSpace = "";
} elsif (($language eq "fr") && ($i<(scalar(@words)-1)) && ($words[$i] =~ /[\p{IsAlpha}][\']$/) && ($words[$i+1] =~ /^[\p{IsAlpha}]/)) {
#right-shift the contraction for French
$text = $text.$prependSpace.$words[$i];
$prependSpace = "";
} elsif ($words[$i] =~ /^[\'\"]+$/) {
#combine punctuation smartly
if (($quoteCount{$words[$i]} % 2) eq 0) {
if(($language eq "en") && ($words[$i] eq "'") && ($i > 0) && ($words[$i-1] =~ /[s]$/)) {
#single quote for possessives ending in s... "The Jones' house"
#left shift
$text=$text.$words[$i];
$prependSpace = " ";
} elsif (($language eq "en") && ($words[$i] eq "'") && ($i < (scalar(@words)-1)) && ($words[$i+1] eq "s")) {
#single quote for possessive construction. "John's"
$text .= $words[$i];
$prependSpace = "";
} elsif (($quoteCount{$words[$i]} == 0) &&
($language eq "en") && ($words[$i] eq '"') && ($i>1) && ($words[$i-1] =~ /^[,.]$/) && ($words[$i-2] ne "said")) {
#emergency case in which the opening quote is missing
#ending double quote for direct quotes. e.g. Blah," he said. but not like he said, "Blah.
$text .= $words[$i];
$prependSpace = " ";
} elsif (($language eq "en") && ($words[$i] eq '"') && ($i < (scalar(@words)-1)) && ($words[$i+1] =~ /^[,.]$/)) {
$text .= $words[$i];
$prependSpace = " ";
} else {
#right shift
$text = $text.$prependSpace.$words[$i];
$prependSpace = "";
$quoteCount{$words[$i]} = $quoteCount{$words[$i]} + 1;
}
} else {
#left shift
$text=$text.$words[$i];
$prependSpace = " ";
$quoteCount{$words[$i]} = $quoteCount{$words[$i]} + 1;
}
} else {
$text=$text.$prependSpace.$words[$i];
$prependSpace = " ";
}
}
#clean continuing spaces
$text =~ s/ +/ /g;
#delete spaces around double angle brackets «»
# Uh-oh. not a good idea. it is not consistent.
$text =~ s/(\x{c2}\x{ab}|\x{ab}) /$1/g;
$text =~ s/ (\x{c2}\x{bb}|\x{bb})/$1/g;
# delete spaces around all other special characters
# Uh-oh. not a good idea. "Men&Women"
#$text =~ s/ ([^\p{IsAlnum}\s\.\'\`\,\-\"\|]) /$1/g;
$text =~ s/ \/ /\//g;
# clean up spaces at head and tail of each line as well as any double-spacing
$text =~ s/\n /\n/g;
$text =~ s/ \n/\n/g;
$text =~ s/^ //g;
$text =~ s/ $//g;
#add trailing break
$text .= "\n" unless $text =~ /\n$/;
return $text;
}
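
The core of `detokenize` is the `$prependSpace` state: closing punctuation is attached to the text on its left, and opening brackets are attached to the token on their right. Below is a minimal Python sketch of only those two rules. `detokenize_simple` is a hypothetical name, and quotes, contractions, currency symbols and the language-specific branches are deliberately left out.

```python
OPENERS = set("([{¿¡")          # attach to the following token
CLOSERS = set(",.?!:;\\%}])")   # attach to the preceding text


def detokenize_simple(tokens):
    """Join tokens using only the left-shift/right-shift punctuation rules."""
    text = ""
    prepend = ""  # separator to emit before the next ordinary token
    for tok in tokens:
        if tok and all(ch in OPENERS for ch in tok):
            text += prepend + tok
            prepend = ""      # no space after an opening bracket
        elif tok and all(ch in CLOSERS for ch in tok):
            text += tok       # no space before closing punctuation
            prepend = " "
        else:
            text += prepend + tok
            prepend = " "
    return text


if __name__ == "__main__":
    print(detokenize_simple("Hello , world ( again ) !".split()))
```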
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/embeddings_to_torch.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division
import six
import argparse
import torch
from onmt.utils.logging import init_logger, logger
from onmt.inputters.inputter import _old_style_vocab


def get_vocabs(dict_path):
    fields = torch.load(dict_path)

    vocs = []
    for side in ['src', 'tgt']:
        if _old_style_vocab(fields):
            vocab = next((v for n, v in fields if n == side), None)
        else:
            try:
                vocab = fields[side].base_field.vocab
            except AttributeError:
                vocab = fields[side].vocab
        vocs.append(vocab)
    enc_vocab, dec_vocab = vocs

    logger.info("From: %s" % dict_path)
    logger.info("\t* source vocab: %d words" % len(enc_vocab))
    logger.info("\t* target vocab: %d words" % len(dec_vocab))

    return enc_vocab, dec_vocab


def read_embeddings(file_enc, skip_lines=0, filter_set=None):
    embs = dict()
    total_vectors_in_file = 0
    with open(file_enc, 'rb') as f:
        for i, line in enumerate(f):
            if i < skip_lines:
                continue
            if not line:
                break

            l_split = line.decode('utf8').strip().split(' ')
            # A two-field line is a header (vector count and dimension), not a vector
            if len(l_split) == 2:
                continue
            total_vectors_in_file += 1
            if filter_set is not None and l_split[0] not in filter_set:
                continue
            embs[l_split[0]] = [float(em) for em in l_split[1:]]
    return embs, total_vectors_in_file


def convert_to_torch_tensor(word_to_float_list_dict, vocab):
    dim = len(six.next(six.itervalues(word_to_float_list_dict)))
    # Words without a pretrained vector keep an all-zero row
    tensor = torch.zeros((len(vocab), dim))
    for word, values in word_to_float_list_dict.items():
        tensor[vocab.stoi[word]] = torch.Tensor(values)
    return tensor


def calc_vocab_load_stats(vocab, loaded_embed_dict):
    matching_count = len(
        set(vocab.stoi.keys()) & set(loaded_embed_dict.keys()))
    missing_count = len(vocab) - matching_count
    percent_matching = matching_count / len(vocab) * 100
    return matching_count, missing_count, percent_matching


def main():
    parser = argparse.ArgumentParser(description='embeddings_to_torch.py')
    parser.add_argument('-emb_file_both', required=False,
                        help="loads Embeddings for both source and target "
                             "from this file.")
    parser.add_argument('-emb_file_enc', required=False,
                        help="source Embeddings from this file")
    parser.add_argument('-emb_file_dec', required=False,
                        help="target Embeddings from this file")
    parser.add_argument('-output_file', required=True,
                        help="Output file for the prepared data")
    parser.add_argument('-dict_file', required=True,
                        help="Dictionary file")
    parser.add_argument('-verbose', action="store_true", default=False)
    parser.add_argument('-skip_lines', type=int, default=0,
                        help="Skip first lines of the embedding file")
    parser.add_argument('-type', choices=["GloVe", "word2vec"],
                        default="GloVe")
    opt = parser.parse_args()

    enc_vocab, dec_vocab = get_vocabs(opt.dict_file)

    # Read in embeddings
    skip_lines = 1 if opt.type == "word2vec" else opt.skip_lines
    if opt.emb_file_both is not None:
        if opt.emb_file_enc is not None:
            raise ValueError("If -emb_file_both is passed in, you should not "
                             "set -emb_file_enc.")
        if opt.emb_file_dec is not None:
            raise ValueError("If -emb_file_both is passed in, you should not "
                             "set -emb_file_dec.")
        set_of_src_and_tgt_vocab = \
            set(enc_vocab.stoi.keys()) | set(dec_vocab.stoi.keys())
        logger.info("Reading encoder and decoder embeddings from {}".format(
            opt.emb_file_both))
        src_vectors, total_vec_count = \
            read_embeddings(opt.emb_file_both, skip_lines,
                            set_of_src_and_tgt_vocab)
        tgt_vectors = src_vectors
        logger.info("\tFound {} total vectors in file".format(total_vec_count))
    else:
        if opt.emb_file_enc is None:
            raise ValueError("-emb_file_enc not provided. Please specify "
                             "the file with encoder embeddings, or pass in "
                             "-emb_file_both")
        if opt.emb_file_dec is None:
            raise ValueError("-emb_file_dec not provided. Please specify "
                             "the file with decoder embeddings, or pass in "
                             "-emb_file_both")
        logger.info("Reading encoder embeddings from {}".format(
            opt.emb_file_enc))
        src_vectors, total_vec_count = read_embeddings(
            opt.emb_file_enc, skip_lines,
            filter_set=enc_vocab.stoi
        )
        logger.info("\tFound {} total vectors in file.".format(
            total_vec_count))
        logger.info("Reading decoder embeddings from {}".format(
            opt.emb_file_dec))
        tgt_vectors, total_vec_count = read_embeddings(
            opt.emb_file_dec, skip_lines,
            filter_set=dec_vocab.stoi
        )
        logger.info("\tFound {} total vectors in file".format(total_vec_count))
    logger.info("After filtering to vectors in vocab:")
    logger.info("\t* enc: %d match, %d missing, (%.2f%%)"
                % calc_vocab_load_stats(enc_vocab, src_vectors))
    logger.info("\t* dec: %d match, %d missing, (%.2f%%)"
                % calc_vocab_load_stats(dec_vocab, tgt_vectors))

    # Write to file
    enc_output_file = opt.output_file + ".enc.pt"
    dec_output_file = opt.output_file + ".dec.pt"
    logger.info("\nSaving embedding as:\n\t* enc: %s\n\t* dec: %s"
                % (enc_output_file, dec_output_file))
    torch.save(
        convert_to_torch_tensor(src_vectors, enc_vocab),
        enc_output_file
    )
    torch.save(
        convert_to_torch_tensor(tgt_vectors, dec_vocab),
        dec_output_file
    )
    logger.info("\nDone.")


if __name__ == "__main__":
    init_logger('embeddings_to_torch.log')
    main()
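
The data flow of `embeddings_to_torch.py` (filter the pretrained vectors by vocabulary, then place each vector in the row given by the word's index) can be illustrated without `torch`. This sketch uses plain lists in place of tensors; `load_vectors`, `to_matrix` and the toy `stoi` mapping are assumptions made for the example.

```python
def load_vectors(lines, stoi, skip_lines=0):
    """Parse 'word v1 v2 ...' lines, keeping only words present in `stoi`."""
    embs = {}
    for i, line in enumerate(lines):
        if i < skip_lines:
            continue
        parts = line.strip().split(" ")
        if len(parts) == 2:  # header line: "<count> <dim>"
            continue
        if parts[0] in stoi:
            embs[parts[0]] = [float(x) for x in parts[1:]]
    return embs


def to_matrix(embs, stoi):
    """Build a len(stoi) x dim matrix; words without a vector stay all-zero."""
    dim = len(next(iter(embs.values())))
    matrix = [[0.0] * dim for _ in range(len(stoi))]
    for word, vec in embs.items():
        matrix[stoi[word]] = vec
    return matrix


if __name__ == "__main__":
    stoi = {"<unk>": 0, "cat": 1, "fish": 2}
    lines = ["2 3", "cat 0.1 0.2 0.3", "dog 1 2 3"]
    print(to_matrix(load_vectors(lines, stoi), stoi))
```

Here `dog` is dropped because it is not in the vocabulary, and `fish` keeps a zero row because the file has no vector for it. These are the cases that `calc_vocab_load_stats` reports as matching and missing.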
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/extract_embeddings.py
================================================
import argparse
import torch
import onmt
import onmt.model_builder
import onmt.inputters as inputters
import onmt.opts
from onmt.utils.misc import use_gpu
from onmt.utils.logging import init_logger, logger

parser = argparse.ArgumentParser(description='extract_embeddings.py')
parser.add_argument('-model', required=True,
                    help='Path to model .pt file')
parser.add_argument('-output_dir', default='.',
                    help="""Path to output the embeddings""")
parser.add_argument('-gpu', type=int, default=-1,
                    help="Device to run on")


def write_embeddings(filename, vocab, embeddings):
    # One line per word: the word followed by its space-separated vector values
    with open(filename, 'wb') as f:
        for i in range(min(len(embeddings), len(vocab.itos))):
            line = vocab.itos[i].encode("utf-8")
            for j in range(len(embeddings[0])):
                line = line + (" %5f" % (embeddings[i][j])).encode("utf-8")
            f.write(line + b"\n")


def main():
    dummy_parser = argparse.ArgumentParser(description='train.py')
    onmt.opts.model_opts(dummy_parser)
    dummy_opt = dummy_parser.parse_known_args([])[0]
    opt = parser.parse_args()
    opt.cuda = opt.gpu > -1
    if opt.cuda:
        torch.cuda.set_device(opt.gpu)

    checkpoint = torch.load(opt.model,
                            map_location=lambda storage, loc: storage)
    model_opt = checkpoint['opt']

    vocab = checkpoint['vocab']
    if inputters.old_style_vocab(vocab):
        fields = onmt.inputters.load_old_vocab(vocab)
    else:
        fields = vocab
    src_dict = fields['src'].base_field.vocab  # assumes src is text
    tgt_dict = fields['tgt'].base_field.vocab

    # Add in default model arguments, possibly added since training.
    for arg in dummy_opt.__dict__:
        if arg not in model_opt:
            model_opt.__dict__[arg] = dummy_opt.__dict__[arg]

    model = onmt.model_builder.build_base_model(
        model_opt, fields, use_gpu(opt), checkpoint)
    encoder = model.encoder
    decoder = model.decoder

    encoder_embeddings = encoder.embeddings.word_lut.weight.data.tolist()
    decoder_embeddings = decoder.embeddings.word_lut.weight.data.tolist()

    logger.info("Writing source embeddings")
    write_embeddings(opt.output_dir + "/src_embeddings.txt", src_dict,
                     encoder_embeddings)

    logger.info("Writing target embeddings")
    write_embeddings(opt.output_dir + "/tgt_embeddings.txt", tgt_dict,
                     decoder_embeddings)

    logger.info('... done.')


if __name__ == "__main__":
    init_logger('extract_embeddings.log')
    main()
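
Each line that `write_embeddings` emits is the word followed by its vector values formatted with `" %5f"`, which prints six decimal places. A small sketch of that line format and of reading it back; both function names are hypothetical.

```python
def format_embedding_line(word, vector):
    """Format one output line the way write_embeddings does (illustrative)."""
    return word + "".join(" %5f" % value for value in vector)


def parse_embedding_line(line):
    """Inverse of format_embedding_line for words that contain no spaces."""
    word, *values = line.rstrip("\n").split(" ")
    return word, [float(v) for v in values]


if __name__ == "__main__":
    line = format_embedding_line("cat", [0.5, -1.25])
    print(line)
    print(parse_embedding_line(line))
```

Because the output is plain text of this shape, it has the same "word then values" layout that `embeddings_to_torch.py` reads.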
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/learn_bpe.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich
# flake8: noqa

"""Use byte pair encoding (BPE) to learn a variable-length encoding of the vocabulary in a text.
Unlike the original BPE, it does not compress the plain text, but can be used to reduce the vocabulary
of a text to a configurable number of symbols, with only a small increase in the number of tokens.

Reference:
Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
"""
# This file is retrieved from https://github.com/rsennrich/subword-nmt

from __future__ import unicode_literals

import sys
import codecs
import re
import copy
import argparse
from collections import defaultdict, Counter

# hack for python2/3 compatibility
from io import open
argparse.open = open


def create_parser():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description="learn BPE-based word segmentation")

    parser.add_argument(
        '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
        metavar='PATH',
        help="Input text (default: standard input).")

    parser.add_argument(
        '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
        metavar='PATH',
        help="Output file for BPE codes (default: standard output)")
    parser.add_argument(
        '--symbols', '-s', type=int, default=10000,
        help="Create this many new symbols (each representing a character n-gram) (default: %(default)s)")
    parser.add_argument(
        '--min-frequency', type=int, default=2, metavar='FREQ',
        help='Stop if no symbol pair has frequency >= FREQ (default: %(default)s)')
    parser.add_argument('--dict-input', action="store_true",
                        help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
    parser.add_argument(
        '--verbose', '-v', action="store_true",
        help="verbose mode.")

    return parser


def get_vocabulary(fobj, is_dict=False):
    """Read text and return dictionary that encodes vocabulary
    """
    vocab = Counter()
    for line in fobj:
        if is_dict:
            word, count = line.strip().split()
            vocab[word] = int(count)
        else:
            for word in line.split():
                vocab[word] += 1
    return vocab


def update_pair_statistics(pair, changed, stats, indices):
    """Minimally update the indices and frequency of symbol pairs

    if we merge a pair of symbols, only pairs that overlap with occurrences
    of this pair are affected, and need to be updated.
    """
    stats[pair] = 0
    indices[pair] = defaultdict(int)
    first, second = pair
    new_pair = first + second
    for j, word, old_word, freq in changed:

        # find all instances of pair, and update frequency/indices around it
        i = 0
        while True:
            # find first symbol
            try:
                i = old_word.index(first, i)
            except ValueError:
                break
            # if first symbol is followed by second symbol, we've found an occurrence of pair (old_word[i:i+2])
            if i < len(old_word) - 1 and old_word[i + 1] == second:
                # assuming a symbol sequence "A B C", if "B C" is merged, reduce the frequency of "A B"
                if i:
                    prev = old_word[i - 1:i + 1]
                    stats[prev] -= freq
                    indices[prev][j] -= 1
                if i < len(old_word) - 2:
                    # assuming a symbol sequence "A B C B", if "B C" is merged, reduce the frequency of "C B".
                    # however, skip this if the sequence is A B C B C, because the frequency of "C B" will be reduced by the previous code block
                    if old_word[i + 2] != first or i >= len(old_word) - 3 or old_word[i + 3] != second:
                        nex = old_word[i + 1:i + 3]
                        stats[nex] -= freq
                        indices[nex][j] -= 1
                i += 2
            else:
                i += 1

        i = 0
        while True:
            try:
                # find new pair
                i = word.index(new_pair, i)
            except ValueError:
                break
            # assuming a symbol sequence "A BC D", if "B C" is merged, increase the frequency of "A BC"
            if i:
                prev = word[i - 1:i + 1]
                stats[prev] += freq
                indices[prev][j] += 1
            # assuming a symbol sequence "A BC B", if "B C" is merged, increase the frequency of "BC B"
            # however, if the sequence is A BC BC, skip this step because the count of "BC BC" will be incremented by the previous code block
            if i < len(word) - 1 and word[i + 1] != new_pair:
                nex = word[i:i + 2]
                stats[nex] += freq
                indices[nex][j] += 1
            i += 1


def get_pair_statistics(vocab):
    """Count frequency of all symbol pairs, and create index"""

    # data structure of pair frequencies
    stats = defaultdict(int)

    # index from pairs to words
    indices = defaultdict(lambda: defaultdict(int))

    for i, (word, freq) in enumerate(vocab):
        prev_char = word[0]
        for char in word[1:]:
            stats[prev_char, char] += freq
            indices[prev_char, char][i] += 1
            prev_char = char

    return stats, indices
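def replace_pair(pair, vocab, indices):
    """Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'"""
    first, second = pair
    pair_str = ''.join(pair)
    pair_str = pair_str.replace('\\', '\\\\')
    changes = []
    pattern = re.compile(
        r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
    if sys.version_info < (3, 0):
        iterator = indices[pair].iteritems()
    else:
        iterator = indices[pair].items()
    for j, freq in iterator:
        if freq < 1:
            continue
        word, freq = vocab[j]
        new_word = ' '.join(word)
        new_word = pattern.sub(pair_str, new_word)
        new_word = tuple(new_word.split())

        vocab[j] = (new_word, freq)
        changes.append((j, new_word, word, freq))

    return changes


def prune_stats(stats, big_stats, threshold):
    """Prune statistics dict for efficiency of max()

    The frequency of a symbol pair never increases, so pruning is generally safe
    (until the most frequent pair is less frequent than a pair we previously pruned).
    big_stats keeps full statistics for the situation where pruning leads to missing the best pair.
    """
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq


def main(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
    """Learn num_symbols BPE operations from vocabulary, and write to outfile.
    """

    # version 0.2 changes the handling of the end-of-word token ('</w>');
    # version numbering allows backward compatibility
    outfile.write('#version: 0.2\n')

    vocab = get_vocabulary(infile, is_dict)
    vocab = dict([(tuple(x[:-1]) + (x[-1] + '</w>',), y)
                  for (x, y) in vocab.items()])
    sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

    stats, indices = get_pair_statistics(sorted_vocab)
    big_stats = copy.deepcopy(stats)
    # threshold is inspired by Zipfian assumption, but should only affect speed
    threshold = max(stats.values()) / 10
    for i in range(num_symbols):
        if stats:
            most_frequent = max(stats, key=lambda x: (stats[x], x))

        # we probably missed the best pair because of pruning; go back to full statistics
        if not stats or (i and stats[most_frequent] < threshold):
            prune_stats(stats, big_stats, threshold)
            stats = copy.deepcopy(big_stats)
            most_frequent = max(stats, key=lambda x: (stats[x], x))
            # threshold is inspired by Zipfian assumption, but should only affect speed
            threshold = stats[most_frequent] * i / (i + 10000.0)
            prune_stats(stats, big_stats, threshold)

        if stats[most_frequent] < min_frequency:
            sys.stderr.write(
                'no pair has frequency >= {0}. Stopping\n'.format(min_frequency))
            break

        if verbose:
            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(
                i, most_frequent[0], most_frequent[1], stats[most_frequent]))
        outfile.write('{0} {1}\n'.format(*most_frequent))
        changes = replace_pair(most_frequent, sorted_vocab, indices)
        update_pair_statistics(most_frequent, changes, stats, indices)
        stats[most_frequent] = 0
        if not i % 100:
            prune_stats(stats, big_stats, threshold)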
if __name__ == '__main__':

    # python 2/3 compatibility
    if sys.version_info < (3, 0):
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
    else:
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr.buffer)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout.buffer)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin.buffer)

    parser = create_parser()
    args = parser.parse_args()

    # read/write files as UTF-8
    if args.input.name != '<stdin>':
        args.input = codecs.open(args.input.name, encoding='utf-8')
    if args.output.name != '<stdout>':
        args.output = codecs.open(args.output.name, 'w', encoding='utf-8')

    main(args.input, args.output, args.symbols,
         args.min_frequency, args.verbose, is_dict=args.dict_input)
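
`learn_bpe.py` keeps incremental pair statistics and prunes them for speed. The underlying procedure is simpler: repeatedly merge the most frequent adjacent symbol pair. Below is a naive sketch that recounts all pairs on every round. `learn_bpe_naive` is a hypothetical name; the sketch assumes non-empty words, appends `</w>` to the last character of each word, and breaks ties on the pair itself, as the `max(..., key=lambda x: (stats[x], x))` call in the script does.

```python
from collections import Counter


def learn_bpe_naive(word_counts, num_merges, min_frequency=2):
    """Return the list of merged pairs, most frequent first (naive recount)."""
    # Split each word into characters; mark the end of word on the last one.
    vocab = {tuple(w[:-1]) + (w[-1] + "</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += count
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < min_frequency:
            break
        merges.append(best)
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + count
        vocab = merged_vocab
    return merges


if __name__ == "__main__":
    counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    for first, second in learn_bpe_naive(counts, 3):
        print(first, second)
```

The recount makes each round linear in the vocabulary size, which is the cost that the index and pruning structures in the script are designed to avoid.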
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/multi-bleu-detok.perl
================================================
#!/usr/bin/env perl
#
# This file is part of moses. Its use is licensed under the GNU Lesser General
# Public License version 2.1 or, at your option, any later version.
# This file uses the internal tokenization of mteval-v13a.pl,
# giving the exact same (case-sensitive) results on untokenized text.
# Using this script with detokenized output and untokenized references is
# preferable over multi-bleu.perl, since scores aren't affected by tokenization differences.
#
# like multi-bleu.perl , it supports plain text input and multiple references.
# This file is retrieved from Moses Decoder :: https://github.com/moses-smt/mosesdecoder
# $Id$
use warnings;
use strict;
my $lowercase = 0;
if ($ARGV[0] eq "-lc") {
$lowercase = 1;
shift;
}
my $stem = $ARGV[0];
if (!defined $stem) {
print STDERR "usage: multi-bleu-detok.pl [-lc] reference < hypothesis\n";
print STDERR "Reads the references from reference or reference0, reference1, ...\n";
exit(1);
}
$stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0";
my @REF;
my $ref=0;
while(-e "$stem$ref") {
&add_to_ref("$stem$ref",\@REF);
$ref++;
}
&add_to_ref($stem,\@REF) if -e $stem;
die("ERROR: could not find reference file $stem") unless scalar @REF;
# add additional references explicitly specified on the command line
shift;
foreach my $stem (@ARGV) {
&add_to_ref($stem,\@REF) if -e $stem;
}
sub add_to_ref {
my ($file,$REF) = @_;
my $s=0;
if ($file =~ /\.gz$/) {
open(REF,"gzip -dc $file|") or die "Can't read $file";
} else {
open(REF,$file) or die "Can't read $file";
}
while(<REF>) {
chop;
$_ = tokenization($_);
push @{$$REF[$s++]}, $_;
}
close(REF);
}
my(@CORRECT,@TOTAL,$length_translation,$length_reference);
my $s=0;
while(<STDIN>) {
chop;
$_ = lc if $lowercase;
$_ = tokenization($_);
my @WORD = split;
my %REF_NGRAM = ();
my $length_translation_this_sentence = scalar(@WORD);
my ($closest_diff,$closest_length) = (9999,9999);
foreach my $reference (@{$REF[$s]}) {
# print "$s $_ <=> $reference\n";
$reference = lc($reference) if $lowercase;
my @WORD = split(' ',$reference);
my $length = scalar(@WORD);
my $diff = abs($length_translation_this_sentence-$length);
if ($diff < $closest_diff) {
$closest_diff = $diff;
$closest_length = $length;
# print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n";
} elsif ($diff == $closest_diff) {
$closest_length = $length if $length < $closest_length;
# from two references with the same closeness to me
# take the *shorter* into account, not the "first" one.
}
for(my $n=1;$n<=4;$n++) {
my %REF_NGRAM_N = ();
for(my $start=0;$start<=$#WORD-($n-1);$start++) {
my $ngram = "$n";
for(my $w=0;$w<$n;$w++) {
$ngram .= " ".$WORD[$start+$w];
}
$REF_NGRAM_N{$ngram}++;
}
foreach my $ngram (keys %REF_NGRAM_N) {
if (!defined($REF_NGRAM{$ngram}) ||
$REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
$REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
# print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram} \n";
}
}
}
}
$length_translation += $length_translation_this_sentence;
$length_reference += $closest_length;
for(my $n=1;$n<=4;$n++) {
my %T_NGRAM = ();
for(my $start=0;$start<=$#WORD-($n-1);$start++) {
my $ngram = "$n";
for(my $w=0;$w<$n;$w++) {
$ngram .= " ".$WORD[$start+$w];
}
$T_NGRAM{$ngram}++;
}
foreach my $ngram (keys %T_NGRAM) {
$ngram =~ /^(\d+) /;
my $n = $1;
# my $corr = 0;
# print "$i e $ngram $T_NGRAM{$ngram} \n";
$TOTAL[$n] += $T_NGRAM{$ngram};
if (defined($REF_NGRAM{$ngram})) {
if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
$CORRECT[$n] += $T_NGRAM{$ngram};
# $corr = $T_NGRAM{$ngram};
# print "$i e correct1 $T_NGRAM{$ngram} \n";
}
else {
$CORRECT[$n] += $REF_NGRAM{$ngram};
# $corr = $REF_NGRAM{$ngram};
# print "$i e correct2 $REF_NGRAM{$ngram} \n";
}
}
# $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram};
# print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n"
}
}
$s++;
}
my $brevity_penalty = 1;
my $bleu = 0;
my @bleu=();
for(my $n=1;$n<=4;$n++) {
if (defined ($TOTAL[$n])){
$bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
# print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
}else{
$bleu[$n]=0;
}
}
if ($length_reference==0){
printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n";
exit(1);
}
if ($length_translation<$length_reference) {
$brevity_penalty = exp(1-$length_reference/$length_translation);
}
$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
my_log( $bleu[2] ) +
my_log( $bleu[3] ) +
my_log( $bleu[4] ) ) / 4) ;
printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n",
100*$bleu,
100*$bleu[1],
100*$bleu[2],
100*$bleu[3],
100*$bleu[4],
$brevity_penalty,
$length_translation / $length_reference,
$length_translation,
$length_reference;
sub my_log {
return -9999999999 unless $_[0];
return log($_[0]);
}
sub tokenization
{
my ($norm_text) = @_;
# language-independent part:
$norm_text =~ s/<skipped>//g; # strip "skipped" tags
$norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines
$norm_text =~ s/\n/ /g; # join lines
$norm_text =~ s/&quot;/"/g; # convert SGML tag for quote to "
$norm_text =~ s/&amp;/&/g; # convert SGML tag for ampersand to &
$norm_text =~ s/&lt;/</g; # convert SGML tag for less-than to <
$norm_text =~ s/&gt;/>/g; # convert SGML tag for greater-than to >
# language-dependent part (assuming Western languages):
$norm_text = " $norm_text ";
$norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g; # tokenize punctuation
$norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
$norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
$norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
$norm_text =~ s/\s+/ /g; # one space only between words
$norm_text =~ s/^\s+//; # no leading space
$norm_text =~ s/\s+$//; # no trailing space
return $norm_text;
}
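
The final score in `multi-bleu-detok.perl` combines the four n-gram precisions with a brevity penalty. The sketch below reproduces only that last step from counts that are assumed to be already clipped and accumulated. `bleu_from_counts` is a hypothetical name, and a zero precision is mapped straight to a score of 0, which is what the large negative value returned by `my_log` amounts to in practice.

```python
import math


def bleu_from_counts(correct, total, hyp_len, ref_len):
    """BLEU in [0, 1] from clipped n-gram matches and totals for n = 1..4."""
    if hyp_len == 0 or ref_len == 0:
        return 0.0
    precisions = [c / t if t else 0.0 for c, t in zip(correct, total)]
    if min(precisions) == 0.0:
        return 0.0
    # Penalise hypotheses that are shorter than the closest reference length.
    if hyp_len < ref_len:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    else:
        brevity_penalty = 1.0
    # Geometric mean of the four precisions.
    log_mean = sum(math.log(p) for p in precisions) / 4
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    print("BLEU = %.2f" % (100 * bleu_from_counts([2, 1, 1, 1], [4, 2, 2, 2], 4, 8)))
```

The script prints the same quantity multiplied by 100, next to the individual precisions and the length ratio.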
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/README.txt
================================================
The language suffix can be found here:
http://www.loc.gov/standards/iso639-2/php/code_list.php
This code includes data from Daniel Naber's Language Tools (Czech abbreviations).
This code includes data from Czech Wiktionary (also Czech abbreviations).
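
Each `nonbreaking_prefix.<lang>` file lists abbreviations that do not end a sentence when they are followed by a period. As a rough illustration of how such a list might be consumed, the sketch below loads the entries and applies a simplified boundary test. `load_prefixes` and `is_sentence_break` are hypothetical helpers, and the rule is much cruder than what a full sentence splitter does.

```python
def load_prefixes(lines):
    """Collect prefixes, skipping blank lines and '#' comment lines."""
    prefixes = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            prefixes.add(entry)
    return prefixes


def is_sentence_break(token, next_token, prefixes):
    """True if `token` ends with a period that plausibly ends the sentence."""
    if not token.endswith("."):
        return False
    if token[:-1] in prefixes:
        return False  # known abbreviation such as "Dr."
    return next_token[:1].isupper()


if __name__ == "__main__":
    prefixes = load_prefixes(["#titles", "Dr", "Prof", ""])
    print(is_sentence_break("Dr.", "Smith", prefixes))
    print(is_sentence_break("home.", "Then", prefixes))
```

In practice the lines would come from the file matching the language code, for example `nonbreaking_prefix.ca` for Catalan.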
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.ca
================================================
Dr
Dra
pàg
p
c
av
Sr
Sra
adm
esq
Prof
S.A
S.L
p.e
ptes
Sta
St
pl
màx
cast
dir
nre
fra
admdora
Emm
Excma
espf
dc
admdor
tel
angl
aprox
ca
dept
dj
dl
dt
ds
dg
dv
ed
entl
al
i.e
maj
smin
n
núm
pta
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.cs
================================================
Bc
BcA
Ing
Ing.arch
MUDr
MVDr
MgA
Mgr
JUDr
PhDr
RNDr
PharmDr
ThLic
ThDr
Ph.D
Th.D
prof
doc
CSc
DrSc
dr. h. c
PaedDr
Dr
PhMr
DiS
abt
ad
a.i
aj
angl
anon
apod
atd
atp
aut
bd
biogr
b.m
b.p
b.r
cca
cit
cizojaz
c.k
col
čes
čín
čj
ed
facs
fasc
fol
fot
franc
h.c
hist
hl
hrsg
ibid
il
ind
inv.č
jap
jhdt
jv
koed
kol
korej
kl
krit
lat
lit
m.a
maď
mj
mp
násl
např
nepubl
něm
no
nr
n.s
okr
odd
odp
obr
opr
orig
phil
pl
pokrač
pol
port
pozn
př.kr
př.n.l
přel
přeprac
příl
pseud
pt
red
repr
resp
revid
rkp
roč
roz
rozš
samost
sect
sest
seš
sign
sl
srv
stol
sv
šk
šk.ro
špan
tab
t.č
tis
tj
tř
tzv
univ
uspoř
vol
vl.jm
vs
vyd
vyobr
zal
zejm
zkr
zprac
zvl
n.p
např
než
MUDr
abl
absol
adj
adv
ak
ak. sl
akt
alch
amer
anat
angl
anglosas
arab
arch
archit
arg
astr
astrol
att
bás
belg
bibl
biol
boh
bot
bulh
círk
csl
č
čas
čes
dat
děj
dep
dět
dial
dór
dopr
dosl
ekon
epic
etnonym
eufem
f
fam
fem
fil
film
form
fot
fr
fut
fyz
gen
geogr
geol
geom
germ
gram
hebr
herald
hist
hl
hovor
hud
hut
chcsl
chem
ie
imp
impf
ind
indoevr
inf
instr
interj
ión
iron
it
kanad
katalán
klas
kniž
komp
konj
konkr
kř
kuch
lat
lék
les
lid
lit
liturg
lok
log
m
mat
meteor
metr
mod
ms
mysl
n
náb
námoř
neklas
něm
nesklon
nom
ob
obch
obyč
ojed
opt
part
pas
pejor
pers
pf
pl
plpf
práv
prep
předl
přivl
r
rcsl
refl
reg
rkp
ř
řec
s
samohl
sg
sl
souhl
spec
srov
stfr
střv
stsl
subj
subst
superl
sv
sz
táz
tech
telev
teol
trans
typogr
var
vedl
verb
vl. jm
voj
vok
vůb
vulg
výtv
vztaž
zahr
zájm
zast
zejm
zeměd
zkr
zř
mj
dl
atp
sport
Mgr
horn
MVDr
JUDr
RSDr
Bc
PhDr
ThDr
Ing
aj
apod
PharmDr
pomn
ev
slang
nprap
odp
dop
pol
st
stol
p. n. l
před n. l
n. l
př. Kr
po Kr
př. n. l
odd
RNDr
tzv
atd
tzn
resp
tj
p
br
č. j
čj
č. p
čp
a. s
s. r. o
spol. s r. o
p. o
s. p
v. o. s
k. s
o. p. s
o. s
v. r
v z
ml
vč
kr
mld
hod
popř
ap
event
rus
slov
rum
švýc
P. T
zvl
hor
dol
S.O.S
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.de
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
#no German words end in single lower-case letters, so we throw those in too.
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
#Roman Numerals. A dot after one of these is not a sentence break in German.
I
II
III
IV
V
VI
VII
VIII
IX
X
XI
XII
XIII
XIV
XV
XVI
XVII
XVIII
XIX
XX
i
ii
iii
iv
v
vi
vii
viii
ix
x
xi
xii
xiii
xiv
xv
xvi
xvii
xviii
xix
xx
#Titles and Honorifics
Adj
Adm
Adv
Asst
Bart
Bldg
Brig
Bros
Capt
Cmdr
Col
Comdr
Con
Corp
Cpl
DR
Dr
Ens
Gen
Gov
Hon
Hosp
Insp
Lt
MM
MR
MRS
MS
Maj
Messrs
Mlle
Mme
Mr
Mrs
Ms
Msgr
Op
Ord
Pfc
Ph
Prof
Pvt
Rep
Reps
Res
Rev
Rt
Sen
Sens
Sfc
Sgt
Sr
St
Supt
Surg
#Misc symbols
Mio
Mrd
bzw
v
vs
usw
d.h
z.B
u.a
etc
Mrd
MwSt
ggf
d.J
D.h
m.E
vgl
I.F
z.T
sogen
ff
u.E
g.U
g.g.A
c.-à-d
Buchst
u.s.w
sog
u.ä
Std
evtl
Zt
Chr
u.U
o.ä
Ltd
b.A
z.Zt
spp
sen
SA
k.o
jun
i.H.v
dgl
dergl
Co
zzt
usf
s.p.a
Dkr
Corp
bzgl
BSE
#Number indicators
# add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it
No
Nos
Art
Nr
pp
ca
Ca
#Ordinals are done with . in German - "1." = "1st" in English
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.el
================================================
# Single letters in upper-case are usually abbreviations of names
Α
Β
Γ
Δ
Ε
Ζ
Η
Θ
Ι
Κ
Λ
Μ
Ν
Ξ
Ο
Π
Ρ
Σ
Τ
Υ
Φ
Χ
Ψ
Ω
# Includes abbreviations for the Greek language compiled from various sources (Greek grammar books, Greek language related web content).
Άθαν
Έγχρ
Έκθ
Έσδ
Έφ
Όμ
Α΄Έσδρ
Α΄Έσδ
Α΄Βασ
Α΄Θεσ
Α΄Ιω
Α΄Κορινθ
Α΄Κορ
Α΄Μακκ
Α΄Μακ
Α΄Πέτρ
Α΄Πέτ
Α΄Παραλ
Α΄Πε
Α΄Σαμ
Α΄Τιμ
Α΄Χρον
Α΄Χρ
Α.Β.Α
Α.Β
Α.Ε
Α.Κ.Τ.Ο
Αέθλ
Αέτ
Αίλ.Δ
Αίλ.Τακτ
Αίσ
Αββακ
Αβυδ
Αβ
Αγάκλ
Αγάπ
Αγάπ.Αμαρτ.Σ
Αγάπ.Γεωπ
Αγαθάγγ
Αγαθήμ
Αγαθιν
Αγαθοκλ
Αγαθρχ
Αγαθ
Αγαθ.Ιστ
Αγαλλ
Αγαπητ
Αγγ
Αγησ
Αγλ
Αγορ.Κ
Αγρο.Κωδ
Αγρ.Εξ
Αγρ.Κ
Αγ.Γρ
Αδριαν
Αδρ
Αετ
Αθάν
Αθήν
Αθήν.Επιγρ
Αθήν.Επιτ
Αθήν.Ιατρ
Αθήν.Μηχ
Αθανάσ
Αθαν
Αθηνί
Αθηναγ
Αθηνόδ
Αθ
Αθ.Αρχ
Αιλ
Αιλ.Επιστ
Αιλ.ΖΙ
Αιλ.ΠΙ
Αιλ.απ
Αιμιλ
Αιν.Γαζ
Αιν.Τακτ
Αισχίν
Αισχίν.Επιστ
Αισχ
Αισχ.Αγαμ
Αισχ.Αγ
Αισχ.Αλ
Αισχ.Ελεγ
Αισχ.Επτ.Θ
Αισχ.Ευμ
Αισχ.Ικέτ
Αισχ.Ικ
Αισχ.Περσ
Αισχ.Προμ.Δεσμ
Αισχ.Πρ
Αισχ.Χοηφ
Αισχ.Χο
Αισχ.απ
ΑιτΕ
Αιτ
Αλκ
Αλχιας
Αμ.Π.Ο
Αμβ
Αμμών
Αμ.
Αν.Πειθ.Συμβ.Δικ
Ανακρ
Ανακ
Αναμν.Τόμ
Αναπλ
Ανδ
Ανθλγος
Ανθστης
Αντισθ
Ανχης
Αν
Αποκ
Απρ
Απόδ
Απόφ
Απόφ.Νομ
Απ
Απ.Δαπ
Απ.Διατ
Απ.Επιστ
Αριθ
Αριστοτ
Αριστοφ
Αριστοφ.Όρν
Αριστοφ.Αχ
Αριστοφ.Βάτρ
Αριστοφ.Ειρ
Αριστοφ.Εκκλ
Αριστοφ.Θεσμ
Αριστοφ.Ιππ
Αριστοφ.Λυσ
Αριστοφ.Νεφ
Αριστοφ.Πλ
Αριστοφ.Σφ
Αριστ
Αριστ.Αθ.Πολ
Αριστ.Αισθ
Αριστ.Αν.Πρ
Αριστ.Ζ.Ι
Αριστ.Ηθ.Ευδ
Αριστ.Ηθ.Νικ
Αριστ.Κατ
Αριστ.Μετ
Αριστ.Πολ
Αριστ.Φυσιογν
Αριστ.Φυσ
Αριστ.Ψυχ
Αριστ.Ρητ
Αρμεν
Αρμ
Αρχ.Εκ.Καν.Δ
Αρχ.Ευβ.Μελ
Αρχ.Ιδ.Δ
Αρχ.Νομ
Αρχ.Ν
Αρχ.Π.Ε
Αρ
Αρ.Φορ.Μητρ
Ασμ
Ασμ.ασμ
Αστ.Δ
Αστ.Χρον
Ασ
Ατομ.Γνωμ
Αυγ
Αφρ
Αχ.Νομ
Α
Α.Εγχ.Π
Α.Κ.΄Υδρας
Β΄Έσδρ
Β΄Έσδ
Β΄Βασ
Β΄Θεσ
Β΄Ιω
Β΄Κορινθ
Β΄Κορ
Β΄Μακκ
Β΄Μακ
Β΄Πέτρ
Β΄Πέτ
Β΄Πέ
Β΄Παραλ
Β΄Σαμ
Β΄Τιμ
Β΄Χρον
Β΄Χρ
Β.Ι.Π.Ε
Β.Κ.Τ
Β.Κ.Ψ.Β
Β.Μ
Β.Ο.Α.Κ
Β.Ο.Α
Β.Ο.Δ
Βίβλ
Βαρ
ΒεΘ
Βι.Περ
Βιπερ
Βιργ
Βλγ
Βούλ
Βρ
Γ΄Βασ
Γ΄Μακκ
ΓΕΝμλ
Γέν
Γαλ
Γεν
Γλ
Γν.Ν.Σ.Κρ
Γνωμ
Γν
Γράμμ
Γρηγ.Ναζ
Γρηγ.Νύσ
Γ Νοσ
Γ' Ογκολ
Γ.Ν
Δ΄Βασ
Δ.Β
Δ.Δίκη
Δ.Δίκ
Δ.Ε.Σ
Δ.Ε.Φ.Α
Δ.Ε.Φ
Δ.Εργ.Ν
Δαμ
Δαμ.μνημ.έργ
Δαν
Δασ.Κ
Δεκ
Δελτ.Δικ.Ε.Τ.Ε
Δελτ.Νομ
Δελτ.Συνδ.Α.Ε
Δερμ
Δευτ
Δεύτ
Δημοσθ
Δημόκρ
Δι.Δικ
Διάτ
Διαιτ.Απ
Διαιτ
Διαρκ.Στρατ
Δικ
Διοίκ.Πρωτ
ΔιοικΔνη
Διοικ.Εφ
Διον.Αρ
Διόρθ.Λαθ
Δ.κ.Π
Δνη
Δν
Δογμ.Όρος
Δρ
Δ.τ.Α
Δτ
ΔωδΝομ
Δ.Περ
Δ.Στρ
ΕΔΠολ
ΕΕυρΚ
ΕΙΣ
ΕΝαυτΔ
ΕΣΑμΕΑ
ΕΣΘ
ΕΣυγκΔ
ΕΤρΑξΧρΔ
Ε.Φ.Ε.Τ
Ε.Φ.Ι
Ε.Φ.Ο.Επ.Α
Εβδ
Εβρ
Εγκύκλ.Επιστ
Εγκ
Εε.Αιγ
Εθν.Κ.Τ
Εθν
Ειδ.Δικ.Αγ.Κακ
Εικ
Ειρ.Αθ
Ειρην.Αθ
Ειρην
Έλεγχ
Ειρ
Εισ.Α.Π
Εισ.Ε
Εισ.Ν.Α.Κ
Εισ.Ν.Κ.Πολ.Δ
Εισ.Πρωτ
Εισηγ.Έκθ
Εισ
Εκκλ
Εκκ
Εκ
Ελλ.Δνη
Εν.Ε
Εξ
Επ.Αν
Επ.Εργ.Δ
Επ.Εφ
Επ.Κυπ.Δ
Επ.Μεσ.Αρχ
Επ.Νομ
Επίκτ
Επίκ
Επι.Δ.Ε
Επιθ.Ναυτ.Δικ
Επικ
Επισκ.Ε.Δ
Επισκ.Εμπ.Δικ
Επιστ.Επετ.Αρμ
Επιστ.Επετ
Επιστ.Ιερ
Επιτρ.Προστ.Συνδ.Στελ
Επιφάν
Επτ.Εφ
Επ.Ιρ
Επ.Ι
Εργ.Ασφ.Νομ
Ερμ.Α.Κ
Ερμη.Σ
Εσθ
Εσπερ
Ετρ.Δ
Ευκλ
Ευρ.Δ.Δ.Α
Ευρ.Σ.Δ.Α
Ευρ.ΣτΕ
Ευρατόμ
Ευρ.Άλκ
Ευρ.Ανδρομ
Ευρ.Βάκχ
Ευρ.Εκ
Ευρ.Ελ
Ευρ.Ηλ
Ευρ.Ηρακ
Ευρ.Ηρ
Ευρ.Ηρ.Μαιν
Ευρ.Ικέτ
Ευρ.Ιππόλ
Ευρ.Ιφ.Α
Ευρ.Ιφ.Τ
Ευρ.Ι.Τ
Ευρ.Κύκλ
Ευρ.Μήδ
Ευρ.Ορ
Ευρ.Ρήσ
Ευρ.Τρωάδ
Ευρ.Φοίν
Εφ.Αθ
Εφ.Εν
Εφ.Επ
Εφ.Θρ
Εφ.Θ
Εφ.Ι
Εφ.Κερ
Εφ.Κρ
Εφ.Λ
Εφ.Ν
Εφ.Πατ
Εφ.Πειρ
Εφαρμ.Δ.Δ
Εφαρμ
Εφεσ
Εφημ
Εφ
Ζαχ
Ζιγ
Ζυ
Ζχ
ΗΕ.Δ
Ημερ
Ηράκλ
Ηροδ
Ησίοδ
Ησ
Η.Ε.Γ
ΘΗΣ
ΘΡ
Θαλ
Θεοδ
Θεοφ
Θεσ
Θεόδ.Μοψ
Θεόκρ
Θεόφιλ
Θουκ
Θρ
Θρ.Ε
Θρ.Ιερ
Θρ.Ιρ
Ιακ
Ιαν
Ιβ
Ιδθ
Ιδ
Ιεζ
Ιερ
Ιζ
Ιησ
Ιησ.Ν
Ικ
Ιλ
Ιν
Ιουδ
Ιουστ
Ιούδα
Ιούλ
Ιούν
Ιπποκρ
Ιππόλ
Ιρ
Ισίδ.Πηλ
Ισοκρ
Ισ.Ν
Ιωβ
Ιωλ
Ιων
Ιω
ΚΟΣ
ΚΟ.ΜΕ.ΚΟΝ
ΚΠοινΔ
ΚΠολΔ
ΚαΒ
Καλ
Καλ.Τέχν
ΚανΒ
Καν.Διαδ
Κατάργ
Κλ
ΚοινΔ
Κολσ
Κολ
Κον
Κορ
Κος
ΚριτΕπιθ
ΚριτΕ
Κριτ
Κρ
ΚτΒ
ΚτΕ
ΚτΠ
Κυβ
Κυπρ
Κύριλ.Αλεξ
Κύριλ.Ιερ
Λεβ
Λεξ.Σουίδα
Λευϊτ
Λευ
Λκ
Λογ
ΛουκΑμ
Λουκιαν
Λουκ.Έρωτ
Λουκ.Ενάλ.Διάλ
Λουκ.Ερμ
Λουκ.Εταιρ.Διάλ
Λουκ.Ε.Δ
Λουκ.Θε.Δ
Λουκ.Ικ.
Λουκ.Ιππ
Λουκ.Λεξιφ
Λουκ.Μεν
Λουκ.Μισθ.Συν
Λουκ.Ορχ
Λουκ.Περ
Λουκ.Συρ
Λουκ.Τοξ
Λουκ.Τυρ
Λουκ.Φιλοψ
Λουκ.Φιλ
Λουκ.Χάρ
Λουκ.
Λουκ.Αλ
Λοχ
Λυδ
Λυκ
Λυσ
Λωζ
Λ1
Λ2
ΜΟΕφ
Μάρκ
Μέν
Μαλ
Ματθ
Μα
Μιχ
Μκ
Μλ
Μμ
Μον.Δ.Π
Μον.Πρωτ
Μον
Μρ
Μτ
Μχ
Μ.Βασ
Μ.Πλ
ΝΑ
Ναυτ.Χρον
Να
Νδικ
Νεεμ
Νε
Νικ
ΝκΦ
Νμ
ΝοΒ
Νομ.Δελτ.Τρ.Ελ
Νομ.Δελτ
Νομ.Σ.Κ
Νομ.Χρ
Νομ
Νομ.Διεύθ
Νοσ
Ντ
Νόσων
Ν1
Ν2
Ν3
Ν4
Νtot
Ξενοφ
Ξεν
Ξεν.Ανάβ
Ξεν.Απολ
Ξεν.Απομν
Ξεν.Απομ
Ξεν.Ελλ
Ξεν.Ιέρ
Ξεν.Ιππαρχ
Ξεν.Ιππ
Ξεν.Κυρ.Αν
Ξεν.Κύρ.Παιδ
Ξεν.Κ.Π
Ξεν.Λακ.Πολ
Ξεν.Οικ
Ξεν.Προσ
Ξεν.Συμπόσ
Ξεν.Συμπ
Ο΄
Οβδ
Οβ
ΟικΕ
Οικ
Οικ.Πατρ
Οικ.Σύν.Βατ
Ολομ
Ολ
Ολ.Α.Π
Ομ.Ιλ
Ομ.Οδ
ΟπΤοιχ
Οράτ
Ορθ
ΠΡΟ.ΠΟ
Πίνδ
Πίνδ.Ι
Πίνδ.Νεμ
Πίνδ.Ν
Πίνδ.Ολ
Πίνδ.Παθ
Πίνδ.Πυθ
Πίνδ.Π
ΠαγΝμλγ
Παν
Παρμ
Παροιμ
Παρ
Παυσ
Πειθ.Συμβ
ΠειρΝ
Πελ
ΠεντΣτρ
Πεντ
Πεντ.Εφ
ΠερΔικ
Περ.Γεν.Νοσ
Πετ
Πλάτ
Πλάτ.Αλκ
Πλάτ.Αντ
Πλάτ.Αξίοχ
Πλάτ.Απόλ
Πλάτ.Γοργ
Πλάτ.Ευθ
Πλάτ.Θεαίτ
Πλάτ.Κρατ
Πλάτ.Κριτ
Πλάτ.Λύσ
Πλάτ.Μεν
Πλάτ.Νόμ
Πλάτ.Πολιτ
Πλάτ.Πολ
Πλάτ.Πρωτ
Πλάτ.Σοφ.
Πλάτ.Συμπ
Πλάτ.Τίμ
Πλάτ.Φαίδρ
Πλάτ.Φιλ
Πλημ
Πλούτ
Πλούτ.Άρατ
Πλούτ.Αιμ
Πλούτ.Αλέξ
Πλούτ.Αλκ
Πλούτ.Αντ
Πλούτ.Αρτ
Πλούτ.Ηθ
Πλούτ.Θεμ
Πλούτ.Κάμ
Πλούτ.Καίσ
Πλούτ.Κικ
Πλούτ.Κράσ
Πλούτ.Κ
Πλούτ.Λυκ
Πλούτ.Μάρκ
Πλούτ.Μάρ
Πλούτ.Περ
Πλούτ.Ρωμ
Πλούτ.Σύλλ
Πλούτ.Φλαμ
Πλ
Ποιν.Δικ
Ποιν.Δ
Ποιν.Ν
Ποιν.Χρον
Ποιν.Χρ
Πολ.Δ
Πολ.Πρωτ
Πολ
Πολ.Μηχ
Πολ.Μ
Πρακτ.Αναθ
Πρακτ.Ολ
Πραξ
Πρμ
Πρξ
Πρωτ
Πρ
Πρ.Αν
Πρ.Λογ
Πταισμ
Πυρ.Καλ
Πόλη
Π.Δ
Π.Δ.Άσμ
ΡΜ.Ε
Ρθ
Ρμ
Ρωμ
ΣΠλημ
Σαπφ
Σειρ
Σολ
Σοφ
Σοφ.Αντιγ
Σοφ.Αντ
Σοφ.Αποσ
Σοφ.Απ
Σοφ.Ηλέκ
Σοφ.Ηλ
Σοφ.Οιδ.Κολ
Σοφ.Οιδ.Τύρ
Σοφ.Ο.Τ
Σοφ.Σειρ
Σοφ.Σολ
Σοφ.Τραχ
Σοφ.Φιλοκτ
Σρ
Σ.τ.Ε
Σ.τ.Π
Στρ.Π.Κ
Στ.Ευρ
Συζήτ
Συλλ.Νομολ
Συλ.Νομ
ΣυμβΕπιθ
Συμπ.Ν
Συνθ.Αμ
Συνθ.Ε.Ε
Συνθ.Ε.Κ
Συνθ.Ν
Σφν
Σφ
Σφ.Σλ
Σχ.Πολ.Δ
Σχ.Συντ.Ε
Σωσ
Σύντ
Σ.Πληρ
ΤΘ
ΤΣ.Δ
Τίτ
Τβ
Τελ.Ενημ
Τελ.Κ
Τερτυλ
Τιμ
Τοπ.Α
Τρ.Ο
Τριμ
Τριμ.Πλ
Τρ.Πλημ
Τρ.Π.Δ
Τ.τ.Ε
Ττ
Τωβ
Υγ
Υπερ
Υπ
Υ.Γ
Φιλήμ
Φιλιπ
Φιλ
Φλμ
Φλ
Φορ.Β
Φορ.Δ.Ε
Φορ.Δνη
Φορ.Δ
Φορ.Επ
Φώτ
Χρ.Ι.Δ
Χρ.Ιδ.Δ
Χρ.Ο
Χρυσ
Ψήφ
Ψαλμ
Ψαλ
Ψλ
Ωριγ
Ωσ
Ω.Ρ.Λ
άγν
άγν.ετυμολ
άγ
άκλ
άνθρ
άπ
άρθρ
άρν
άρ
άτ
άψ
ά
έκδ
έκφρ
έμψ
ένθ.αν
έτ
έ.α
ίδ
αβεστ
αβησσ
αγγλ
αγγ
αδημ
αεροναυτ
αερον
αεροπ
αθλητ
αθλ
αθροιστ
αιγυπτ
αιγ
αιτιολ
αιτ
αι
ακαδ
ακκαδ
αλβ
αλλ
αλφαβητ
αμα
αμερικ
αμερ
αμετάβ
αμτβ
αμφιβ
αμφισβ
αμφ
αμ
ανάλ
ανάπτ
ανάτ
αναβ
αναδαν
αναδιπλασ
αναδιπλ
αναδρ
αναλ
αναν
ανασυλλ
ανατολ
ανατομ
ανατυπ
ανατ
αναφορ
αναφ
ανα.ε
ανδρων
ανθρωπολ
ανθρωπ
ανθ
ανομ
αντίτ
αντδ
αντιγρ
αντιθ
αντικ
αντιμετάθ
αντων
αντ
ανωτ
ανόργ
ανών
αορ
απαρέμφ
απαρφ
απαρχ
απαρ
απλολ
απλοπ
αποβ
αποηχηροπ
αποθ
αποκρυφ
αποφ
απρμφ
απρφ
απρόσ
απόδ
απόλ
απόσπ
απόφ
αραβοτουρκ
αραβ
αραμ
αρβαν
αργκ
αριθμτ
αριθμ
αριθ
αρκτικόλ
αρκ
αρμεν
αρμ
αρνητ
αρσ
αρχαιολ
αρχιτεκτ
αρχιτ
αρχκ
αρχ
αρωμουν
αρωμ
αρ
αρ.μετρ
αρ.φ
ασσυρ
αστρολ
αστροναυτ
αστρον
αττ
αυστραλ
αυτοπ
αυτ
αφγαν
αφηρ
αφομ
αφρικ
αχώρ
αόρ
α.α
α/α
α0
βαθμ
βαθ
βαπτ
βασκ
βεβαιωτ
βεβ
βεδ
βενετ
βεν
βερβερ
βιβλγρ
βιολ
βιομ
βιοχημ
βιοχ
βλάχ
βλ
βλ.λ
βοταν
βοτ
βουλγαρ
βουλγ
βούλ
βραζιλ
βρετον
βόρ
γαλλ
γενικότ
γενοβ
γεν
γερμαν
γερμ
γεωγρ
γεωλ
γεωμετρ
γεωμ
γεωπ
γεωργ
γλυπτ
γλωσσολ
γλωσσ
γλ
γνμδ
γνμ
γνωμ
γοτθ
γραμμ
γραμ
γρμ
γρ
γυμν
δίδες
δίκ
δίφθ
δαν
δεικτ
δεκατ
δηλ
δημογρ
δημοτ
δημώδ
δημ
διάγρ
διάκρ
διάλεξ
διάλ
διάσπ
διαλεκτ
διατρ
διαφ
διαχ
διδα
διεθν
διεθ
δικον
διστ
δισύλλ
δισ
διφθογγοπ
δογμ
δολ
δοτ
δρμ
δρχ
δρ(α)
δωρ
δ
εβρ
εγκλπ
εδ
εθνολ
εθν
ειδικότ
ειδ
ειδ.β
εικ
ειρ
εισ
εκατοστμ
εκατοστ
εκατστ.2
εκατστ.3
εκατ
εκδ
εκκλησ
εκκλ
εκ
ελλην
ελλ
ελνστ
ελπ
εμβ
εμφ
εναλλ
ενδ
ενεργ
ενεστ
ενικ
ενν
εν
εξέλ
εξακολ
εξομάλ
εξ
εο
επέκτ
επίδρ
επίθ
επίρρ
επίσ
επαγγελμ
επανάλ
επανέκδ
επιθ
επικ
επιμ
επιρρ
επιστ
επιτατ
επιφ
επών
επ
εργ
ερμ
ερρινοπ
ερωτ
ετρουσκ
ετυμ
ετ
ευφ
ευχετ
εφ
εύχρ
ε.α
ε/υ
ε0
ζωγρ
ζωολ
ηθικ
ηθ
ηλεκτρολ
ηλεκτρον
ηλεκτρ
ημίτ
ημίφ
ημιφ
ηχηροπ
ηχηρ
ηχομιμ
ηχ
η
θέατρ
θεολ
θετ
θηλ
θρακ
θρησκειολ
θρησκ
θ
ιαπων
ιατρ
ιδιωμ
ιδ
ινδ
ιραν
ισπαν
ιστορ
ιστ
ισχυροπ
ιταλ
ιχθυολ
ιων
κάτ
καθ
κακοσ
καν
καρ
κατάλ
κατατ
κατωτ
κατ
κα
κελτ
κεφ
κινεζ
κινημ
κλητ
κλιτ
κλπ
κλ
κν
κοινωνιολ
κοινων
κοπτ
κουτσοβλαχ
κουτσοβλ
κπ
κρ.γν
κτγ
κτην
κτητ
κτλ
κτ
κυριολ
κυρ
κύρ
κ
κ.ά
κ.ά.π
κ.α
κ.εξ
κ.επ
κ.ε
κ.λπ
κ.λ.π
κ.ού.κ
κ.ο.κ
κ.τ.λ
κ.τ.τ
κ.τ.ό
λέξ
λαογρ
λαπ
λατιν
λατ
λαϊκότρ
λαϊκ
λετ
λιθ
λογιστ
λογοτ
λογ
λουβ
λυδ
λόγ
λ
λ.χ
μέλλ
μέσ
μαθημ
μαθ
μαιευτ
μαλαισ
μαλτ
μαμμων
μεγεθ
μεε
μειωτ
μελ
μεξ
μεσν
μεσογ
μεσοπαθ
μεσοφ
μετάθ
μεταβτ
μεταβ
μετακ
μεταπλ
μεταπτωτ
μεταρ
μεταφορ
μετβ
μετεπιθ
μετεπιρρ
μετεωρολ
μετεωρ
μετον
μετουσ
μετοχ
μετρ
μετ
μητρων
μηχανολ
μηχ
μικροβιολ
μογγολ
μορφολ
μουσ
μπενελούξ
μσνλατ
μσν
μτβ
μτγν
μτγ
μτφρδ
μτφρ
μτφ
μτχ
μυθ
μυκην
μυκ
μφ
μ
μ.ε
μ.μ
μ.π.ε
μ.π.π
μ0
ναυτ
νεοελλ
νεολατιν
νεολατ
νεολ
νεότ
νλατ
νομ
νορβ
νοσ
νότ
ν
ξ.λ
οικοδ
οικολ
οικον
οικ
ολλανδ
ολλ
ομηρ
ομόρρ
ονομ
ον
οπτ
ορθογρ
ορθ
οριστ
ορυκτολ
ορυκτ
ορ
οσετ
οσκ
ουαλ
ουγγρ
ουδ
ουσιαστικοπ
ουσιαστ
ουσ
πίν
παθητ
παθολ
παθ
παιδ
παλαιοντ
παλαιότ
παλ
παππων
παράγρ
παράγ
παράλλ
παράλ
παραγ
παρακ
παραλ
παραπ
παρατ
παρβ
παρετυμ
παροξ
παρων
παρωχ
παρ
παρ.φρ
πατριδων
πατρων
πβ
περιθ
περιλ
περιφρ
περσ
περ
πιθ
πληθ
πληροφ
ποδ
ποιητ
πολιτ
πολλαπλ
πολ
πορτογαλ
πορτ
ποσ
πρακριτ
πρβλ
πρβ
πργ
πρκμ
πρκ
πρλ
προέλ
προβηγκ
προελλ
προηγ
προθεμ
προπαραλ
προπαροξ
προπερισπ
προσαρμ
προσηγορ
προσταχτ
προστ
προσφών
προσ
προτακτ
προτ.Εισ
προφ
προχωρ
πρτ
πρόθ
πρόσθ
πρόσ
πρότ
πρ
πρ.Εφ
πτ
πυ
π
π.Χ
π.μ
π.χ
ρήμ
ρίζ
ρηματ
ρητορ
ριν
ρουμ
ρωμ
ρωσ
ρ
σανσκρ
σαξ
σελ
σερβοκρ
σερβ
σημασιολ
σημδ
σημειολ
σημερ
σημιτ
σημ
σκανδ
σκυθ
σκωπτ
σλαβ
σλοβ
σουηδ
σουμερ
σουπ
σπάν
σπανιότ
σπ
σσ
στατ
στερ
στιγμ
στιχ
στρέμ
στρατιωτ
στρατ
στ
συγγ
συγκρ
συγκ
συμπερ
συμπλεκτ
συμπλ
συμπροφ
συμφυρ
συμφ
συνήθ
συνίζ
συναίρ
συναισθ
συνδετ
συνδ
συνεκδ
συνηρ
συνθετ
συνθ
συνοπτ
συντελ
συντομογρ
συντ
συν
συρ
σχημ
σχ
σύγκρ
σύμπλ
σύμφ
σύνδ
σύνθ
σύντμ
σύντ
σ
σ.π
σ/β
τακτ
τελ
τετρ
τετρ.μ
τεχνλ
τεχνολ
τεχν
τεύχ
τηλεπικ
τηλεόρ
τιμ
τιμ.τομ
τοΣ
τον
τοπογρ
τοπων
τοπ
τοσκ
τουρκ
τοχ
τριτοπρόσ
τροποπ
τροπ
τσεχ
τσιγγ
ττ
τυπ
τόμ
τόνν
τ
τ.μ
τ.χλμ
υβρ
υπερθ
υπερσ
υπερ
υπεύθ
υποθ
υποκορ
υποκ
υποσημ
υποτ
υποφ
υποχωρ
υπόλ
υπόχρ
υπ
υστλατ
υψόμ
υψ
φάκ
φαρμακολ
φαρμ
φιλολ
φιλοσ
φιλοτ
φινλ
φοινικ
φράγκ
φρανκον
φριζ
φρ
φυλλ
φυσιολ
φυσ
φωνηεντ
φωνητ
φωνολ
φων
φωτογρ
φ
φ.τ.μ
χαμιτ
χαρτόσ
χαρτ
χασμ
χαϊδ
χγφ
χειλ
χεττ
χημ
χιλ
χλγρ
χλγ
χλμ
χλμ.2
χλμ.3
χλσγρ
χλστγρ
χλστμ
χλστμ.2
χλστμ.3
χλ
χργρ
χρημ
χρον
χρ
χφ
χ.ε
χ.κ
χ.ο
χ.σ
χ.τ
χ.χ
ψευδ
ψυχαν
ψυχιατρ
ψυχολ
ψυχ
ωκεαν
όμ
όν
όπ.παρ
όπ.π
ό.π
ύψ
1Βσ
1Εσ
1Θσ
1Ιν
1Κρ
1Μκ
1Πρ
1Πτ
1Τμ
2Βσ
2Εσ
2Θσ
2Ιν
2Κρ
2Μκ
2Πρ
2Πτ
2Τμ
3Βσ
3Ιν
3Μκ
4Βσ
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.en
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
Adj
Adm
Adv
Asst
Bart
Bldg
Brig
Bros
Capt
Cmdr
Col
Comdr
Con
Corp
Cpl
DR
Dr
Drs
Ens
Gen
Gov
Hon
Hr
Hosp
Insp
Lt
MM
MR
MRS
MS
Maj
Messrs
Mlle
Mme
Mr
Mrs
Ms
Msgr
Op
Ord
Pfc
Ph
Prof
Pvt
Rep
Reps
Res
Rev
Rt
Sen
Sens
Sfc
Sgt
Sr
St
Supt
Surg
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
v
vs
i.e
rev
e.g
#Numbers only. These are treated as non-breaking only when followed by a numeric sequence
# add #NUMERIC_ONLY# after the word for this function
#This case is mostly for the English "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
No #NUMERIC_ONLY#
Nos
Art #NUMERIC_ONLY#
Nr
pp #NUMERIC_ONLY#
#month abbreviations
Jan
Feb
Mar
Apr
#May is a full word
Jun
Jul
Aug
Sep
Oct
Nov
Dec
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.es
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
# Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm
A.C
Apdo
Av
Bco
CC.AA
Da
Dep
Dn
Dr
Dra
EE.UU
Excmo
FF.CC
Fil
Gral
J.C
Let
Lic
N.B
P.D
P.V.P
Prof
Pts
Rte
S.A
S.A.R
S.E
S.L
S.R.C
Sr
Sra
Srta
Sta
Sto
T.V.E
Tel
Ud
Uds
V.B
V.E
Vd
Vds
a/c
adj
admón
afmo
apdo
av
c
c.f
c.g
cap
cm
cta
dcha
doc
ej
entlo
esq
etc
f.c
gr
grs
izq
kg
km
mg
mm
núm
núm
p
p.a
p.ej
ptas
pág
págs
pág
págs
q.e.g.e
q.e.s.m
s
s.s.s
vid
vol
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.fi
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT
#indicate an end-of-sentence marker. Special cases are included for prefixes
#that ONLY appear before 0-9 numbers.
#This list is compiled from omorfi database
#by Tommi A Pirinen.
#any single upper case letter followed by a period is not a sentence ender
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Å
Ä
Ö
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
alik
alil
amir
apul
apul.prof
arkkit
ass
assist
dipl
dipl.arkkit
dipl.ekon
dipl.ins
dipl.kielenk
dipl.kirjeenv
dipl.kosm
dipl.urk
dos
erikoiseläinl
erikoishammasl
erikoisl
erikoist
ev.luutn
evp
fil
ft
hallinton
hallintot
hammaslääket
jatk
jääk
kansaned
kapt
kapt.luutn
kenr
kenr.luutn
kenr.maj
kers
kirjeenv
kom
kom.kapt
komm
konst
korpr
luutn
maist
maj
Mr
Mrs
Ms
M.Sc
neuv
nimim
Ph.D
prof
puh.joht
pääll
res
san
siht
suom
sähköp
säv
toht
toim
toim.apul
toim.joht
toim.siht
tuom
ups
vänr
vääp
ye.ups
ylik
ylil
ylim
ylimatr
yliop
yliopp
ylip
yliv
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall
#into this category - it sometimes ends a sentence)
e.g
ent
esim
huom
i.e
ilm
l
mm
myöh
nk
nyk
par
po
t
v
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.fr
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#
#any single upper case letter followed by a period is not a sentence ender
#usually upper case letters are initials in a name
#no French words end in single lower-case letters, so we throw those in too?
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
#a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
# Period-final abbreviation list for French
A.C.N
A.M
art
ann
apr
av
auj
lib
B.P
boul
ca
c.-à-d
cf
ch.-l
chap
contr
C.P.I
C.Q.F.D
C.N
C.N.S
C.S
dir
éd
e.g
env
al
etc
E.V
ex
fasc
fém
fig
fr
hab
ibid
id
i.e
inf
LL.AA
LL.AA.II
LL.AA.RR
LL.AA.SS
L.D
LL.EE
LL.MM
LL.MM.II.RR
loc.cit
masc
MM
ms
N.B
N.D.A
N.D.L.R
N.D.T
n/réf
NN.SS
N.S
N.D
N.P.A.I
p.c.c
pl
pp
p.ex
p.j
P.S
R.A.S
R.-V
R.P
R.I.P
SS
S.S
S.A
S.A.I
S.A.R
S.A.S
S.E
sec
sect
sing
S.M
S.M.I.R
sq
sqq
suiv
sup
suppl
tél
T.S.V.P
vb
vol
vs
X.O
Z.I
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.ga
================================================
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Á
É
Í
Ó
Ú
Uacht
Dr
B.Arch
m.sh
.i
Co
Cf
cf
i.e
r
Chr
lch #NUMERIC_ONLY#
lgh #NUMERIC_ONLY#
uimh #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.hu
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Á
É
Í
Ó
Ö
Ő
Ú
Ü
Ű
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
Dr
dr
kb
Kb
vö
Vö
pl
Pl
ca
Ca
min
Min
max
Max
ún
Ún
prof
Prof
de
De
du
Du
Szt
St
#Numbers only. These are treated as non-breaking only when followed by a numeric sequence
# add #NUMERIC_ONLY# after the word for this function
#This case is mostly for the English "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
# Month name abbreviations
jan #NUMERIC_ONLY#
Jan #NUMERIC_ONLY#
Feb #NUMERIC_ONLY#
feb #NUMERIC_ONLY#
márc #NUMERIC_ONLY#
Márc #NUMERIC_ONLY#
ápr #NUMERIC_ONLY#
Ápr #NUMERIC_ONLY#
máj #NUMERIC_ONLY#
Máj #NUMERIC_ONLY#
jún #NUMERIC_ONLY#
Jún #NUMERIC_ONLY#
Júl #NUMERIC_ONLY#
júl #NUMERIC_ONLY#
aug #NUMERIC_ONLY#
Aug #NUMERIC_ONLY#
Szept #NUMERIC_ONLY#
szept #NUMERIC_ONLY#
okt #NUMERIC_ONLY#
Okt #NUMERIC_ONLY#
nov #NUMERIC_ONLY#
Nov #NUMERIC_ONLY#
dec #NUMERIC_ONLY#
Dec #NUMERIC_ONLY#
# Other abbreviations
tel #NUMERIC_ONLY#
Tel #NUMERIC_ONLY#
Fax #NUMERIC_ONLY#
fax #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.is
================================================
no #NUMERIC_ONLY#
No #NUMERIC_ONLY#
nr #NUMERIC_ONLY#
Nr #NUMERIC_ONLY#
nR #NUMERIC_ONLY#
NR #NUMERIC_ONLY#
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
^
í
á
ó
æ
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
ab.fn
a.fn
afs
al
alm
alg
andh
ath
aths
atr
ao
au
aukaf
áfn
áhrl.s
áhrs
ákv.gr
ákv
bh
bls
dr
e.Kr
et
ef
efn
ennfr
eink
end
e.st
erl
fél
fskj
fh
f.hl
físl
fl
fn
fo
forl
frb
frl
frh
frt
fsl
fsh
fs
fsk
fst
f.Kr
ft
fv
fyrrn
fyrrv
germ
gm
gr
hdl
hdr
hf
hl
hlsk
hljsk
hljv
hljóðv
hr
hv
hvk
holl
Hos
höf
hk
hrl
ísl
kaf
kap
Khöfn
kk
kg
kk
km
kl
klst
kr
kt
kgúrsk
kvk
leturbr
lh
lh.nt
lh.þt
lo
ltr
mlja
mljó
millj
mm
mms
m.fl
miðm
mgr
mst
mín
nf
nh
nhm
nl
nk
nmgr
no
núv
nt
o.áfr
o.m.fl
ohf
o.fl
o.s.frv
ófn
ób
óákv.gr
óákv
pfn
PR
pr
Ritstj
Rvík
Rvk
samb
samhlj
samn
samn
sbr
sek
sérn
sf
sfn
sh
sfn
sh
s.hl
sk
skv
sl
sn
so
ss.us
s.st
samþ
sbr
shlj
sign
skál
st
st.s
stk
sþ
teg
tbl
tfn
tl
tvíhlj
tvt
till
to
umr
uh
us
uppl
útg
vb
Vf
vh
vkf
Vl
vl
vlf
vmf
8vo
vsk
vth
þt
þf
þjs
þgf
þlt
þolm
þm
þml
þýð
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.it
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
Adj
Adm
Adv
Amn
Arch
Asst
Avv
Bart
Bcc
Bldg
Brig
Bros
C.A.P
C.P
Capt
Cc
Cmdr
Co
Col
Comdr
Con
Corp
Cpl
DR
Dott
Dr
Drs
Egr
Ens
Gen
Geom
Gov
Hon
Hosp
Hr
Id
Ing
Insp
Lt
MM
MR
MRS
MS
Maj
Messrs
Mlle
Mme
Mo
Mons
Mr
Mrs
Ms
Msgr
N.B
Op
Ord
P.S
P.T
Pfc
Ph
Prof
Pvt
RP
RSVP
Rag
Rep
Reps
Res
Rev
Rif
Rt
S.A
S.B.F
S.P.M
S.p.A
S.r.l
Sen
Sens
Sfc
Sgt
Sig
Sigg
Soc
Spett
Sr
St
Supt
Surg
V.P
# other
a.c
acc
all
banc
c.a
c.c.p
c.m
c.p
c.s
c.v
corr
dott
e.p.c
ecc
es
fatt
gg
int
lett
ogg
on
p.c
p.c.c
p.es
p.f
p.r
p.v
post
pp
racc
ric
s.n.c
seg
sgg
ss
tel
u.s
v.r
v.s
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
v
vs
i.e
rev
e.g
#Numbers only. These are treated as non-breaking only when followed by a numeric sequence
# add #NUMERIC_ONLY# after the word for this function
#This case is mostly for the English "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
No #NUMERIC_ONLY#
Nos
Art #NUMERIC_ONLY#
Nr
pp #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.lt
================================================
# Anything in this file, followed by a period (and an upper-case word),
# does NOT indicate an end-of-sentence marker.
# Special cases are included for prefixes that ONLY appear before 0-9 numbers.
# Any single upper case letter followed by a period is not a sentence ender
# (excluding I occasionally, but we leave it in)
# usually upper case letters are initials in a name
A
Ā
B
C
Č
D
E
Ē
F
G
Ģ
H
I
Ī
J
K
Ķ
L
Ļ
M
N
Ņ
O
P
Q
R
S
Š
T
U
Ū
V
W
X
Y
Z
Ž
# Initialis -- Džonas
Dz
Dž
Just
# Day and month abbreviations
# m. menesis d. diena g. gimes
m
mėn
d
g
gim
# Pirmadienis Penktadienis
Pr
Pn
Pirm
Antr
Treč
Ketv
Penkt
Šešt
Sekm
Saus
Vas
Kov
Bal
Geg
Birž
Liep
Rugpj
Rugs
Spal
Lapkr
Gruod
# Business, governmental, geographical terms
a
# aikštė
adv
# advokatas
akad
# akademikas
aklg
# akligatvis
akt
# aktorius
al
# alėja
A.V
# antspaudo vieta
aps
apskr
# apskritis
apyg
# apygarda
aps
apskr
# apskritis
asist
# asistentas
asmv
avd
# asmenvardis
a.k
asm
asm.k
# asmens kodas
atsak
# atsakingasis
atsisk
sąsk
# atsiskaitomoji sąskaita
aut
# autorius
b
k
b.k
# banko kodas
bkl
# bakalauras
bt
# butas
buv
# buvęs, -usi
dail
# dailininkas
dek
# dekanas
dėst
# dėstytojas
dir
# direktorius
dirig
# dirigentas
doc
# docentas
drp
# durpynas
dš
# dešinysis
egz
# egzempliorius
eil
# eilutė
ekon
# ekonomika
el
# elektroninis
etc
ež
# ežeras
faks
# faksas
fak
# fakultetas
gen
# generolas
gyd
# gydytojas
gv
# gyvenvietė
įl
# įlanka
Įn
# įnagininkas
insp
# inspektorius
pan
# ir panašiai
t.t
# ir taip toliau
k.a
# kaip antai
kand
# kandidatas
kat
# katedra
kyš
# kyšulys
kl
# klasė
kln
# kalnas
kn
# knyga
koresp
# korespondentas
kpt
# kapitonas
kr
# kairysis
kt
# kitas
kun
# kunigas
l
e
p
l.e.p
# laikinai einantis pareigas
ltn
# leitenantas
m
mst
# miestas
m.e
# mūsų eros
m.m
# mokslo metai
mot
# moteris
mstl
# miestelis
mgr
# magistras
mgnt
# magistrantas
mjr
# majoras
mln
# milijonas
mlrd
# milijardas
mok
# mokinys
mokyt
# mokytojas
moksl
# mokslinis
nkt
# nekaitomas
ntk
# neteiktinas
Nr
nr
# numeris
p
# ponas
p.d
a.d
# pašto dėžutė, abonentinė dėžutė
p.m.e
# prieš mūsų erą
pan
# ir panašiai
pav
# paveikslas
pavad
# pavaduotojas
pirm
# pirmininkas
pl
# plentas
plg
# palygink
plk
# pulkininkas; pelkė
pr
# prospektas
Kr
pr.Kr
# prieš Kristų
prok
# prokuroras
prot
# protokolas
pss
# pusiasalis
pšt
# paštas
pvz
# pavyzdžiui
r
# rajonas
red
# redaktorius
rš
# raštų kalbos
sąs
# sąsiuvinis
saviv
sav
# savivaldybė
sekr
# sekretorius
sen
# seniūnija, seniūnas
sk
# skaityk; skyrius
skg
# skersgatvis
skyr
sk
# skyrius
skv
# skveras
sp
# spauda; spaustuvė
spec
# specialistas
sr
# sritis
st
# stotis
str
# straipsnis
stud
# studentas
š
š.m
# šių metų
šnek
# šnekamosios
tir
# tiražas
tūkst
# tūkstantis
up
# upė
upl
# upelis
vad
# vadinamasis, -oji
vlsč
# valsčius
ved
# vedėjas
vet
# veterinarija
virš
# viršininkas, viršaitis
vyr
# vyriausiasis, -ioji; vyras
vyresn
# vyresnysis
vlsč
# valsčius
vs
# viensėdis
Vt
vt
# vietininkas
vtv
vv
# vietovardis
žml
# žemėlapis
# Technical terms, abbreviations used in guidebooks, advertisements, etc.
# Generally lower-case.
air
# airiškai
amer
# amerikanizmas
anat
# anatomija
angl
# angl. angliskai
arab
# arabų
archeol
archit
asm
# asmuo
astr
# astronomija
austral
# australiškai
aut
# automobilis
av
# aviacija
bažn
bdv
# būdvardis
bibl
# Biblija
biol
# biologija
bot
# botanika
brt
# burtai, burtažodis.
brus
# baltarusių
buh
# buhalterija
chem
# chemija
col
# collectivum
con
conj
# conjunctivus, jungtukas
dab
# dab. dabartine
dgs
# daugiskaita
dial
# dialektizmas
dipl
dktv
# daiktavardis
džn
# dažnai
ekon
el
# elektra
esam
# esamasis laikas
euf
# eufemizmas
fam
# familiariai
farm
# farmacija
filol
# filologija
filos
# filosofija
fin
# finansai
fiz
# fizika
fiziol
# fiziologija
flk
# folkloras
fon
# fonetika
fot
# fotografija
geod
# geodezija
geogr
geol
# geologija
geom
# geometrija
glžk
gr
# graikų
gram
her
# heraldika
hidr
# hidrotechnika
ind
# Indų
iron
# ironiškai
isp
# ispanų
ist
istor
# istorija
it
# italų
įv
reikšm
įv.reikšm
# įvairiomis reikšmėmis
jap
# japonų
juok
# juokaujamai
jūr
# jūrininkystė
kalb
# kalbotyra
kar
# karyba
kas
# kasyba
kin
# kinematografija
klaus
# klausiamasis
knyg
# knyginis
kom
# komercija
komp
# kompiuteris
kosm
# kosmonautika
kt
# kitas
kul
# kulinarija
kuop
# kuopine
l
# laikas
lit
# literatūrinis
lingv
# lingvistika
log
# logika
lot
# lotynų
mat
# matematika
maž
# mažybinis
med
# medicina
medž
# medžioklė
men
# menas
menk
# menkinamai
metal
# metalurgija
meteor
min
# mineralogija
mit
# mitologija
mok
# mokyklinis
ms
# mįslė
muz
# muzikinis
n
# naujasis
neig
# neigiamasis
neol
# neologizmas
niek
# niekinamai
ofic
# oficialus
opt
# optika
orig
# original
p
# pietūs
pan
# panašiai
parl
# parlamentas
pat
# patarlė
paž
# pažodžiui
plg
# palygink
poet
# poetizmas
poez
# poezija
poligr
# poligrafija
polit
# politika
ppr
# paprastai
pranc
pr
# prancūzų, prūsų
priet
# prietaras
prek
# prekyba
prk
# perkeltine
prs
# persona, asmuo
psn
# pasenęs žodis
psich
# psichologija
pvz
# pavyzdžiui
r
# rytai
rad
# radiotechnika
rel
# religija
ret
# retai
rus
# rusų
sen
# senasis
sl
# slengas, slavų
sov
# sovietinis
spec
# specialus
sport
stat
# statyba
sudurt
# sudurtinis
sutr
# sutrumpintas
suv
# suvalkiečių
š
# šiaurė
šach
# šachmatai
šiaur
škot
# škotiškai
šnek
# šnekamoji
teatr
tech
techn
# technika
teig
# teigiamas
teis
# teisė
tekst
# tekstilė
tel
# telefonas
teol
# teologija
v
# tik vyriškosios, vakarai
t.p
t
p
# ir taip pat
t.t
# ir taip toliau
t.y
# tai yra
vaik
# vaikų
vart
# vartojama
vet
# veterinarija
vid
# vidurinis
vksm
# veiksmažodis
vns
# vienaskaita
vok
# vokiečių
vulg
# vulgariai
zool
# zoologija
žr
# žiūrėk
ž.ū
ž
ū
# žemės ūkis
# List of titles. These are often followed by upper-case names, but do
# not indicate sentence breaks
#
# Jo Eminencija
Em.
# Gerbiamasis
Gerb
gerb
# malonus
malon
# profesorius
Prof
prof
# daktaras (mokslų)
Dr
dr
habil
med
# inž inžinierius
inž
Inž
#Numbers only. These are treated as non-breaking only when followed by a numeric sequence
# add #NUMERIC_ONLY# after the word for this function
#This case is mostly for the English "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
No #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.lv
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
Ā
B
C
Č
D
E
Ē
F
G
Ģ
H
I
Ī
J
K
Ķ
L
Ļ
M
N
Ņ
O
P
Q
R
S
Š
T
U
Ū
V
W
X
Y
Z
Ž
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
dr
Dr
med
prof
Prof
inž
Inž
ist.loc
Ist.loc
kor.loc
Kor.loc
v.i
vietn
Vietn
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
a.l
t.p
pārb
Pārb
vec
Vec
inv
Inv
sk
Sk
spec
Spec
vienk
Vienk
virz
Virz
māksl
Māksl
mūz
Mūz
akad
Akad
soc
Soc
galv
Galv
vad
Vad
sertif
Sertif
folkl
Folkl
hum
Hum
#Numbers only. These are treated as non-breaking only when followed by a numeric sequence
# add #NUMERIC_ONLY# after the word for this function
#This case is mostly for the English "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
Nr #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.nl
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen
# http://nl.wikipedia.org/wiki/Aanspreekvorm
# http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
bacc
bc
bgen
c.i
dhr
dr
dr.h.c
drs
drs
ds
eint
fa
Fa
fam
gen
genm
ing
ir
jhr
jkvr
jr
kand
kol
lgen
lkol
Lt
maj
Mej
mevr
Mme
mr
mr
Mw
o.b.s
plv
prof
ritm
tint
Vz
Z.D
Z.D.H
Z.E
Z.Em
Z.H
Z.K.H
Z.K.M
Z.M
z.v
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
#we seem to have a lot of these in Dutch, e.g. i.p.v - in plaats van (instead of) - which never ends a sentence
a.g.v
bijv
bijz
bv
d.w.z
e.c
e.g
e.k
ev
i.p.v
i.s.m
i.t.t
i.v.m
m.a.w
m.b.t
m.b.v
m.h.o
m.i
m.i.v
v.w.t
#Numbers only. These should only induce breaks when followed by a numeric sequence
# add NUMERIC_ONLY after the word for this function
#This case is mostly for the english "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
Nr #NUMERIC_ONLY#
Nrs
nrs
nr #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.pl
================================================
adw
afr
akad
al
Al
am
amer
arch
art
Art
artyst
astr
austr
bałt
bdb
bł
bm
br
bryg
bryt
centr
ces
chem
chiń
chir
c.k
c.o
cyg
cyw
cyt
czes
czw
cd
Cd
czyt
ćw
ćwicz
daw
dcn
dekl
demokr
det
diec
dł
dn
dot
dol
dop
dost
dosł
h.c
ds
dst
duszp
dypl
egz
ekol
ekon
elektr
em
ew
fab
farm
fot
fr
gat
gastr
geogr
geol
gimn
głęb
gm
godz
górn
gosp
gr
gram
hist
hiszp
hr
Hr
hot
id
in
im
iron
jn
kard
kat
katol
k.k
kk
kol
kl
k.p.a
kpc
k.p.c
kpt
kr
k.r
krak
k.r.o
kryt
kult
laic
łac
niem
woj
nb
np
Nb
Np
pol
pow
m.in
pt
ps
Pt
Ps
cdn
jw
ryc
rys
Ryc
Rys
tj
tzw
Tzw
tzn
zob
ang
ub
ul
pw
pn
pl
al
k
n
nr #NUMERIC_ONLY#
Nr #NUMERIC_ONLY#
ww
wł
ur
zm
żyd
żarg
żyw
wył
bp
bp
wyst
tow
Tow
o
sp
Sp
st
spółdz
Spółdz
społ
spółgł
stoł
stow
Stoł
Stow
zn
zew
zewn
zdr
zazw
zast
zaw
zał
zal
zam
zak
zakł
zagr
zach
adw
Adw
lek
Lek
med
mec
Mec
doc
Doc
dyw
dyr
Dyw
Dyr
inż
Inż
mgr
Mgr
dh
dr
Dh
Dr
p
P
red
Red
prof
prok
Prof
Prok
hab
płk
Płk
nadkom
Nadkom
podkom
Podkom
ks
Ks
gen
Gen
por
Por
reż
Reż
przyp
Przyp
śp
św
śW
Śp
Św
ŚW
szer
Szer
pkt #NUMERIC_ONLY#
str #NUMERIC_ONLY#
tab #NUMERIC_ONLY#
Tab #NUMERIC_ONLY#
tel
ust #NUMERIC_ONLY#
par #NUMERIC_ONLY#
poz
pok
oo
oO
Oo
OO
r #NUMERIC_ONLY#
l #NUMERIC_ONLY#
s #NUMERIC_ONLY#
najśw
Najśw
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Ś
Ć
Ż
Ź
Dz
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.ro
================================================
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
dpdv
etc
șamd
M.Ap.N
dl
Dl
d-na
D-na
dvs
Dvs
pt
Pt
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.ru
================================================
# added Cyrillic uppercase letters [А-Я]
# removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes)
# edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013
А
Б
В
Г
Д
Е
Ж
З
И
Й
К
Л
М
Н
О
П
Р
С
Т
У
Ф
Х
Ц
Ч
Ш
Щ
Ъ
Ы
Ь
Э
Ю
Я
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
0гг
1гг
2гг
3гг
4гг
5гг
6гг
7гг
8гг
9гг
0г
1г
2г
3г
4г
5г
6г
7г
8г
9г
Xвв
Vвв
Iвв
Lвв
Mвв
Cвв
Xв
Vв
Iв
Lв
Mв
Cв
0м
1м
2м
3м
4м
5м
6м
7м
8м
9м
0мм
1мм
2мм
3мм
4мм
5мм
6мм
7мм
8мм
9мм
0см
1см
2см
3см
4см
5см
6см
7см
8см
9см
0дм
1дм
2дм
3дм
4дм
5дм
6дм
7дм
8дм
9дм
0л
1л
2л
3л
4л
5л
6л
7л
8л
9л
0км
1км
2км
3км
4км
5км
6км
7км
8км
9км
0га
1га
2га
3га
4га
5га
6га
7га
8га
9га
0кг
1кг
2кг
3кг
4кг
5кг
6кг
7кг
8кг
9кг
0т
1т
2т
3т
4т
5т
6т
7т
8т
9т
0г
1г
2г
3г
4г
5г
6г
7г
8г
9г
0мг
1мг
2мг
3мг
4мг
5мг
6мг
7мг
8мг
9мг
бульв
в
вв
г
га
гг
гл
гос
д
дм
доп
др
е
ед
ед
зам
и
инд
исп
Исп
к
кап
кг
кв
кл
км
кол
комн
коп
куб
л
лиц
лл
м
макс
мг
мин
мл
млн
млрд
мм
н
наб
нач
неуд
ном
о
обл
обр
общ
ок
ост
отл
п
пер
перераб
пл
пос
пр
просп
проф
р
ред
руб
с
сб
св
см
соч
ср
ст
стр
т
тел
Тел
тех
тт
туп
тыс
уд
ул
уч
физ
х
хор
ч
чел
шт
экз
э
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.sk
================================================
Bc
Mgr
RNDr
PharmDr
PhDr
JUDr
PaedDr
ThDr
Ing
MUDr
MDDr
MVDr
Dr
ThLic
PhD
ArtD
ThDr
Dr
DrSc
CSs
prof
obr
Obr
Č
č
absol
adj
admin
adr
Adr
adv
advok
afr
ak
akad
akc
akuz
et
al
alch
amer
anat
angl
Angl
anglosas
anorg
ap
apod
arch
archeol
archit
arg
art
astr
astrol
astron
atp
atď
austr
Austr
aut
belg
Belg
bibl
Bibl
biol
bot
bud
bás
býv
cest
chem
cirk
csl
čs
Čs
dat
dep
det
dial
diaľ
dipl
distrib
dokl
dosl
dopr
dram
duš
dv
dvojčl
dór
ekol
ekon
el
elektr
elektrotech
energet
epic
est
etc
etonym
eufem
európ
Európ
ev
evid
expr
fa
fam
farm
fem
feud
fil
filat
filoz
fi
fon
form
fot
fr
Fr
franc
Franc
fraz
fut
fyz
fyziol
garb
gen
genet
genpor
geod
geogr
geol
geom
germ
gr
Gr
gréc
Gréc
gréckokat
hebr
herald
hist
hlav
hosp
hromad
hud
hypok
ident
i.e
ident
imp
impf
indoeur
inf
inform
instr
int
interj
inšt
inštr
iron
jap
Jap
jaz
jedn
juhoamer
juhových
juhozáp
juž
kanad
Kanad
kanc
kapit
kpt
kart
katastr
knih
kniž
komp
konj
konkr
kozmet
krajč
kresť
kt
kuch
lat
latinskoamer
lek
lex
lingv
lit
litur
log
lok
max
Max
maď
Maď
medzinár
mest
metr
mil
Mil
min
Min
miner
ml
mld
mn
mod
mytol
napr
nar
Nar
nasl
nedok
neg
negat
neklas
nem
Nem
neodb
neos
neskl
nesklon
nespis
nespráv
neved
než
niekt
niž
nom
náb
nákl
námor
nár
obch
obj
obv
obyč
obč
občian
odb
odd
ods
ojed
okr
Okr
opt
opyt
org
os
osob
ot
ovoc
par
part
pejor
pers
pf
Pf
P.f
p.f
pl
Plk
pod
podst
pokl
polit
politol
polygr
pomn
popl
por
porad
porov
posch
potrav
použ
poz
pozit
poľ
poľno
poľnohosp
poľov
pošt
pož
prac
predl
pren
prep
preuk
priezv
Priezv
privl
prof
práv
príd
príj
prík
príp
prír
prísl
príslov
príč
psych
publ
pís
písm
pôv
refl
reg
rep
resp
rozk
rozlič
rozpráv
roč
Roč
ryb
rádiotech
rím
samohl
semest
sev
severoamer
severových
severozáp
sg
skr
skup
sl
Sloven
soc
soch
sociol
sp
spol
Spol
spoloč
spoluhl
správ
spôs
st
star
starogréc
starorím
s.r.o
stol
stor
str
stredoamer
stredoškol
subj
subst
superl
sv
sz
súkr
súp
súvzť
tal
Tal
tech
tel
Tel
telef
teles
telev
teol
trans
turist
tuzem
typogr
tzn
tzv
ukaz
ul
Ul
umel
univ
ust
ved
vedľ
verb
veter
vin
viď
vl
vod
vodohosp
pnl
vulg
vyj
vys
vysokoškol
vzťaž
vôb
vých
výd
výrob
výsk
výsl
výtv
výtvar
význ
včel
vš
všeob
zahr
zar
zariad
zast
zastar
zastaráv
zb
zdravot
združ
zjemn
zlat
zn
Zn
zool
zr
zried
zv
záhr
zák
zákl
zám
záp
západoeur
zázn
územ
účt
čast
čes
Čes
čl
čísl
živ
pr
fak
Kr
p.n.l
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.sl
================================================
dr
Dr
itd
itn
št #NUMERIC_ONLY#
Št #NUMERIC_ONLY#
d
jan
Jan
feb
Feb
mar
Mar
apr
Apr
jun
Jun
jul
Jul
avg
Avg
sept
Sept
sep
Sep
okt
Okt
nov
Nov
dec
Dec
tj
Tj
npr
Npr
sl
Sl
op
Op
gl
Gl
oz
Oz
prev
dipl
ing
prim
Prim
cf
Cf
gl
Gl
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.sv
================================================
#single upper case letters are usually initials
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
#misc abbreviations
AB
G
VG
dvs
etc
from
iaf
jfr
kl
kr
mao
mfl
mm
osv
pga
tex
tom
vs
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.ta
================================================
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
அ
ஆ
இ
ஈ
உ
ஊ
எ
ஏ
ஐ
ஒ
ஓ
ஔ
ஃ
க
கா
கி
கீ
கு
கூ
கெ
கே
கை
கொ
கோ
கௌ
க்
ச
சா
சி
சீ
சு
சூ
செ
சே
சை
சொ
சோ
சௌ
ச்
ட
டா
டி
டீ
டு
டூ
டெ
டே
டை
டொ
டோ
டௌ
ட்
த
தா
தி
தீ
து
தூ
தெ
தே
தை
தொ
தோ
தௌ
த்
ப
பா
பி
பீ
பு
பூ
பெ
பே
பை
பொ
போ
பௌ
ப்
ற
றா
றி
றீ
று
றூ
றெ
றே
றை
றொ
றோ
றௌ
ற்
ய
யா
யி
யீ
யு
யூ
யெ
யே
யை
யொ
யோ
யௌ
ய்
ர
ரா
ரி
ரீ
ரு
ரூ
ரெ
ரே
ரை
ரொ
ரோ
ரௌ
ர்
ல
லா
லி
லீ
லு
லூ
லெ
லே
லை
லொ
லோ
லௌ
ல்
வ
வா
வி
வீ
வு
வூ
வெ
வே
வை
வொ
வோ
வௌ
வ்
ள
ளா
ளி
ளீ
ளு
ளூ
ளெ
ளே
ளை
ளொ
ளோ
ளௌ
ள்
ழ
ழா
ழி
ழீ
ழு
ழூ
ழெ
ழே
ழை
ழொ
ழோ
ழௌ
ழ்
ங
ஙா
ஙி
ஙீ
ஙு
ஙூ
ஙெ
ஙே
ஙை
ஙொ
ஙோ
ஙௌ
ங்
ஞ
ஞா
ஞி
ஞீ
ஞு
ஞூ
ஞெ
ஞே
ஞை
ஞொ
ஞோ
ஞௌ
ஞ்
ண
ணா
ணி
ணீ
ணு
ணூ
ணெ
ணே
ணை
ணொ
ணோ
ணௌ
ண்
ந
நா
நி
நீ
நு
நூ
நெ
நே
நை
நொ
நோ
நௌ
ந்
ம
மா
மி
மீ
மு
மூ
மெ
மே
மை
மொ
மோ
மௌ
ம்
ன
னா
னி
னீ
னு
னூ
னெ
னே
னை
னொ
னோ
னௌ
ன்
#List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
திரு
திருமதி
வண
கௌரவ
#misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
உ.ம்
#கா.ம்
#எ.ம்
#Numbers only. These should only induce breaks when followed by a numeric sequence
# add NUMERIC_ONLY after the word for this function
#This case is mostly for the english "No." which can either be a sentence of its own, or
#if followed by a number, a non-breaking prefix
No #NUMERIC_ONLY#
Nos
Art #NUMERIC_ONLY#
Nr
pp #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.yue
================================================
#
# Cantonese (Chinese)
#
# Anything in this file, followed by a period,
# does NOT indicate an end-of-sentence marker.
#
# English/Euro-language given-name initials (appearing in
# news, periodicals, etc.)
A
Ā
B
C
Č
D
E
Ē
F
G
Ģ
H
I
Ī
J
K
Ķ
L
Ļ
M
N
Ņ
O
P
Q
R
S
Š
T
U
Ū
V
W
X
Y
Z
Ž
# Numbers only. These should only induce breaks when followed by
# a numeric sequence.
# Add NUMERIC_ONLY after the word for this function. This case is
# mostly for the english "No." which can either be a sentence of its
# own, or if followed by a number, a non-breaking prefix.
No #NUMERIC_ONLY#
Nr #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/nonbreaking_prefixes/nonbreaking_prefix.zh
================================================
#
# Mandarin (Chinese)
#
# Anything in this file, followed by a period,
# does NOT indicate an end-of-sentence marker.
#
# English/Euro-language given-name initials (appearing in
# news, periodicals, etc.)
A
Ā
B
C
Č
D
E
Ē
F
G
Ģ
H
I
Ī
J
K
Ķ
L
Ļ
M
N
Ņ
O
P
Q
R
S
Š
T
U
Ū
V
W
X
Y
Z
Ž
# Numbers only. These should only induce breaks when followed by
# a numeric sequence.
# Add NUMERIC_ONLY after the word for this function. This case is
# mostly for the english "No." which can either be a sentence of its
# own, or if followed by a number, a non-breaking prefix.
No #NUMERIC_ONLY#
Nr #NUMERIC_ONLY#
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/release_model.py
================================================
#!/usr/bin/env python
import argparse
import torch
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Removes the optim data of PyTorch models")
    parser.add_argument("--model", "-m",
                        help="The model filename (*.pt)", required=True)
    parser.add_argument("--output", "-o",
                        help="The output filename (*.pt)", required=True)
    opt = parser.parse_args()
    # Load the checkpoint, drop the optimizer state, and save the slimmer copy.
    model = torch.load(opt.model)
    model['optim'] = None
    torch.save(model, opt.output)
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/test_rouge.py
================================================
# -*- encoding: utf-8 -*-
import argparse
import os
import time
import pyrouge
import shutil
import sys
import codecs
from onmt.utils.logging import init_logger, logger
def test_rouge(cand, ref):
    """Calculate ROUGE scores of sequences passed as an iterator
    e.g. a list of str, an open file, StringIO or even sys.stdin
    """
    current_time = time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime())
    tmp_dir = ".rouge-tmp-{}".format(current_time)
    try:
        if not os.path.isdir(tmp_dir):
            os.mkdir(tmp_dir)
            os.mkdir(tmp_dir + "/candidate")
            os.mkdir(tmp_dir + "/reference")
        candidates = [line.strip() for line in cand]
        references = [line.strip() for line in ref]
        assert len(candidates) == len(references)
        cnt = len(candidates)
        for i in range(cnt):
            # skip pairs whose reference is empty
            if len(references[i]) < 1:
                continue
            with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w",
                      encoding="utf-8") as f:
                f.write(candidates[i])
            with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w",
                      encoding="utf-8") as f:
                f.write(references[i])
        r = pyrouge.Rouge155()
        r.model_dir = tmp_dir + "/reference/"
        r.system_dir = tmp_dir + "/candidate/"
        r.model_filename_pattern = 'ref.#ID#.txt'
        r.system_filename_pattern = r'cand.(\d+).txt'
        rouge_results = r.convert_and_evaluate()
        results_dict = r.output_to_dict(rouge_results)
        return results_dict
    finally:
        # always remove the temporary directory, even on failure
        if os.path.isdir(tmp_dir):
            shutil.rmtree(tmp_dir)
def rouge_results_to_str(results_dict):
    return ">> ROUGE(1/2/3/L/SU4): {:.2f}/{:.2f}/{:.2f}/{:.2f}/{:.2f}".format(
        results_dict["rouge_1_f_score"] * 100,
        results_dict["rouge_2_f_score"] * 100,
        results_dict["rouge_3_f_score"] * 100,
        results_dict["rouge_l_f_score"] * 100,
        results_dict["rouge_su*_f_score"] * 100)
if __name__ == "__main__":
    init_logger('test_rouge.log')
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', type=str, default="candidate.txt",
                        help='candidate file')
    parser.add_argument('-r', type=str, default="reference.txt",
                        help='reference file')
    args = parser.parse_args()
    if args.c.upper() == "STDIN":
        candidates = sys.stdin
    else:
        candidates = codecs.open(args.c, encoding="utf-8")
    references = codecs.open(args.r, encoding="utf-8")
    results_dict = test_rouge(candidates, references)
    logger.info(rouge_results_to_str(results_dict))
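# Illustrative sketch, not part of the original script: it shows which keys
# rouge_results_to_str reads from the pyrouge result dictionary and what the
# logged line looks like. ROUGE_KEYS, format_rouge and toy_results are
# hypothetical names, and the scores are made-up values chosen only to show
# the formatting; real values come from pyrouge.

```python
# Keys read by rouge_results_to_str, in the order they are printed.
ROUGE_KEYS = (
    "rouge_1_f_score",
    "rouge_2_f_score",
    "rouge_3_f_score",
    "rouge_l_f_score",
    "rouge_su*_f_score",
)


def format_rouge(results_dict):
    """Format F-scores as percentages, mirroring rouge_results_to_str."""
    scores = ["{:.2f}".format(results_dict[key] * 100) for key in ROUGE_KEYS]
    return ">> ROUGE(1/2/3/L/SU4): " + "/".join(scores)


# Made-up scores for illustration only.
toy_results = {
    "rouge_1_f_score": 0.4213,
    "rouge_2_f_score": 0.2,
    "rouge_3_f_score": 0.1,
    "rouge_l_f_score": 0.39,
    "rouge_su*_f_score": 0.15,
}
summary = format_rouge(toy_results)
print(summary)
```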
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/tokenizer.perl
================================================
#!/usr/bin/env perl
#
# This file is part of moses. Its use is licensed under the GNU Lesser General
# Public License version 2.1 or, at your option, any later version.
use warnings;
# Sample Tokenizer
### Version 1.1
# written by Pidong Wang, based on the code written by Josh Schroeder and Philipp Koehn
# Version 1.1 updates:
# (1) add multithreading option "-threads NUM_THREADS" (default is 1);
# (2) add a timing option "-time" to calculate the average speed of this tokenizer;
# (3) add an option "-lines NUM_SENTENCES_PER_THREAD" to set the number of lines for each thread (default is 2000), and this option controls the memory amount needed: the larger this number is, the larger memory is required (the higher tokenization speed);
### Version 1.0
# $Id: tokenizer.perl 915 2009-08-10 08:15:49Z philipp $
# written by Josh Schroeder, based on code by Philipp Koehn
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
use warnings;
use FindBin qw($RealBin);
use strict;
use Time::HiRes;
if (eval {require Thread;1;}) {
#module loaded
Thread->import();
}
my $mydir = "$RealBin/nonbreaking_prefixes";
my %NONBREAKING_PREFIX = ();
my @protected_patterns = ();
my $protected_patterns_file = "";
my $language = "en";
my $QUIET = 0;
my $HELP = 0;
my $AGGRESSIVE = 0;
my $SKIP_XML = 0;
my $TIMING = 0;
my $NUM_THREADS = 1;
my $NUM_SENTENCES_PER_THREAD = 2000;
my $PENN = 0;
my $NO_ESCAPING = 0;
while (@ARGV)
{
$_ = shift;
/^-b$/ && ($| = 1, next);
/^-l$/ && ($language = shift, next);
/^-q$/ && ($QUIET = 1, next);
/^-h$/ && ($HELP = 1, next);
/^-x$/ && ($SKIP_XML = 1, next);
/^-a$/ && ($AGGRESSIVE = 1, next);
/^-time$/ && ($TIMING = 1, next);
# Option to add list of regexps to be protected
/^-protected/ && ($protected_patterns_file = shift, next);
/^-threads$/ && ($NUM_THREADS = int(shift), next);
/^-lines$/ && ($NUM_SENTENCES_PER_THREAD = int(shift), next);
/^-penn$/ && ($PENN = 1, next);
/^-no-escape/ && ($NO_ESCAPING = 1, next);
}
# for time calculation
my $start_time;
if ($TIMING)
{
$start_time = [ Time::HiRes::gettimeofday( ) ];
}
# print help message
if ($HELP)
{
print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n";
print "Options:\n";
print " -q ... quiet.\n";
print " -a ... aggressive hyphen splitting.\n";
print " -b ... disable Perl buffering.\n";
print " -time ... enable processing time calculation.\n";
print " -penn ... use Penn treebank-like tokenization.\n";
print " -protected FILE ... specify file with patterns to be protected in tokenisation.\n";
print " -no-escape ... don't perform HTML escaping on apostrophe, quotes, etc.\n";
exit;
}
if (!$QUIET)
{
print STDERR "Tokenizer Version 1.1\n";
print STDERR "Language: $language\n";
print STDERR "Number of threads: $NUM_THREADS\n";
}
# load the language-specific non-breaking prefix info from files in the directory nonbreaking_prefixes
load_prefixes($language,\%NONBREAKING_PREFIX);
if (scalar(%NONBREAKING_PREFIX) eq 0)
{
print STDERR "Warning: No known abbreviations for language '$language'\n";
}
# Load protected patterns
if ($protected_patterns_file)
{
open(PP,$protected_patterns_file) || die "Unable to open $protected_patterns_file";
while(<PP>) {
chomp;
push @protected_patterns, $_;
}
}
my @batch_sentences = ();
my @thread_list = ();
my $count_sentences = 0;
if ($NUM_THREADS > 1)
{# multi-threading tokenization
while(<STDIN>)
{
$count_sentences = $count_sentences + 1;
push(@batch_sentences, $_);
if (scalar(@batch_sentences)>=($NUM_SENTENCES_PER_THREAD*$NUM_THREADS))
{
# assign each thread work
for (my $i=0; $i<$NUM_THREADS; $i++)
{
my $start_index = $i*$NUM_SENTENCES_PER_THREAD;
my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1;
my @subbatch_sentences = @batch_sentences[$start_index..$end_index];
my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences;
push(@thread_list, $new_thread);
}
foreach (@thread_list)
{
my $tokenized_list = $_->join;
foreach (@$tokenized_list)
{
print $_;
}
}
# reset for the new run
@thread_list = ();
@batch_sentences = ();
}
}
# the last batch
if (scalar(@batch_sentences)>0)
{
# assign each thread work
for (my $i=0; $i<$NUM_THREADS; $i++)
{
my $start_index = $i*$NUM_SENTENCES_PER_THREAD;
if ($start_index >= scalar(@batch_sentences))
{
last;
}
my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1;
if ($end_index >= scalar(@batch_sentences))
{
$end_index = scalar(@batch_sentences)-1;
}
my @subbatch_sentences = @batch_sentences[$start_index..$end_index];
my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences;
push(@thread_list, $new_thread);
}
foreach (@thread_list)
{
my $tokenized_list = $_->join;
foreach (@$tokenized_list)
{
print $_;
}
}
}
}
else
{# single thread only
while(<STDIN>)
{
if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
{
#don't try to tokenize XML/HTML tag lines
print $_;
}
else
{
print &tokenize($_);
}
}
}
if ($TIMING)
{
my $duration = Time::HiRes::tv_interval( $start_time );
print STDERR ("TOTAL EXECUTION TIME: ".$duration."\n");
print STDERR ("TOKENIZATION SPEED: ".($duration/$count_sentences*1000)." milliseconds/line\n");
}
#####################################################################################
# subroutines afterward
# tokenize a batch of texts saved in an array
# input: an array containing a batch of texts
# return: another array containing a batch of tokenized texts for the input array
sub tokenize_batch
{
my(@text_list) = @_;
my(@tokenized_list) = ();
foreach (@text_list)
{
if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
{
#don't try to tokenize XML/HTML tag lines
push(@tokenized_list, $_);
}
else
{
push(@tokenized_list, &tokenize($_));
}
}
return \@tokenized_list;
}
# the actual tokenize function which tokenizes one input string
# input: one string
# return: the tokenized string for the input string
sub tokenize
{
my($text) = @_;
if ($PENN) {
return tokenize_penn($text);
}
chomp($text);
$text = " $text ";
# remove ASCII junk
$text =~ s/\s+/ /g;
$text =~ s/[\000-\037]//g;
# Find protected patterns
my @protected = ();
foreach my $protected_pattern (@protected_patterns) {
my $t = $text;
while ($t =~ /(?<PATTERN>$protected_pattern)(?<TAIL>.*)$/) {
push @protected, $+{PATTERN};
$t = $+{TAIL};
}
}
for (my $i = 0; $i < scalar(@protected); ++$i) {
my $subst = sprintf("THISISPROTECTED%.3d", $i);
$text =~ s,\Q$protected[$i], $subst ,g;
}
$text =~ s/ +/ /g;
$text =~ s/^ //g;
$text =~ s/ $//g;
# separate out all "other" special characters
$text =~ s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g;
# aggressive hyphen splitting
if ($AGGRESSIVE)
{
$text =~ s/([\p{IsAlnum}])\-(?=[\p{IsAlnum}])/$1 \@-\@ /g;
}
#multi-dots stay together
$text =~ s/\.([\.]+)/ DOTMULTI$1/g;
while($text =~ /DOTMULTI\./)
{
$text =~ s/DOTMULTI\.([^\.])/DOTDOTMULTI $1/g;
$text =~ s/DOTMULTI\./DOTDOTMULTI/g;
}
# seperate out "," except if within numbers (5,300)
#$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
# separate out "," except if within numbers (5,300)
# previous "global" application skips some: A,B,C,D,E > A , B,C , D,E
# first application uses up B so rule can't see B,C
# two-step version here may create extra spaces but these are removed later
# will also space digit,letter or letter,digit forms (redundant with next section)
$text =~ s/([^\p{IsN}])[,]/$1 , /g;
$text =~ s/[,]([^\p{IsN}])/ , $1/g;
# separate "," after a number if it's the end of a sentence
$text =~ s/([\p{IsN}])[,]$/$1 ,/g;
# separate , pre and post number
#$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
#$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g;
# turn `into '
#$text =~ s/\`/\'/g;
#turn '' into "
#$text =~ s/\'\'/ \" /g;
if ($language eq "en")
{
#split contractions right
$text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([^\p{IsAlpha}\p{IsN}])[']([\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g;
#special case for "1990's"
$text =~ s/([\p{IsN}])[']([s])/$1 '$2/g;
}
elsif (($language eq "fr") or ($language eq "it") or ($language eq "ga"))
{
#split contractions left
$text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([^\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1' $2/g;
}
else
{
$text =~ s/\'/ \' /g;
}
#word token method
my @words = split(/\s/,$text);
$text = "";
for (my $i=0;$i<(scalar(@words));$i++)
{
my $word = $words[$i];
if ( $word =~ /^(\S+)\.$/)
{
my $pre = $1;
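if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
{
#no change
}
elsif (($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==2) && ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[0-9]+/)))
{
#no change
}
else
{
$word = $pre." .";
}
}
$text .= $word." ";
}
# clean up extraneous spaces
$text =~ s/ +/ /g;
$text =~ s/^ //g;
$text =~ s/ $//g;
# .' at end of sentence is missed
$text =~ s/\.\' ?$/ . ' /;
# restore protected
for (my $i = 0; $i < scalar(@protected); ++$i) {
my $subst = sprintf("THISISPROTECTED%.3d", $i);
$text =~ s/$subst/$protected[$i]/g;
}
#restore multi-dots
while($text =~ /DOTDOTMULTI/)
{
$text =~ s/DOTDOTMULTI/DOTMULTI./g;
}
$text =~ s/DOTMULTI/./g;
#escape special chars
if (!$NO_ESCAPING)
{
$text =~ s/\&/\&amp;/g; # escape escape
$text =~ s/\|/\&#124;/g; # factor separator
$text =~ s/\</\&lt;/g; # xml
$text =~ s/\>/\&gt;/g; # xml
$text =~ s/\'/\&apos;/g; # xml
$text =~ s/\"/\&quot;/g; # xml
$text =~ s/\[/\&#91;/g; # syntax non-terminal
$text =~ s/\]/\&#93;/g; # syntax non-terminal
}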
#ensure final line break
$text .= "\n" unless $text =~ /\n$/;
return $text;
}
sub tokenize_penn
{
# Improved compatibility with Penn Treebank tokenization. Useful if
# the text is to later be parsed with a PTB-trained parser.
#
# Adapted from Robert MacIntyre's sed script:
# http://www.cis.upenn.edu/~treebank/tokenizer.sed
my($text) = @_;
chomp($text);
# remove ASCII junk
$text =~ s/\s+/ /g;
$text =~ s/[\000-\037]//g;
# attempt to get correct directional quotes
$text =~ s/^``/`` /g;
$text =~ s/^"/`` /g;
$text =~ s/^`([^`])/` $1/g;
$text =~ s/^'/` /g;
$text =~ s/([ ([{<])"/$1 `` /g;
$text =~ s/([ ([{<])``/$1 `` /g;
$text =~ s/([ ([{<])`([^`])/$1 ` $2/g;
$text =~ s/([ ([{<])'/$1 ` /g;
# close quotes handled at end
$text =~ s=\.\.\.= _ELLIPSIS_ =g;
# separate out "," except if within numbers (5,300)
$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
# separate , pre and post number
$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g;
#$text =~ s=([;:@#\$%&\p{IsSc}])= $1 =g;
$text =~ s=([;:@#\$%&\p{IsSc}\p{IsSo}])= $1 =g;
# Separate out intra-token slashes. PTB tokenization doesn't do this, so
# the tokens should be merged prior to parsing with a PTB-trained parser
# (see syntax-hyphen-splitting.perl).
$text =~ s/([\p{IsAlnum}])\/([\p{IsAlnum}])/$1 \@\/\@ $2/g;
# Assume sentence tokenization has been done first, so split FINAL periods
# only.
$text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g;
# however, we may as well split ALL question marks and exclamation points,
# since they shouldn't have the abbrev.-marker ambiguity problem
$text =~ s=([?!])= $1 =g;
# parentheses, brackets, etc.
$text =~ s=([\]\[\(\){}<>])= $1 =g;
$text =~ s/\(/-LRB-/g;
$text =~ s/\)/-RRB-/g;
$text =~ s/\[/-LSB-/g;
$text =~ s/\]/-RSB-/g;
$text =~ s/{/-LCB-/g;
$text =~ s/}/-RCB-/g;
$text =~ s=--= -- =g;
# First off, add a space to the beginning and end of each line, to reduce
# necessary number of regexps.
$text =~ s=$= =;
$text =~ s=^= =;
$text =~ s="= '' =g;
# possessive or close-single-quote
$text =~ s=([^'])' =$1 ' =g;
# as in it's, I'm, we'd
$text =~ s='([sSmMdD]) = '$1 =g;
$text =~ s='ll = 'll =g;
$text =~ s='re = 're =g;
$text =~ s='ve = 've =g;
$text =~ s=n't = n't =g;
$text =~ s='LL = 'LL =g;
$text =~ s='RE = 'RE =g;
$text =~ s='VE = 'VE =g;
$text =~ s=N'T = N'T =g;
$text =~ s= ([Cc])annot = $1an not =g;
$text =~ s= ([Dd])'ye = $1' ye =g;
$text =~ s= ([Gg])imme = $1im me =g;
$text =~ s= ([Gg])onna = $1on na =g;
$text =~ s= ([Gg])otta = $1ot ta =g;
$text =~ s= ([Ll])emme = $1em me =g;
$text =~ s= ([Mm])ore'n = $1ore 'n =g;
$text =~ s= '([Tt])is = '$1 is =g;
$text =~ s= '([Tt])was = '$1 was =g;
$text =~ s= ([Ww])anna = $1an na =g;
#word token method
my @words = split(/\s/,$text);
$text = "";
for (my $i=0;$i<(scalar(@words));$i++)
{
my $word = $words[$i];
if ( $word =~ /^(\S+)\.$/)
{
my $pre = $1;
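if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/))) {
#no change
} elsif (($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==2) && ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[0-9]+/))) {
#no change
} else {
$word = $pre." .";
}
}
$text .= $word." ";
}
# restore ellipses
$text =~ s/_ELLIPSIS_/\.\.\./g;
# clean out extra spaces
$text =~ s/ +/ /g;
$text =~ s/^ *//g;
$text =~ s/ *$//g;
#escape special chars
$text =~ s/\&/\&amp;/g; # escape escape
$text =~ s/\|/\&#124;/g; # factor separator
$text =~ s/\</\&lt;/g; # xml
$text =~ s/\>/\&gt;/g; # xml
$text =~ s/\'/\&apos;/g; # xml
$text =~ s/\"/\&quot;/g; # xml
$text =~ s/\[/\&#91;/g; # syntax non-terminal
$text =~ s/\]/\&#93;/g; # syntax non-terminal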
#ensure final line break
$text .= "\n" unless $text =~ /\n$/;
return $text;
}
sub load_prefixes
{
my ($language, $PREFIX_REF) = @_;
my $prefixfile = "$mydir/nonbreaking_prefix.$language";
#default back to English if we don't have a language-specific prefix file
if (!(-e $prefixfile))
{
$prefixfile = "$mydir/nonbreaking_prefix.en";
print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n";
die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile);
}
if (-e "$prefixfile")
{
open(PREFIX, "<:utf8", "$prefixfile");
while (<PREFIX>)
{
my $item = $_;
chomp($item);
if (($item) && (substr($item,0,1) ne "#"))
{
if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/)
{
$PREFIX_REF->{$1} = 2;
}
else
{
$PREFIX_REF->{$item} = 1;
}
}
}
close(PREFIX);
}
}
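# Illustrative sketch in Python, not part of tokenizer.perl: it mirrors how
# load_prefixes above maps each prefix to 1 (never a sentence break) or 2
# (non-breaking only before a number, marked #NUMERIC_ONLY#). The helper
# split_final_period is a hypothetical name; its rule is an assumption based
# on the conditions visible in the word loop of tokenize (inner dot plus a
# letter, known prefix, or a lower-case next word keep the period attached).

```python
import re

NUMERIC_ONLY = re.compile(r"(.*)\s+#NUMERIC_ONLY#")


def load_prefixes(lines):
    """Map each prefix to 1 (never breaks) or 2 (only before a number)."""
    prefixes = {}
    for line in lines:
        item = line.rstrip("\n")
        if not item or item.startswith("#"):
            continue  # skip blank lines and comments
        match = NUMERIC_ONLY.match(item)
        if match:
            prefixes[match.group(1)] = 2
        else:
            prefixes[item] = 1
    return prefixes


def split_final_period(word, next_word, prefixes):
    """Split off a trailing period unless one of the rules protects it."""
    if len(word) < 2 or not word.endswith("."):
        return word
    pre = word[:-1]
    kind = prefixes.get(pre)
    has_inner_dot = "." in pre and any(c.isalpha() for c in pre)
    next_is_lower = bool(next_word) and next_word[0].islower()
    if has_inner_dot or kind == 1 or next_is_lower:
        return word
    if kind == 2 and next_word and next_word[0].isdigit():
        return word
    return pre + " ."


prefixes = load_prefixes(["#titles", "Dr", "Nr #NUMERIC_ONLY#"])
print(split_final_period("Dr.", "Smith", prefixes))
print(split_final_period("Nr.", "5", prefixes))
print(split_final_period("end.", None, prefixes))
```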
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/tools/vid_feature_extractor.py
================================================
import argparse
import os
import tqdm
from multiprocessing import Manager
import numpy as np
import cv2
import torch
import torch.nn as nn
from PIL import Image
import pretrainedmodels
from pretrainedmodels.utils import TransformImage
Q_FIN = "finished" # end-of-queue flag
def read_to_imgs(file):
    """Yield images and their frame number from a video file."""
    vidcap = cv2.VideoCapture(file)
    success, image = vidcap.read()
    idx = 0
    while success:
        # OpenCV decodes frames as BGR; convert to RGB for PIL
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        yield image, idx
        idx += 1
        success, image = vidcap.read()
def vid_len(path):
    """Return the length of a video."""
    return int(cv2.VideoCapture(path).get(cv2.CAP_PROP_FRAME_COUNT))
class VidDset(object):
    """For each video, yield its frames."""
    def __init__(self, model, root_dir, filenames):
        self.root_dir = root_dir
        self.filenames = filenames
        self.paths = [os.path.join(self.root_dir, f) for f in self.filenames]
        self.xform = TransformImage(model)
        self.current = 0
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, i):
        path = self.paths[i]
        return ((path, idx, self.xform(Image.fromarray(img)))
                for img, idx in read_to_imgs(path))
    def __iter__(self):
        return self
    def next(self):
        if self.current >= len(self):
            raise StopIteration
        else:
            self.current += 1
            return self[self.current - 1]
    def __next__(self):
        return self.next()
def collate_tensor(batch):
    batch[-1] = torch.stack(batch[-1], 0)
def batch(dset, batch_size):
    """Collate frames into batches of equal length."""
    batch = [[], [], []]
    batch_ct = 0
    for seq in dset:
        for path, idx, img in seq:
            if batch_ct == batch_size:
                collate_tensor(batch)
                yield batch
                batch = [[], [], []]
                batch_ct = 0
            batch[0].append(path)
            batch[1].append(idx)
            batch[2].append(img)
            batch_ct += 1
    # emit the final, possibly shorter, batch
    if batch_ct != 0:
        collate_tensor(batch)
        yield batch
class FeatureExtractor(nn.Module):
    """Extract feature vectors from a batch of frames."""
    def __init__(self):
        super(FeatureExtractor, self).__init__()
        self.model = pretrainedmodels.resnet152()
        self.FEAT_SIZE = 2048
    def forward(self, x):
        return self.model.avgpool(
            self.model.features(x)).view(-1, 1, self.FEAT_SIZE)
class Reconstructor(object):
    """Turn batches of feature vectors into sequences for each video.
    Assumes data is ordered (use one reconstructor per process).
    :func:`push()` batches in. When finished, :func:`flush()`
    the last sequence.
    """
    def __init__(self, out_path, finished_queue):
        self.out_path = out_path
        self.feats = None
        self.finished_queue = finished_queue
    def save(self, path, feats):
        np.save(path, feats.numpy())
    @staticmethod
    def name_(path, out_path):
        vid_path = path
        vid_fname = os.path.basename(vid_path)
        vid_id = os.path.splitext(vid_fname)[0]
        save_fname = vid_id + ".npy"
        save_path = os.path.join(out_path, save_fname)
        return save_path, vid_id
    def name(self, path):
        return self.name_(path, self.out_path)
    def push(self, paths, idxs, feats):
        start = 0
        for i, idx in enumerate(idxs):
            # frame index 0 marks the start of a new video
            if idx == 0:
                if self.feats is None and i == 0:
                    # degenerate case
                    continue
                these_finished_seq_feats = feats[start:i]
                if self.feats is not None:
                    all_last_seq_feats = torch.cat(
                        [self.feats, these_finished_seq_feats], 0)
                else:
                    all_last_seq_feats = these_finished_seq_feats
                if i - 1 < 0:
                    name = self.path
                else:
                    name = paths[i-1]
                save_path, vid_id = self.name(name)
                self.save(save_path, all_last_seq_feats)
                n_feats = all_last_seq_feats.shape[0]
                self.finished_queue.put((vid_id, n_feats))
                self.feats = None
                start = i
        # cache the features
        if self.feats is None:
            self.feats = feats[start:]
        else:
            self.feats = torch.cat([self.feats, feats[start:]], 0)
        self.path = paths[-1]
    def flush(self):
        if self.feats is not None:  # shouldn't be
            save_path, vid_id = self.name(self.path)
            self.save(save_path, self.feats)
            self.finished_queue.put((vid_id, self.feats.shape[0]))
def finished_watcher(finished_queue, world_size, root_dir, files):
    """Keep a progress bar of frames finished."""
    n_frames = sum(vid_len(os.path.join(root_dir, f)) for f in files)
    n_finished_frames = 0
    with tqdm.tqdm(total=n_frames, unit="Fr") as pbar:
        n_proc_finished = 0
        while True:
            item = finished_queue.get()
            if item == Q_FIN:
                n_proc_finished += 1
                if n_proc_finished == world_size:
                    return
            else:
                vid_id, n_these_frames = item
                n_finished_frames += n_these_frames
                pbar.set_postfix(vid=vid_id)
                pbar.update(n_these_frames)
def run(device_id, world_size, root_dir, batch_size_per_device,
        feats_queue, files):
    """Process a disjoint subset of the videos on each device."""
    if world_size > 1:
        these_files = [f for i, f in enumerate(files)
                       if i % world_size == device_id]
    else:
        these_files = files
    fe = FeatureExtractor()
    dset = VidDset(fe.model, root_dir, these_files)
    dev = torch.device("cuda", device_id) \
        if device_id >= 0 else torch.device("cpu")
    fe.to(dev)
    fe = fe.eval()
    with torch.no_grad():
        for samp in batch(dset, batch_size_per_device):
            paths, idxs, images = samp
            images = images.to(dev)
            feats = fe(images)
            if torch.is_tensor(feats):
                feats = feats.to("cpu")
            else:
                feats = [f.to("cpu") for f in feats]
            feats_queue.put((paths, idxs, feats))
    feats_queue.put(Q_FIN)
    return
def saver(out_path, feats_queue, finished_queue):
    rc = Reconstructor(out_path, finished_queue)
    while True:
        item = feats_queue.get()
        if item == Q_FIN:
            rc.flush()
            finished_queue.put(Q_FIN)
            return
        else:
            paths, idxs, feats = item
            rc.push(paths, idxs, feats)
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--root_dir", type=str, required=True,
                        help="Directory of videos.")
    parser.add_argument("--out_dir", type=str, required=True,
                        help="Directory for output features.")
    parser.add_argument("--world_size", type=int, default=1,
                        help="Number of devices to run on.")
    parser.add_argument("--batch_size_per_device", type=int, default=512)
    opt = parser.parse_args()
    batch_size_per_device = opt.batch_size_per_device
    root_dir = opt.root_dir
    out_path = opt.out_dir
    if not os.path.exists(out_path):
        os.makedirs(out_path)
    # mp queues don't work well between procs unless they're from a manager
    manager = Manager()
    finished_queue = manager.Queue()
    world_size = opt.world_size if torch.cuda.is_available() else -1
    mp = torch.multiprocessing.get_context("spawn")
    procs = []
    print("Starting processing. Progress bar startup can take some time, but "
          "processing will start in the meantime.")
    files = list(sorted(list(os.listdir(root_dir))))
    # skip videos whose features already exist in the output directory
    files = [f for f in files
             if os.path.basename(Reconstructor.name_(f, out_path)[0])
             not in os.listdir(out_path)]
    procs.append(mp.Process(
        target=finished_watcher,
        args=(finished_queue, world_size, root_dir, files),
        daemon=False
    ))
    procs[0].start()
    if world_size >= 1:
        feat_queues = [manager.Queue(2) for _ in range(world_size)]
        for feats_queue, device_id in zip(feat_queues, range(world_size)):
            # each device has its own saver so that reconstructing is easier
            procs.append(mp.Process(
                target=run,
                args=(device_id, world_size, root_dir,
                      batch_size_per_device, feats_queue, files),
                daemon=True))
            procs[-1].start()
            procs.append(mp.Process(
                target=saver,
                args=(out_path, feats_queue, finished_queue),
                daemon=True))
            procs[-1].start()
    else:
        feats_queue = manager.Queue()
        procs.append(mp.Process(
            target=run,
            args=(-1, 1, root_dir,
                  batch_size_per_device, feats_queue, files),
            daemon=True))
        procs[-1].start()
        procs.append(mp.Process(
            target=saver,
            args=(out_path, feats_queue, finished_queue),
            daemon=True))
        procs[-1].start()
    for p in procs:
        p.join()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/train.py
================================================
#!/usr/bin/env python
from onmt.bin.train import main
if __name__ == "__main__":
    main()
================================================
FILE: 深度学习自然语言处理/机器翻译/OpenNMT-py/translate.py
================================================
#!/usr/bin/env python
from onmt.bin.translate import main
if __name__ == "__main__":
    main()
================================================
FILE: 深度学习自然语言处理/机器翻译/README.md
================================================
## Machine Translation
================================================
FILE: 深度学习自然语言处理/机器翻译/bpe-subword论文的我的阅读总结.md
================================================
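My reading notes on the BPE paper
Neural Machine Translation of Rare Words with Subword Units
The intuition behind the algorithm is this: the authors observed that translating a word does not always require the whole word; part of it can be enough to get the rough meaning. In other cases a word can be translated by translating the smaller units it is made of.
The method targets rare words (words whose frequency falls below a manually chosen threshold) and out-of-vocabulary (OOV) words, which earlier models could not translate effectively.
The abstract mentions a back-off dictionary. Earlier translation models handled OOV words with a dictionary that maps each OOV source word to a target word; when an OOV word shows up during translation, it is replaced by its dictionary entry. This has several problems. The minimum frequency is chosen by hand and is often a hyperparameter that changes during tuning, so the size of the dictionary keeps changing, which is inconvenient, while preparing back-off entries for every word is not worth the cost. We also cannot be sure that source and target words correspond one-to-one: a word may have no counterpart, or it may have several, and the right choice depends on the sentence context.
Word-level models have one more problem: they cannot produce words they have never seen. A target word that is not in the vocabulary will never be generated at test time. This matters, because nobody can guarantee that the training corpus covers every possible translation.
According to the abstract, the authors evaluate character n-gram and BPE segmentation, and improve over a back-off dictionary baseline by up to 1.1 BLEU on WMT 15 English-German and 1.3 BLEU on English-Russian.
Translation is an open-vocabulary problem, while word-based translation models usually limit the vocabulary to 30,000–50,000 words.
The authors show experimentally that subword models are both simpler and more effective than large-vocabulary models and back-off dictionary models.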
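To make the "smaller units" idea concrete, here is a minimal sketch of how BPE merges can be learned from word frequencies. It is not the paper's reference code; the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy vocabulary are assumptions made for illustration. Each word starts as a sequence of characters plus an end-of-word marker `</w>`, and the most frequent adjacent pair is merged repeatedly.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge operations and the segmented vocabulary."""
    # Hypothetical representation: a word is a tuple of symbols ending in </w>
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

On this toy vocabulary the frequent suffix `est</w>` becomes a single unit after three merges, so a rare word such as "lowest" can later be segmented into known pieces instead of becoming an OOV token.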
I also found a blog post that summarizes this paper well:
Solving the OOV problem with BPE: Neural Machine Translation of Rare Words with Subword Units
https://blog.csdn.net/weixin_38937984/article/details/101723700
================================================
FILE: 深度学习自然语言处理/模型蒸馏/BERT知识蒸馏代码解析-如何写好损失函数.md
================================================
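Hi everyone, I'm DASOU.
Today we look at knowledge distillation from the code side; the core of this post is how the loss functions are implemented.
I wrote an earlier article on the theory of BERT knowledge distillation, which you may want to read first: [Bert Knowledge Distillation Series (1): What Is Knowledge Distillation](http://mp.weixin.qq.com/s?__biz=MzIyNTY1MDUwNQ==&mid=2247484225&idx=1&sn=b48cfea668bd5b91e1bb8c74e3ab1db3&chksm=e87d3167df0ab8713d045bf656291b0da9e4928f57c27d5ea4e4f537cf44f051dc0b6d6ec35a&scene=21#wechat_redirect).
Knowledge distillation can be organized around three questions: **what to learn, where to learn it from, and how to learn it?**
**What to learn**: the teacher's knowledge, which lives in the network's parameters;
**Where to learn from**: the input layer, the intermediate layers, and the output layer;
**How to learn**: a loss function measures the difference between the teacher network and the student network.
In terms of architecture, BERT can be distilled into simple models such as TextCNN or LSTM, and also into Transformer (TRM) models, for example from a 12-layer BERT to a 4-layer BERT.
At work I previously distilled BERT into TextCNN.
Recently I have been moving toward Transformer-to-Transformer distillation, using the TextBrewer library (it is very powerful).
Below I walk through the core steps of knowledge distillation from the code side, which mostly means analysing the loss-function code.
I use a text classification task as the running example. While reading, pay most attention to the shapes of the data flowing in and out; when you write this kind of code yourself, that is the most important thing to get right.
The task is MNLI, a text classification task with three labels.
The input is Batch_data: [32,128], i.e. [Batch_size, seq_len].
Teacher network: BERT_base, 12 layers, Hidden_size 768.
Student network: a 4-layer BERT with Hidden_size 312.
The first step is to train a teacher network; there is nothing special to say about it.
The second step is to initialize the student network and feed Batch_data through both networks.
Some readers asked how the student BERT is initialized.
The main change is the number of layers in the config file, from the usual 12 to 4. If you do not load parameters from a local checkpoint into the student network, the BERT model class runs its own initialization automatically.
For the code, see an earlier post of mine where the walkthrough is clearer: [PyTorch code check: how to make Bert more stable when fine-tuning on small datasets](http://mp.weixin.qq.com/s?__biz=MzIyNTY1MDUwNQ==&mid=2247483696&idx=1&sn=cc79da01752c5e7588ef8686c1f95e1f&chksm=e87d3316df0aba00e23189158bfdb7a41e964f545422d97940a87d89571881a5ac1be4bb77f8&scene=21#wechat_redirect).
The data first flows through the student network, which gives us two things. The first is the final-layer [CLS] output before softmax, i.e. the logits, with shape [32,3] = [batch_size, label_size].
The second is the hidden-layer outputs, with shape [5,32,128,312], i.e. [number of hidden layers, batch_size, seq_len, Hidden_size].
Note that the number of hidden layers is 5: the model is defined with 4 layers, and the embedding layer is counted as well.
Also note that the hidden-layer difference between student and teacher is measured along seq_len, i.e. over every token's output.
If you only wanted something like the [CLS] output here, you would take the first [32,312] vector; in general we do not do that.
The data then flows through the teacher network, which also gives us two things. The first is the final-layer [CLS] output before softmax, i.e. the logits, with shape [32,3] = [batch_size, label_size].
The second is the hidden-layer outputs, with shape [5,32,128,768] for the five layers used in matching, i.e. [number of hidden layers, batch_size, seq_len, Hidden_size]. (The full teacher produces 13 hidden states: the embedding output plus 12 layers.)
Note that the hidden sizes of the teacher and the student differ: one is 768, the other is 312.
This is very common. When we shrink the student network we make it shorter, and sometimes also narrower, so the hidden output changes from 768 to 312.
This change of dimension has two consequences. First, the student cannot be initialized from the corresponding teacher layers, because Hidden_size differs; BERT's built-in initialization is used instead.
Second, when measuring the difference between student and teacher, the matrices have different sizes, so MSE cannot be applied directly. In code, a linear projection is needed before the MSE.
Also note that the teacher network is frozen, so the projection maps the student's 312 dimensions to 768 with a linear layer; that is, the linear layer is added to the student network.
The overall loss has three parts. For the [CLS] output, KL divergence measures the difference (the code below uses cross-entropy against the teacher's softened distribution, which differs from the KL divergence only by a term that does not depend on the student). For the hidden-layer outputs, MSE and MMD losses are used.
On choosing the losses, I do not think there is much general wisdom; you have to try.
From many papers and my own experience, KL on the final output and MSE on the intermediate layers usually works a bit better. Some experiments use MSE directly on the last layer as well. It is a bit of black magic.
When I first read the code I had not met MMD before and looked it up. I will skip the theory and go to the code shortly.
First, the code for the [CLS] output: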
```
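import torch
import torch.nn.functional as F

def kd_ce_loss(logits_S, logits_T, temperature=1):
    # A per-example temperature tensor is broadcast over the label dimension
    if isinstance(temperature, torch.Tensor) and temperature.dim() > 0:
        temperature = temperature.unsqueeze(-1)
    # Soften teacher and student logits with the same temperature
    beta_logits_T = logits_T / temperature
    beta_logits_S = logits_S / temperature
    p_T = F.softmax(beta_logits_T, dim=-1)
    # Cross-entropy between teacher probabilities and student log-probabilities
    loss = -(p_T * F.log_softmax(beta_logits_S, dim=-1)).sum(dim=-1).mean()
    return loss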
```
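Here logits_S is the student's [CLS] output and logits_T is the teacher's [CLS] output. temperature defaults to 1 in the code; the example sets it to 8.
The code is simple: apply the temperature scaling, to both the student output and the teacher output, then compute the loss.
Next comes the more involved part, the intermediate layers.
The first thing to understand is the correspondence between student layers and teacher layers.
The student has 4 layers and the teacher has 12, so a simple correspondence looks like this: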
```
layer_T : 0, layer_S : 0,
layer_T : 3, layer_S : 1,
layer_T : 6, layer_S : 2,
layer_T : 9, layer_S : 3,
layer_T : 12, layer_S : 4,
```
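This correspondence has to be set by hand. Could layer 1 of the student be mapped to layer 12 of the teacher? It could, but the result would not necessarily be good.
In general, an evenly spaced mapping is fine.
The mapping has one more use: if the student is only shorter and not narrower (fewer layers, same hidden size), the weights can be copied from the teacher according to this mapping when the student is initialized.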
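The evenly spaced mapping above can be generated rather than typed by hand. The helper below is a small sketch; the name `uniform_layer_map` and the requirement that the teacher depth be a multiple of the student depth are assumptions for illustration. Index 0 stands for the embedding output, as in the table above.

```python
def uniform_layer_map(num_teacher_layers, num_student_layers):
    """Return (layer_T, layer_S) pairs with evenly spaced teacher layers."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    step = num_teacher_layers // num_student_layers
    # Index 0 is the embedding output, so there are num_student_layers + 1 pairs
    return [(s * step, s) for s in range(num_student_layers + 1)]


for layer_T, layer_S in uniform_layer_map(12, 4):
    print(f"layer_T : {layer_T}, layer_S : {layer_S}")
```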
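The student's hidden outputs have shape [5,32,128,312], and the teacher's matched hidden outputs have shape [5,32,128,768].
In the implementation, the corresponding layers are paired up (zip-style), MSE is computed for each pair, and the per-layer losses are summed into the loss.
Here is the code: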
```
inters_T = {feature: results_T.get(feature, []) for feature in FEATURES}
inters_S = {feature: results_S.get(feature, []) for feature in FEATURES}
for ith, inter_match in enumerate(self.d_config.intermediate_matches):
    # layer_S, layer_T, feature, match_weight and match_loss are read from
    # inter_match; that unpacking is omitted in this excerpt
    if type(layer_S) is list and type(layer_T) is list:  # MMD loss case
        inter_S = [inters_S[feature][s] for s in layer_S]
        inter_T = [inters_T[feature][t] for t in layer_T]
        name_S = '-'.join(map(str, layer_S))
        name_T = '-'.join(map(str, layer_T))
        if self.projs[ith]:  # project the student's hidden states
            # inter_T = [self.projs[ith](t) for t in inter_T]
            inter_S = [self.projs[ith](s) for s in inter_S]
    else:  # MSE loss case
        inter_S = inters_S[feature][layer_S]
        inter_T = inters_T[feature][layer_T]
        name_S = str(layer_S)
        name_T = str(layer_T)
        if self.projs[ith]:
            # The student hidden size is 312 and the teacher's is 768, so a
            # linear projection lifts the student output before the loss
            inter_S = self.projs[ith](inter_S)
    intermediate_loss = match_loss(inter_S, inter_T, mask=inputs_mask_S)  # e.g. loss = F.mse_loss(state_S, state_T)
    total_loss += intermediate_loss * match_weight
```
The confusing part of this code is 【self.d_config.intermediate_matches】. Printing it gives:
```
IntermediateMatch: layer_T : 0, layer_S : 0, feature : hidden, weight : 1, loss : hidden_mse, proj : ['linear', 312, 768, {}],
IntermediateMatch: layer_T : 3, layer_S : 1, feature : hidden, weight : 1, loss : hidden_mse, proj : ['linear', 312, 768, {}],
IntermediateMatch: layer_T : 6, layer_S : 2, feature : hidden, weight : 1, loss : hidden_mse, proj : ['linear', 312, 768, {}],
IntermediateMatch: layer_T : 9, layer_S : 3, feature : hidden, weight : 1, loss : hidden_mse, proj : ['linear', 312, 768, {}],
IntermediateMatch: layer_T : 12, layer_S : 4, feature : hidden, weight : 1, loss : hidden_mse, proj : ['linear', 312, 768, {}],
IntermediateMatch: layer_T : [0, 0], layer_S : [0, 0], feature : hidden, weight : 1, loss : mmd, proj : None,
IntermediateMatch: layer_T : [3, 3], layer_S : [1, 1], feature : hidden, weight : 1, loss : mmd, proj : None,
IntermediateMatch: layer_T : [6, 6], layer_S : [2, 2], feature : hidden, weight : 1, loss : mmd, proj : None,
IntermediateMatch: layer_T : [9, 9], layer_S : [3, 3], feature : hidden, weight : 1, loss : mmd, proj : None,
IntermediateMatch: layer_T : [12, 12], layer_S : [4, 4], feature : hidden, weight : 1, loss : mmd, proj : None
```
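In short, this variable stores the layer-to-layer correspondence discussed above. The first five rows are measured with the MSE loss. In the last five rows the layers are given as lists; those rows use the MMD loss.
Here is the code for the MMD loss: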
```
def mmd_loss(state_S, state_T, mask=None):
    state_S_0 = state_S[0]  # (batch_size, length, hidden_dim_S)
    state_S_1 = state_S[1]  # (batch_size, length, hidden_dim_S)
    state_T_0 = state_T[0]  # (batch_size, length, hidden_dim_T)
    state_T_1 = state_T[1]  # (batch_size, length, hidden_dim_T)
    if mask is None:
        gram_S = torch.bmm(state_S_0, state_S_1.transpose(1, 2)) / state_S_1.size(2)  # (batch_size, length, length)
        gram_T = torch.bmm(state_T_0, state_T_1.transpose(1, 2)) / state_T_1.size(2)
        loss = F.mse_loss(gram_S, gram_T)
    else:
        mask = mask.to(state_S[0])
        valid_count = torch.pow(mask.sum(dim=1), 2).sum()
        gram_S = torch.bmm(state_S_0, state_S_1.transpose(1, 2)) / state_S_1.size(2)  # (batch_size, length, length)
        gram_T = torch.bmm(state_T_0, state_T_1.transpose(1, 2)) / state_T_1.size(2)
        # Only token pairs where both positions are unmasked contribute
        loss = (F.mse_loss(gram_S, gram_T, reduction='none') * mask.unsqueeze(-1) * mask.unsqueeze(1)).sum() / valid_count
    return loss
```
The most important lines are these:
```
state_S_0 = state_S[0]# 32 128 312 (batch_size , length, hidden_dim_S)
state_T_0 = state_T[0] # 32 128 768 (batch_size , length, hidden_dim_T)
gram_S = torch.bmm(state_S_0, state_S_1.transpose(1, 2)) / state_S_1.size(2)
gram_T = torch.bmm(state_T_0, state_T_1.transpose(1, 2)) / state_T_1.size(2)
```
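In short, each network first computes a bmm within itself, giving a [batch_size, length, length] Gram matrix whose shape no longer depends on the hidden size, and then MSE is taken between the two matrices. If I understand correctly, this corresponds to a linear kernel.
That is roughly how the loss-function code works. When I have time I will put together a small repository that walks through the whole pipeline.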
================================================
FILE: 深度学习自然语言处理/模型蒸馏/Bert蒸馏到简单网络lstm.md
================================================
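Suppose we have a text classification task. To improve the model we usually have a few options:
1. Enlarge the dataset and improve the annotation quality
2. Find more useful text features, such as part-of-speech features or word-boundary features
3. Switch to a model that suits the task better or is more complex, e.g. FastText-->TextCNN-->Bert
...
Later I came across knowledge distillation and learned that a simple neural network can learn from a complex one and improve as a result.
I previously wrote a post on how TextCNN can approach Bert. It was rough, but the core points were there.
This post is based on the paper: Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
The training procedure is:
1. Fine-tune the Bert model on labeled data
2. Augment the unlabeled data in three ways
3. Run the Bert model on the unlabeled data; the Lstm model learns from Bert's predictions, with MSE as the loss.
#### Objective function
The knowledge distillation objective: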

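In general the objective has two parts, a hard-target loss and a soft-target loss, and both can be measured with cross-entropy.
In the original paper, the authors use only the soft target when computing the loss. The soft target is the logits before softmax, compared with MSE; a softmax with temperature T is not used for normalization.
#### Data augmentation
Effective knowledge transfer usually needs a large unlabeled dataset.
Three augmentation methods:
1. Masking: with probability $p_{mask}$, randomly replace a word with [MASK].
Note that the masked sentence is also fed to the Bert model. Intuitively, this rule clarifies how much each word contributes to the label.
2. POS-guided word replacement: with probability $p_{pos}$, randomly replace a word with another word that has the same POS tag. This rule may change the meaning of the sentence.
3. n-gram sampling
The overall procedure: for each word, draw a random number $p$. If $p < p_{mask}$, apply the first rule; otherwise, if $p < p_{mask}+p_{pos}$, apply the second rule. The two rules are mutually exclusive, so a word gets at most one of them. After every word in the sentence has been processed, apply the third rule and add the resulting sentence to the unlabeled dataset.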
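The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the function name `augment`, the `pos_vocab` lookup table, the extra probability `p_ng` for applying n-gram sampling, the default probabilities, and `max_n` are all hypothetical parameters chosen for the sketch.

```python
import random


def augment(tokens, pos_tags, pos_vocab, p_mask=0.1, p_pos=0.1,
            p_ng=0.25, max_n=5, rng=None):
    """Return one augmented copy of a tokenized sentence."""
    rng = rng or random.Random()
    out = []
    for token, tag in zip(tokens, pos_tags):
        p = rng.random()
        if p < p_mask:
            # Rule 1: masking
            out.append("[MASK]")
        elif p < p_mask + p_pos:
            # Rule 2: replace with a word of the same POS tag (assumed lookup)
            out.append(rng.choice(pos_vocab.get(tag, [token])))
        else:
            out.append(token)
    # Rule 3: n-gram sampling, applied once after the per-word pass
    if out and rng.random() < p_ng:
        n = rng.randint(1, min(max_n, len(out)))
        start = rng.randint(0, len(out) - n)
        out = out[start:start + n]
    return out


tokens = ["the", "movie", "was", "great"]
tags = ["DET", "NOUN", "VERB", "ADJ"]
pos_vocab = {"NOUN": ["film", "book"], "ADJ": ["bad", "fine"]}
print(augment(tokens, tags, pos_vocab, rng=random.Random(0)))
```

Each call yields one synthetic sentence; calling it several times per input sentence builds up the unlabeled set that the teacher then labels.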
#### Knowledge distillation results
Results:

================================================
FILE: 深度学习自然语言处理/模型蒸馏/PKD-Bert基于多层的知识蒸馏方式.md
================================================
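The core idea of [PKD](https://arxiv.org/pdf/1908.09355.pdf "Patient Knowledge Distillation for BERT Model Compression") is that distillation should not learn only from the last layer of Bert (the teacher network); it adds a second source of knowledge, **the intermediate layers of Bert**.
In short, PKD's knowledge comes from two places: **the intermediate layers + the final output**.
It alleviates the overfitting, and the resulting loss of generalization, seen in earlier approaches that distil only from the final softmax output.
We start with PKD's two strategies: PKD-Last and PKD-Skip.
# 1.PKD-Last and PKD-Skip
The essence of PKD is learning from intermediate layers, but there are many ways to define which intermediate layers to use.
For example, I could take only the **odd layers**, or only the **even layers**, or only the **two middle layers**, and so on.
The authors use the two options that look most reasonable among these.
**PKD-Last defines the intermediate layers as the last k layers of the teacher network.**
The rationale is that later layers of the teacher contain more, and more important, information.
This is close to the earlier idea of distilling only from the softmax output. Intuitively, though, it feels tail-heavy and unbalanced.
The other strategy is **PKD-Skip: as the name suggests, learn from one layer out of every few layers.**
The rationale is that the lower layers of the teacher also contain important information that should not be missed.
The later experiments show that PKD-Skip works slightly better.
The authors argue that PKD-Skip captures diverse information from different layers of the teacher, whereas PKD-Last captures more homogeneous information because it concentrates on the last few layers.
# 2. PKD
## 2.1 Architecture
The architectures of the two PKD strategies are shown below. **Look at the figure carefully; there is a detail that is easy to miss**: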

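Looking at the figure, **the last layer of Bert (not the green output layer) is not distilled; we come back to this detail shortly**.
## 2.2 How to distil the intermediate layers
This raises a question: how do we distil the intermediate layers?
Think about Bert's architecture. If the maximum length is 128, the output of every Transformer encoder layer has 128 units, each of 768 dimensions.
When distilling an intermediate layer, which unit should we target? All units, or only some of them?
First, recall that ordinary KD distils from the softmax output of the [CLS] unit.
We can borrow this idea. For one thing, distilling every unit is computationally expensive. For another, [CLS], loosely speaking, sees the information of the whole sentence.
Why "loosely speaking"? Because [CLS] does not fully represent the output information of the whole sentence; as I recall, this is mentioned in the Bert paper.
## 2.3 Distilled layers and student initialization
Next, a small detail; compare it with the architecture figure above:
**The last layer of Bert, the teacher network (Layer 12 for BERT-Base), is not considered during distillation.**
One way to read this is that PKD's novelty is learning from intermediate layers, and the last layer is not an intermediate layer. That explanation is a bit forced, though.
The authors' explanation is that the hidden output of the last layer feeds directly into the Softmax layer, and the Softmax output is already covered by the KD loss.
For example, with K=5, the intermediate layers learned under the two PKD strategies are:
PKD-Skip: $I_{pt} = \{2,4,6,8,10\}$;
PKD-Last: $I_{pt} = \{7,8,9,10,11\}$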
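The two index sets above follow a simple pattern, sketched below. The function names `pkd_skip_layers` and `pkd_last_layers` are assumptions for illustration; `k` is the number of distilled intermediate layers (5 for a 6-layer student, since the teacher's last layer is excluded).

```python
def pkd_skip_layers(num_teacher_layers, k):
    """Evenly spaced teacher layers, excluding the last teacher layer."""
    step = num_teacher_layers // (k + 1)
    return [step * i for i in range(1, k + 1)]


def pkd_last_layers(num_teacher_layers, k):
    """The k teacher layers just below the last teacher layer."""
    return list(range(num_teacher_layers - k, num_teacher_layers))


print(pkd_skip_layers(12, 5))
print(pkd_last_layers(12, 5))
```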
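One more detail is how the student network is initialized: its parameters are copied directly from the first few layers of the teacher network.
## 2.4 Loss function
First, note that the intermediate-layer loss is an MSE loss: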

The loss of the whole model has two main parts, the KD loss and the intermediate-layer loss:

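Hyperparameter search space:
1. $T: \{5,10,20\}$
2. $\alpha: \{0.2,0.5,0.7\}$
3. $LR: \{5e-5, 2e-5, 1e-5\}$
4. $\beta: \{10, 100, 500, 1000\}$
# 3. Experimental results
The results can be summarized as follows:
1. PKD does help, and the Skip variant is slightly better than Last.
2. PKD reduces the parameter count and speeds up inference roughly linearly, since it removes layers.
Beyond these two points, the authors ran one more experiment to check: **if the teacher network is larger, does the PKD student perform better?**
I find this experiment very interesting.
Here is the result table: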

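Under KD (note: not PKD), compare #1 and #2: with a larger teacher, the results are mixed. This depends on the size of the training data.
Under KD, compare #1 and #3: with a larger teacher, the student gets clearly worse.
The authors' analysis is that the compression ratio is higher, so the student receives less information.
In other words, the large and small teachers do not differ much in quality themselves, but with the large teacher the compression ratio is bigger and the student learns less.
More interesting is the comparison of #2 and #3: with the large teacher, the student does worse.
I did not understand this at first. Looking more closely, the student in #2 is $Bert_{6}[Base]-KD$, i.e. it is initialized from $Bert_{12}[Base]$ and so starts with half of the teacher's layers.
That is all for this post.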


================================================
FILE: 深度学习自然语言处理/模型蒸馏/Theseus-模块压缩交替训练.md
================================================
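Hi everyone, I'm DASOU. Today's topic is BERT-of-Theseus.
I find this paper quite interesting; it is a good idea to keep in mind.
After reading this post, two points about BERT-of-Theseus are enough:
1. Compression is done by module replacement
2. Apart from the task-specific loss, there are no additional loss functions.
In terms of results, compared with BERT-base, BERT-of-Theseus is $1.94\times$ faster at inference and keeps 98% of the model performance.
# Module replacement
As an example, take a 12-layer Bert as the teacher. Every two Transformer layers of the teacher are replaced by one Transformer layer of the student, so the student ends up as a small 6-layer Bert. During training, teacher modules and student modules are trained in alternation.
See the architecture figure below: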

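The authors say they were inspired by Dropout, and on reflection it really is similar.
Let us look at the benefits.
I said above that every two teacher layers are replaced by one student layer. This brings to mind the PKD-Skip strategy in PKD.
There, every few layers, a student layer learns the output of the corresponding teacher layer: a loss function pulls the two outputs together, using the CLS output.
A word on the basic idea of distillation/compression: the simplest approach uses a loss function to bring the student and teacher as close as possible at the output layer.
To improve on that, loss functions can also bring the student and teacher close at the intermediate layers, as PKD does.
The key to this process is that, during training, loss functions are needed to keep the teacher and student as close as possible.
That creates a problem: the choice of loss functions, and the weights between them, have to be tuned carefully, which is a lot of trouble.
BERT-of-Theseus does not have this problem.
During training, for each position it picks either the teacher's module or the corresponding student module, choosing the student's with probability $r$, and puts the chosen one into the forward pass.
In the paper the teacher network is called the $predecessor$ and the student network the $successor$.
# Training procedure
With the architecture in mind, the overall training procedure is:
1. Train a BERT-base network on the task data as the $predecessor$;
2. Initialize a 6-layer Bert from the first six layers of the $predecessor$ as the $successor$;
3. On the task data, freeze the corresponding weights of the $predecessor$ and train the whole network ($predecessor$ plus $successor$) with replacement probability $r$ (increased linearly to 1 over the steps).
4. To make the $successor$ work as a whole, take it out on its own (setting $r$ to 1 achieves this) and keep fine-tuning it on the training data until the results stop improving.
To summarize: on the training data, teacher and student are trained together. Because replacement is random, sometimes teacher modules take part in a forward pass and sometimes student modules do. After this stage, to make sure the student works as a whole (during the replacement stage its layers mostly took part separately), the student is fine-tuned further on the task data until the results stop improving.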
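The replacement step can be sketched without any deep-learning framework. In the sketch below, modules are plain callables; the names `replacement_rate` and `forward_mixed`, the starting rate `r_start`, and the string-appending toy modules are assumptions for illustration and do not come from the paper's code.

```python
import random


def replacement_rate(step, total_steps, r_start=0.5):
    """Linear schedule that raises the replacement probability to 1."""
    return min(1.0, r_start + (1.0 - r_start) * step / total_steps)


def forward_mixed(x, predecessor_modules, successor_modules, r, rng):
    """Run one forward pass, picking the successor module with probability r."""
    for pred, succ in zip(predecessor_modules, successor_modules):
        module = succ if rng.random() < r else pred
        x = module(x)
    return x


# Toy modules: each appends a letter, so the output shows which path was taken
predecessors = [lambda s: s + "P" for _ in range(6)]
successors = [lambda s: s + "S" for _ in range(6)]
rng = random.Random(0)
for step in (0, 50, 100):
    r = replacement_rate(step, total_steps=100)
    print(step, r, forward_mixed("", predecessors, successors, r, rng))
```

In a real model each predecessor module would be a frozen pair of teacher layers and each successor module one trainable student layer; only the task loss at the end is needed, since gradients reach whichever successor modules were sampled.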
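# Result analysis
## Loss functions of different methods
The paper includes a table of the loss functions used by different Bert distillation methods, which is worth a look: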

Note that $Finetuning$ here should mean taking the first six layers and fine-tuning them on the task.
## Results

Overall, the idea behind BERT-of-Theseus is simple and the results are decent.
================================================
FILE: 深度学习自然语言处理/模型蒸馏/bert2textcnn模型蒸馏.md
================================================
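Why do we need model distillation?
Bert-style models are accurate but slow at inference; distillation strikes a balance between speed and accuracy.
1. By method
By method, model compression generally falls into three types:
1. Parameter sharing or pruning
2. Low-rank factorization
3. Knowledge distillation
For 1 and 2, see Albert.
Knowledge distillation, in essence, uses some mapping to transfer what the teacher has learned to the student network.
A common first question is: I already have training data, whose labels are surely more accurate than the large model's outputs, so why learn from a teacher network?
On this question, I came across a sentence in 李如's article: "The goal of a good model is not to fit the training data but to learn how to generalize to new data."
I think that is well put. Think of it this way: the large model's logits say not only which class an example belongs to, but also how the classes relate to each other.
For example, take the sentence "今天天气真不错,现在就决定了,出去浪一波,来顿烧烤哦" ("The weather is great today, so it's decided: let's go out, have some fun, and grab a barbecue").
The ground-truth label may simply be "travel", while the model's output probabilities may show that both "travel" and "food" are plausible.
This is a kind of "dark knowledge" the model has learned from the data (I think that is the term; I forget where I saw it).
There is another issue: sometimes there is not much training data, and having a large model like Bert produce pseudo-labels on unlabeled data works well as a cold start.
2. By structure
By structure, distillation falls into two types:
1. From a transformer to a transformer
2. From a transformer to another model (a CNN or an lstm)
I mainly want to talk about distilling Bert into TextCNN.
Why textcnn? Mainly because it is fast and its accuracy is decent.
Reference paper: Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
For me the key point of this distillation is how the loss is set up; I set the other aspects aside for now.
The loss has two parts: the cross-entropy between the student's (here, the lstm's) output and the true labels, and the squared error between the student's output and the logits of the large bert model.
Why cross-entropy for one and squared error for the other? The first part is treated as a classification problem and the second as a regression problem. It is simply a matter of which choice fits better.
The two parts are combined with weights; I set both to 0.5.
李如's article notes that a smaller weight on the true-label part may work better, because distillation is mainly about following bert's outputs.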
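As a concrete reading of the two-part loss above, here is a framework-free sketch for a single example. The function names and the toy logits are assumptions for illustration; in practice the same computation would be done on batches with a tensor library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def distill_loss(student_logits, teacher_logits, label,
                 ce_weight=0.5, mse_weight=0.5):
    """Weighted sum of hard-label cross-entropy and logit MSE."""
    # Classification part: student output against the true label
    ce = -log_softmax(student_logits)[label]
    # Regression part: student logits against teacher logits
    mse = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits))
    mse /= len(student_logits)
    return ce_weight * ce + mse_weight * mse


teacher = [2.0, 0.5, -1.0]
print(distill_loss([1.5, 0.2, -0.7], teacher, label=0))
```

Lowering `ce_weight` relative to `mse_weight` shifts the student toward imitating the teacher, which is the adjustment mentioned above.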
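There is a good explanation of this paper here:
Selected Readings of Knowledge Distillation Papers (2), a Zhihu article by 小禅心
https://zhuanlan.zhihu.com/p/89420539
That is as far as I have gone with model distillation for now. I may later spend more time on other distillation methods. Some open-source code:
[Distilling bert into lstm](https://github.com/DA-southampton/knowledge-distillation)
[bert to textcnn/lstm/lkeras/torch](https://github.com/DA-southampton/bert_distill)
[A model distillation library implemented in pytorch](https://github.com/DA-southampton/KD_Lib)
A list of articles and blog posts on Bert model distillation:
A fairly systematic overview:
A Survey of BERT Knowledge Distillation, a Zhihu article by 王三火
https://zhuanlan.zhihu.com/p/106810758
Another good one: BERT 模型蒸馏 Distillation BERT
https://www.jianshu.com/p/ed7942b5207a
It compares two Bert distillation approaches, DistilBERT and Distilled BiLSTM, fairly systematically, and I think it is well written.
From a practical angle, a very good write-up is: Exploring BERT Distillation for Spam Public-Opinion Detection
https://blog.csdn.net/alitech2017/article/details/107412038
It distils bert into textcnn in several ways and compares the final results.
Finally, 李如's article, which is a concise and well-written summary:
[DL] Model Distillation, a Zhihu article by 李如
https://zhuanlan.zhihu.com/p/71986772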
================================================
FILE: 深度学习自然语言处理/模型蒸馏/tinybert-全方位蒸馏.md
================================================
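Hi everyone, I'm DASOU. Today's topic is TinyBert.
[TinyBert](https://openreview.net/pdf?id=rJx0Q6EFPB "TINYBERT: DISTILLING BERT FOR NATURAL LANGUAGE UNDERSTANDING") has two core ideas:
1. A distillation method for transformer-based models: Transformer distillation;
2. A two-stage learning framework: Transformer distillation is applied both in pre-training and in task-specific fine-tuning (the two stages differ slightly).
Both ideas are explained below.
# 1. Transformer distillation
## 1.1 Overall architecture
The overall architecture: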

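Loosely speaking, Bert has three parts: the embedding input layer, the TRM layers in the middle, and the prediction output layer at the end.
In this paper the authors call the embedding layer and the TRM layers together the intermediate layers; keep this in mind when reading.
Different layers of Bert capture different knowledge, so a different loss is defined for each kind of layer to pull the student toward the teacher:
1. The output of the embedding layer
2. The attention matrices of the multi-head attention layers and the hidden-layer outputs
3. The output of the prediction layer
## 1.2 Transformer basics:
Attention layer:
![attention](./images/attention.png)
Multi-head attention layer:
![多头](./images/multi-head.png)
Feed-forward network:
![ffn](./images/ffn.png)
## 1.3 Distilling the Transformer
Transformer distillation has two parts: distilling the attention matrices and distilling the feed-forward network outputs.
**Loss for attention-matrix distillation**: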

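Two details to note:
First, the loss is MSE;
Second, it uses the unnormalized attention matrix, see (1), rather than the one after softmax. **The reason is that experiments showed this converges faster and performs better.**
**Loss for feed-forward network distillation**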

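Two details:
First, the loss is again MSE.
Second, the student's hidden output is multiplied by a weight matrix $w_{h}$, because the hidden size of the student is not necessarily the same as that of the teacher.
Computing MSE directly would therefore not work. This weight matrix is also learned during training.
A side remark: this also explains why tinybert does not initialize the student the way PKD does and instead learns it through the GD (general distillation) process.
tinybert reduces the width (the hidden output dimension) as well as the number of layers. With PKD-style initialization, the student and teacher dimensions would not match, so the weights could not be copied.
**Distilling the embedding layer**: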

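**Distilling the prediction-layer output**:
![预测层输出蒸馏](./images/pre_loss.png)
## 1.4 Overall distillation loss
![总体蒸馏损失函数](./images/loss.png)
# 2. Two-stage distillation
## 2.1 Overall architecture
The overall architecture: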

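## 2.2 Why GD is needed:
This is my own understanding; I see two reasons.
First, as mentioned above, tinybert reduces the dimension as well as the number of layers, so the student and teacher dimensions do not match and PKD-style initialization does not work.
Second, methods such as PKD initialize the student from a subset of the teacher's layers. Intuitively, that is not quite right.
The 12-layer teacher has learned the full information of the text. If a 6-layer student is initialized from the first 6 of the teacher's 12 layers, this treats those 6 layers as if they represented the full information.
Of course, the student is still fine-tuned on the task. The point is only that this initialization is not rigorous.
Tiny bert's initialization is interesting: it also uses distillation.
The teacher is a Bert that has not been fine-tuned on any specific task, and Transformer distillation is run on a large unsupervised corpus. There is no prediction-layer distillation at this stage; according to the appendix, only the intermediate layers are distilled.
In short, this stage runs 3 epochs of distillation from a pre-trained (not yet fine-tuned) Bert.
## 2.3 TD:
TD is task-specific distillation.
Key point: distil the intermediate layers (including the embedding layer) first, then the output layer.
The teacher is a fine-tuned Bert. The student is the tinybert obtained from GD, which is then distilled from the teacher with TD.
The TD process: first distil the intermediate layers on the augmented data for 10 epochs with learning rate 5e-5, then distil the prediction layer for 3 epochs with learning rate 3e-5.
# 3. Data augmentation
Data augmentation is applied when fine-tuning on the task data.
(This feels a bit odd to me.)
A few details:
1. For a single-piece word, use Bert to find the M words closest to the masked word; for a word made of multiple sub-word pieces, use GloVe and cosine similarity to find the M closest words
2. A probability P decides whether the current word is replaced by a candidate.
3. Apply the above to all text in the task dataset, repeated N times.
The pseudocode: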

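# 4. Experimental results
What I care about most is how much the data augmentation contributes.
The authors did run that experiment; as shown below, data augmentation makes a large difference: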

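I would like to see a comparison with PKD under the same model architecture. Unfortunately, the authors do not seem to have run such an experiment (or I missed it).
The tinybert configuration here is:
> the number of layers M=4, the hidden size $d'=312$, the feedforward/filter size $d'_i=1200$ and the head number h=12.
# 5. Brief summary
What I learned from the paper:
First, how the losses for transformer-layer distillation are designed:
1. MSE is used for the attention matrices and the feed-forward outputs;
2. The attention matrices are distilled before normalization
3. When dimensions differ, a weight matrix does the conversion
Second, the dimension mismatch means the student cannot be initialized from the teacher Bert. The GD process solves this by distilling a model with the student's architecture directly from the teacher, so the student is not learned from scratch.
Then there is data augmentation. tinybert's augmentation feels rather crude and somewhat forced, and the method is designed for English.
In TD, different layers are distilled separately: first the intermediate layers, then the output layer. The output layer uses soft targets, not hard targets.
This staged distillation is interesting; I had not noticed this detail before.
I came across this passage in the Tencent article:
> In our experiments, the softmax cross-entropy loss often failed to converge. Switching from softmax cross-entropy to MSE improved convergence but hurt generalization. The reason is that softmax cross-entropy has to learn the whole probability distribution, which is harder to converge, but because it fits the teacher BERT's probability distribution it generalizes better. MSE is sensitive to extreme values and converges faster, but generalizes worse than the former.
That makes sense; worth keeping in mind.
Worth reading:
TinyBERT: smaller and faster than Bert, a Zhihu article by 腾讯技术工程 https://zhuanlan.zhihu.com/p/94359189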
================================================
FILE: 深度学习自然语言处理/模型蒸馏/什么是知识蒸馏.md
================================================
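Bert Knowledge Distillation Series (1): What Is Knowledge Distillation
This post is based on the paper: Distilling the Knowledge in a Neural Network
Helpful explanatory posts:
"Distilling the Knowledge in a Neural Network": Knowledge Distillation, a Zhihu article by musk星辰大海
https://zhuanlan.zhihu.com/p/75031938
This one includes an introduction based on Hinton's slides.
[Classic in Brief] Knowledge Distillation, a Zhihu article by 潘小小
https://zhuanlan.zhihu.com/p/102038521
This one explains the formula derivations clearly.
How should one understand the soft target approach? A Zhihu answer by YJango
https://www.zhihu.com/question/50519680/answer/136406661
Bert series articles
1. Bert model compression
What is knowledge distillation? An overview of the basic concepts.
2. Later improvements on Bert
Albert
RoBERTa
#### What is distillation
To improve model quality we generally have two options. One is to use a more complex model, for example moving from TextCNN to Bert. The other is to ensemble several simple models, which is very common in competitions.
Both are fine offline, where there is no real-time requirement. Once the model has to be deployed for online, real-time inference, latency and compute become constraints, and we need to balance model complexity against accuracy.
At that point we can extract the essence of what the large model has learned and instil it into a small model. This process is distillation.
#### What is knowledge
For a model we usually care about two things: the architecture and the parameters.
Put simply, these two can be seen as the information, or knowledge, the model has learned from the data (mainly the parameters, since the architecture is usually fixed before training).
For us, however, both are a black box: we do not know what actually happens inside.
So what can we observe? The mapping from input vectors to output vectors.
In other words, when I feed in an example, I can see what comes out.
Unlike the label format [0,0,1,0], a model's output usually looks like [0.01,0.01,0.97,0.01].
A concrete example: in an image classification task, if the input image is a BMW, the model assigns the highest probability to the BMW class and spreads the remaining probability over the other classes.
Those other probabilities are usually tiny but still carry information; for example, the probability of a garbage truck is higher than the probability of a carrot.
The model's output therefore carries richer information and has higher entropy than the label. We can treat it as a kind of knowledge, i.e. the experience the small model should learn from the large model.
The large, complex model is usually called the teacher network, and the small model, i.e. the distilled model we want, is called the student network. The student trains by learning from the teacher's outputs and reaches better convergence.
#### Why knowledge distillation works well
As mentioned, both the garbage truck and the carrot receive some probability, and the garbage truck's is larger. This information is useful: it defines a rich similarity structure over the data.
There is a catch, though. The probabilities of the wrong classes are all small and very close to zero, so they contribute little to the cross-entropy loss. The similarity is there, but it is barely reflected in the loss.
There are two ways to address this.
The first is to compute the loss on the values before softmax, i.e. the logits.
The second is to use a temperature parameter T when computing the loss; the higher the temperature, the smoother the probabilities. By raising T we obtain "soft targets", which are then used to train the small model.
The first is in fact a special case of the second, as the paper later proves.
The temperature parameter also echoes the word "distillation": heat up, extract the essence, and pass the knowledge on.
#### Softmax with temperature T
The softening formula: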

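Why is this softening formula needed? As said above, raising T makes the resulting probability distribution smoother.
In the earlier example, the probability of a BMW being recognized as a garbage truck is small. After raising the temperature it is still small, just not as small.
That is, the similarity information in the data is amplified by the higher temperature, so it gets more attention when the loss is computed and has a larger effect on the loss.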
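A small numeric sketch makes the effect of the temperature visible. The logits for the three classes (BMW, garbage truck, carrot) are made-up numbers for illustration, and `softmax_with_temperature` is a name chosen for this sketch.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes: BMW, garbage truck, carrot
logits = [8.0, 3.0, -2.0]
for T in (1.0, 4.0, 8.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 4) for p in probs])
```

At every temperature the garbage truck stays more likely than the carrot, but at higher temperatures its probability is large enough to matter in the loss.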
#### Loss function

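The loss combines a soft-target loss and a hard-target loss. In general, giving the soft-target loss the larger weight works better.
#### How to train
The overall algorithm is illustrated below: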

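The steps shown in the figure above are:
1. Train a normal large model on the labeled data.
2. Use the trained model to compute the soft targets.
3. Train the small model with two loss terms. First, the small model's output at the same temperature is compared with the soft targets using cross-entropy. Second, the small model's output at temperature 1 is compared with the labels (the hard targets) using cross-entropy.
4. At prediction time, set the temperature to 1 and predict as usual.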
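Step 3 can be written out for a single example as follows. This is a sketch under assumptions: the names `softmax` and `kd_loss`, the default `temperature` and `alpha`, and the toy logits are illustrative. The factor `temperature ** 2` on the soft term follows the suggestion in the Hinton et al. paper for keeping the two terms on a comparable scale; it is not mentioned in the text above.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """alpha * soft-target loss + (1 - alpha) * hard-target loss."""
    # Soft term: both networks use the same temperature
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = -sum(p * math.log(q) for p, q in zip(soft_targets, student_soft))
    # Hard term: student at temperature 1 against the true label
    hard_loss = -math.log(softmax(student_logits, 1.0)[label])
    return alpha * temperature ** 2 * soft_loss + (1.0 - alpha) * hard_loss


teacher_logits = [5.0, 1.0, -2.0]
print(kd_loss([4.0, 1.5, -1.0], teacher_logits, label=0))
```

Setting `alpha` above 0.5 gives the soft-target term the larger weight, in line with the remark in the loss-function section.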
================================================
FILE: 深度学习自然语言处理/模型蒸馏/知识蒸馏综述万字长文.md
================================================
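This article was first published on the WeChat public account **【DASOU】**
The code involved can be found in my repository, which already has **1.2k** stars:
github.com/DA-southampton/NLP_ability
If you see any of this differently, feel free to get in touch.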
目录如下:
1. 知识蒸馏简单介绍
2. Bert 蒸馏到 BiLSTM
3. PKD-BERT
4. BERT-of-Theseus
5. TinyBert
## 1. 知识蒸馏简单介绍
**1.1 什么是蒸馏**
一般来说,为了提高模型效果,我们可以使用两种方式。一种是直接使用复杂模型,比如你原来使用的TextCNN,现在使用Bert。一种是多个简单模型的集成,这种套路在竞赛中非常的常见。
这两种方法在离线的时候是没有什么问题的,因为不涉及到实时性的要求。但是一旦涉及到到部署模型,线上实时推理,我们需要考虑时延和计算资源,一般需要对模型的复杂度和精度做一个平衡。
这个时候,我们就可以将我们大模型学到的信息提取精华灌输到到小模型中去,这个过程就是蒸馏。
### **1.2 什么是知识**
对于一个模型,我们一般关注两个部分:模型架构和模型参数。
简答的说,我们可以把这两个部分当做是我们模型从数据中学习到的信息或者说是知识(当然主要是参数,因为架构一般来说是训练之前就定下来的)
但是这两个部分,对于我们来说,属于黑箱,就是我们不知道里面究竟发生了什么事情。
那么什么东西是我们肉眼可见的呢?从输入向量到输出向量的一个映射关系是可以被我们观测到的。
简单来说,我输入一个example,你输出是一个什么情况我是可以看到的。
区别于标签数据格式 [0,0,1,0],模型的输出结果一般是这样的:[0.01,0.01,0.97,0.01]。
举个比较具象的例子,就是如果我们在做一个图片分类的任务,你的输入图像是一辆宝马,那么模型在宝马这个类别上会有着最大的概率值,与此同时还会把剩余的概率值分给其他的类别。
这些其他类别的概率值一般都很小,但是仍然存在着一些信息,比如垃圾车的概率就会比胡萝卜的概率更高一些。
模型的输出结果含有的信息更丰富了,信息熵更大了,我们进一步的可以把这种当成是一种知识,也就是小模型需要从大模型中学习到的经验。
这个时候我们一般把大模型也就是复杂模型称之为老师网络,小模型也就那我们需要的蒸馏模型称之为学生网络。学生网络通过学习老师网络的输出,进而训练模型,达到比较好的收敛效果。
### **1.3 为什么知识蒸馏可以获得比较好的效果**
在前面提到过,卡车和胡萝卜都会有概率值的输出,但是卡车的概率会比胡萝卜大,这种信息是很有用的,它定义了一种丰富的数据相似结构。
上面谈到一个问题,就是不正确的类别概率都比较小,它对交叉熵损失函数的作用非常的低,因为这个概率太接近零了,也就是说,这种相似性存在,但是在损失函数中并没有充分的体现出来。
第一种就是,使用sofmax之前的值,也就是logits,计算损失函数
第二种是在计算损失函数的时候,使用温度参数T,温度参数越高,得到的概率值越平缓。通过升高温度T,我们获取“软目标”,进而训练小模型
其实对于第一种其实是第二种蒸馏方式的的一种特例情况,论文后续有对此进行证明。
这里的温度参数其实在一定程度上和蒸馏这个名词相呼应,通过升温,提取精华,进而灌输知识。
### **1.4 带温度参数T的Softmax函数**
软化公式如下:

说一下为什么需要这么一个软化公式。上面我们谈到,通过升温T,我们得到的概率分布会变得比较平缓。
用上面的例子说就是,宝马被识别为垃圾车的概率比较小,但是通过升温之后,仍然比较小,但是没有那么小(好绕口啊)。
也就是说,数据中存在的相似性信息通过升温被放大了,这样在计算损失函数的时候,这个相似性才会被更大的注意到,才会对损失函数产生比较大的影响力。
### **1.5 损失函数**

损失函数是软目标损失函数和硬目标损失函数的结合,一般来说,软目标损失函数设置的权重需要大一些效果会更好一点。
### **1.6 如何训练**
整体的算法示意图如下:

整体的算法示意图如上所示:
1. 首先使用标签数据训练一个正常的大模型
2. 使用训练好的模型,计算soft targets。
3. 训练小模型,分为两个步骤,首先小模型使用相同的温度参数得到输出结果和软目标做交叉熵损失,其次小模型使用温度参数为1,和标签数据(也就是硬目标)做交叉损失函数。
4. 预测的时候,温度参数设置为1,正常预测。
## 2. Bert 蒸馏到 BiLSTM
**2.1 简单介绍**
假如手上有一个文本分类任务,我们在提升模型效果的时候一般有以下几个思路:
1. 增大数据集,同时提升标注质量
2. 寻找更多有效的文本特征,比如词性特征,词边界特征等等
3. 更换模型,使用更加适合当前任务或者说更加复杂的模型,比如FastText-->TextCNN--Bert
...
之后接触到了知识蒸馏,学习到了简单的神经网络可以从复杂的网路中学习知识,进而提升模型效果。
之前写个一个文章是TextCNN如何逼近Bert,当时写得比较粗糙,但是比较核心的点已经写出来。
这个文章脱胎于这个论文:Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
整个训练过程是这样的:
1. 在标签数据上微调Bert模型
2. 使用三种方式对无标签数据进行数据增强
3. Bert模型在无标签数据上进行推理,Lstm模型学习Bert模型的推理结果,使用MSE作为损失函数。
### **2.2 目标函数**
知识蒸馏的目标函数:

一般来说,我们会使用两个部分,一个是硬目标损失函数,一个是软目标损失函数,两者都可以使用交叉熵进行度量。
在原论文中,作者在计算损失函数的时候只是使用到了软目标,同时这个软目标并不是使用softmax之前的logits进行MSE度量损失,也就是并没有使用带有温度参数T的sotmax进行归一化。
### **2.3 数据增强**
为了促进有效的知识转移,我们经常需要一个庞大的,未标记的数据集。
三种数据增强的方式:
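1. Masking: with probability $p_{mask}$, randomly replace a word with [MASK]. Note that the masked sentence is also fed to the Bert model. Intuitively, this rule clarifies how much each word contributes to the label.
2. POS-guided word replacement: with probability $p_{pos}$, randomly replace a word with another word that has the same POS tag. This rule may change the meaning of the sentence.
3. n-gram sampling
The overall procedure: for each word, draw a random number $p$. If $p < p_{mask}$, apply the first rule; otherwise, if $p < p_{mask}+p_{pos}$, apply the second rule. The two rules are mutually exclusive, so a word gets at most one of them. After every word in the sentence has been processed, apply the third rule and add the resulting sentence to the unlabeled dataset.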
### **2.4 知识蒸馏结果图**
效果图:

## 3. PKD-BERT
[PKD](https://arxiv.org/pdf/1908.09355.pdf) 核心点就是不仅仅从Bert(老师网络)的最后一层学习知识去做蒸馏,它还另加了一部分,就是从**Bert的中间层去学习**。
简单说,PKD的知识来源有两部分:**中间层+最后输出**。
它缓解了之前只用最后softmax输出层的蒸馏方式出现的过拟合而导致泛化能力降低的问题。
接下来,我们从PKD模型的两个策略说起:PKD-Last 和 PKD-Skip。
## **3.1 PKD-Last and PKD-Skip**
PKD的本质是从中间层学习知识,但是这个中间层如何去定义,就各式各样了。
比如说,我完全可以定位我只要**奇数层**,或者我只要**偶数层**,或者说我只要**最中间的两层**,等等,不一而足。
那么作者,主要是使用了这么多想法中的看起来比较合理的两种。
**PKD-Last,就是把中间层定义为老师网络的最后k层**。
这样做是基于老师网络越靠后的层数含有更多更重要的信息。
这样的想法其实和之前的蒸馏想法很类似,也就是只使用softmax层的输出去做蒸馏。但是从感官来看,有种尾大不掉的感觉,不均衡。
另一个策略是 就是**PKD-Skip,顾名思义,就是每跳几层学习一层**。
这么做是基于老师网络比较底层的层也含有一些重要性信息,这些信息不应该被错过。
作者在后面的实验中,证明了,PKD-Skip 效果稍微好一点(slightly better);
作者认为PKD-Skip抓住了老师网络不同层的多样性信息。而PKD-Last抓住的更多相对来说同质化信息,因为集中在了最后几层。
## **3.2. PKD**
## **3. 2.1架构图**
两种策略的PKD的架构图如下所示,**注意观察图,有个细节很容易忽视掉**:

我们注意看这个图,**Bert的最后一层(不是那个绿色的输出层)是没有被蒸馏的,这个细节一会会提到**。
## **3. 2.2 怎么蒸馏中间层**
这个时候,需要解决一个问题:我们怎么蒸馏中间层?
仔细想一下Bert的架构,假设最大长度是128,那么我们每一层Transformer encoder的输出都应该是128个单元,每个单元是768维度。
那么在对中间层进行蒸馏的时候,我们需要针对哪一个单元?是针对所有单元还是其中的部分单元?
首先,我们想一下,正常KD进行蒸馏的时候,我们使用的是[CLS]单元Softmax的输出,进行蒸馏。
我们可以把这个思想借鉴过来,一来,对所有单元进行蒸馏,计算量太大。二来,[CLS] 不严谨的说,可以看到整个句子的信息。
为啥说是不严谨的说呢?因为[CLS]是不能代表整个句子的输出信息,这一点我记得Bert中有提到。
## **3.2.3蒸馏层数和学生网络的初始化**
接下来,我想说一个很小的细节点,对比着看上面的模型架构图:
**Bert(老师网络)的最后一层 (Layer 12 for BERT-Base) 在蒸馏的时候是不予考虑**;
原因的话,其一可以这么理解,PKD创新点是从中间层学习知识,最后一层不属于中间层。当然这么说有点牵强附会。
作者的解释是最后一层的隐层输出之后连接的就是Softmax层,而Softmax层的输出已经被KD Loss计算在内了。
比如说,K=5,那么对于两种PKD的模式,被学习的中间层分别是:
PKD-Skip: $I_{pt} = \{2,4,6,8,10\}$;
PKD-Last: $I_{pt} = \{7,8,9,10,11\}$
还有一个细节点需要注意,就是学生网络的初始化方式,直接使用老师网络的前几层去初始化学生网络的参数。
## **3.2.4 损失函数**
首先需要注意的是中间层的损失,作者使用的是MSE损失。如下:

整个模型的损失主要是分为两个部分:KD损失和中间层的损失,如下:

## **3.3. 实验效果**
实验效果可以总结如下:
1. PKD确实有效,而且Skip模型比Last效果稍微好一点。
2. PKD模型减少了参数量,加快了推理速度,基本是线性关系,毕竟减少了层数
除了这两点,作者还做了一个实验去验证:**如果老师网络更大,PKD模型得到的学生网络会表现更好吗**?
这个实验我很感兴趣。
直接上结果图:

KD情况下,注意不是PKD模型,看#1 和#2,在老师网络增加的情况下,效果有好有坏。这个和训练数据大小有关。
KD情况下,看#1和#3,在老师网络增加的情况下,学生网络明显变差。
作者分析是因为,压缩比高了,学生网络获取的信息变少了。
也就是大网络和小网络本身效果没有差多少,但是学生网络在老师是大网络的情况下压缩比大,学到的信息就少了。
更有意思的是对比#2和#3,老师是大网络的情况下,学生网络效果差。
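I did not understand this at first. Looking more closely, the student in #2 is $Bert_{6}[Base]-KD$, i.e. it is initialized from $Bert_{12}[Base]$ and so starts with half of the teacher's layers.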
好的,写到这里
## 4. BERT-of-Theseus
大家好,我是DASOU,今天介绍一下:BERT-of-Theseus
这个论文我觉得还挺有意思,攒个思路。
读完这个文章,BERT-of-Theseus 掌握以下两点就可以了:
1. 基于模块替换进行压缩
2. 除了具体任务的损失函数,没有其他多余损失函数。
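In terms of results, compared with BERT-base, BERT-of-Theseus is $1.94\times$ faster at inference and keeps 98% of the model performance.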
## **4.1 模块替换**
举个例子,比如有一个老师网络是12层的Bert,现在我每隔两层Transformer,替换为学生网络的一层Transformer。那么最后我的学生网络也就变成了6层的小Bert,训练的时候老师网络和学生网络的模块交替训练。
直接看下面这个架构图:

作者说他是受 Dropout 的启发,仔细想了想还真的挺像的。
我们来说一下这样做的好处。
我刚才说每隔老师网络的两层替换为学生网络的一层。很容易就想到PKD里面,有一个是PKD-Skip策略。
就是每隔几层,学生网络的层去学习老师网络对应层的输出,使用损失函数让两者输出接近,使用的是CLS的输出。
在这里提一下蒸馏/压缩的基本思想,一个最朴素的想法就是让学生网络和老师网络通过损失函数在输出层尽可能的靠近。
进一步的,为了提升效果,可以通过损失函数,让学生网络和老师网络在中间层尽可能的靠近,就像PKD这种。
这个过程最重要的就是在训练的时候需要通过损失函数来让老师网络和学生网络尽可能的接近。
如果是这样的话,问题就来了,损失函数的选取以及各自损失函数之前的权重就需要好好的选择,这是一个很麻烦的事情。
然后我们再来看 BERT-of-Theseus,它就没有这个问题。
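During training, for each position it picks either the teacher's module or the corresponding student module, choosing the student's with probability $r$, and puts the chosen one into the forward pass.
In the paper the teacher network is called the $predecessor$ and the student network the $successor$.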
## **4.2 训练过程**
对着这个网络架构,我说一下整体训练的过程:
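1. Train a BERT-base network on the task data as the $predecessor$;
2. Initialize a 6-layer Bert from the first six layers of the $predecessor$ as the $successor$;
3. On the task data, freeze the corresponding weights of the $predecessor$ and train the whole network ($predecessor$ plus $successor$) with replacement probability $r$ (increased linearly to 1 over the steps).
4. To make the $successor$ work as a whole, take it out on its own (setting $r$ to 1 achieves this) and keep fine-tuning it on the training data until the results stop improving.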
简单总结,在训练数据上,老师网络和学生网络共同训练,因为存在概率问题,有的时候是老师网络的部分层加入训练,有的时候是学生网络的部分层加入训练。在这一步训练完成之后,为了保证学生网络作为一个整体(因为在第一步训练的时候大部分情况下学生网络的层都是分开加入训练过程的),在具体任务数据上,对学生网络继续微调,直至效果不再增加。
## **4.3 结果分析**
## **不同方法的损失函数**
论文提供了一个不同Bert蒸馏方法使用的损失函数的图,值得一看,见下图:

Note that $Finetuning$ here should mean taking the first six layers and fine-tuning them on the task.
## **效果**

整体来说,BERT-of-Theseus 思路很简单,效果也还不错。
## 5. Tiny-Bert
大家好,我是DASOU,今天说一下 TinyBert;
[TinyBert](https://openreview.net/pdf?id=rJx0Q6EFPB) 主要掌握两个核心点:
1. 提出了对基于 transformer 的模型的蒸馏方式:Transformer distillation;
2. 提出了两阶段学习框架:在预训练和具体任务微调阶段都进行了 Transformer distillation(两阶段有略微不同);
下面对这两个核心点进行阐述。
## **5.1. Transformer distillation**
## **5.1.1整体架构**
整体架构如下:

Bert不严谨的来划分,可以分为三个部分:词向量输入层,中间的TRM层,尾端的预测输出层。
在这个论文里,作者把词向量输入层 和中间的TRM层统一称之为中间层,大家读的时候需要注意哈。
Bert的不同层代表了学习到了不同的知识,所以针对不同的层,设定不同的损失函数,让学生网络向老师网络靠近,如下:
1. ebedding层的输出
2. 多头注意力层的注意力矩阵和隐层的输出
3. 预测层的输出
## **5.1.2 Transformer 基础知识:**
注意力层:

多头注意力层:

前馈神经网路:

## **5.1.3 Transformer 的蒸馏**
对 Transformer的蒸馏分为两个部分:一个是注意力层矩阵的蒸馏,一个是前馈神经网络输出的蒸馏。
**注意力层矩阵蒸馏的损失函数**:

这里注意两个细节点:
一个是使用的是MSE;
还有一个是,使用的没有归一化的注意力矩阵,见(1),而不是softmax之后的。**原因是实验证明这样能够更快的收敛而且效果会更好**。
**前馈神经网络蒸馏的损失函数**

两个细节点:
第一仍然使用的是MSE.
第二个细节点是注意,学生网路的隐层输出乘以了一个权重矩阵,这样的原因是学生网络的隐层维度和老师网络的隐层维度不一定相同。
所以如果直接计算MSE是不行的,这个权重矩阵也是在训练过程中学习的。
写到这里提一点,其实这里也可以看出来为什么tinybert的初始化没有采用类似PKD这种,而是使用GD过程进行蒸馏学习。
因为我们的tinybert 在减少层数的同时也减少了宽度(隐层的输出维度),如果采用PKD这种形式,学生网络的维度和老师网络的维度对不上,是不能初始化的。
**词向量输入层的蒸馏**:

**预测层输出蒸馏**:

## **5.1.4 总体蒸馏损失函数**

## **5.2. 两阶段蒸馏**
## **5.2.1 整体架构**
整体架构如图:

## **5.2.2 为什么需要GD:**
说一下我自己的理解哈,我觉得有两个原因:
首先,就是上文说到的,tinybert不仅降低了层数,也降低了维度,所以学生网络和老师网络的维度是不符的,所以PKD这种初始化方式不太行。
其次,一般来说,比如PKD,学生网络会使用老师网络的部分层进行初始化。这个从直觉上来说,就不太对。
老师网络12层,学到的是文本的全部信息。学生网络是6层,如果使用老师的12层的前6层进行初始化,这个操作相当于认为这前6层代表了文本的全部信息。
当然,对于学生网络,还会在具体任务上微调。这里只是说这个初始化方式不太严谨。
Tiny bert的初始化方式很有意思,也是用了蒸馏的方式。
老师网络是没有经过在具体任务进行过微调的Bert网络,然后在大规模无监督数据集上,进行Transformer distillation。当然这里的蒸馏就没有预测输出层的蒸馏,翻看附录,发现这里只是中间层的蒸馏。
简单总结一下,这个阶段,使用一个预训练好的Bert( 尚未微调)进行了3epochs的 distillation;
## **5.2.3 TD:**
TD就是针对具体任务进行蒸馏。
核心点:先进行中间层(包含embedding层)的蒸馏,再去做输出层的蒸馏。
老师网络是一个微调好的Bert,学生网络使用GD之后的tinybert,对老师网络进行TD蒸馏。
TD过程是,先在数据增强之后的数据上进行中间层的蒸馏-10eopchs,learning rate 5e-5;然后预测层的蒸馏3epochs,learning rate 3e-5.
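## **5.3. Data augmentation**
Data augmentation is applied when fine-tuning on task-specific data.
(This feels a bit odd to me.)
A few details:
1. For a single-piece word, BERT is used to find the M words closest to the currently masked word; for a word made of multiple sub-word pieces, GloVe and cosine similarity are used to find the M closest words.
2. A probability P decides whether the current word is replaced by one of the replacement words.
3. The above is applied to all text in the task dataset and repeated N times.
The paper gives pseudocode for this procedure.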
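The three steps above can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm verbatim: `candidates_for` is a hypothetical callback standing in for the BERT / GloVe candidate lookup, and the default values of `p_threshold` and `n_aug` are arbitrary.

```python
import random

def augment(tokens, candidates_for, p_threshold=0.4, n_aug=3, seed=0):
    """Return n_aug augmented copies of a token list.

    candidates_for(tokens, i) is a hypothetical function returning the M
    replacement candidates for position i (BERT for single-piece words,
    GloVe + cosine similarity for multi-piece words in the paper).
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_aug):                      # step 3: repeat N times
        new_tokens = list(tokens)
        for i in range(len(tokens)):
            candidates = candidates_for(tokens, i)   # step 1: M candidates
            # step 2: replace with probability p_threshold
            if candidates and rng.random() <= p_threshold:
                new_tokens[i] = rng.choice(candidates)
        augmented.append(new_tokens)
    return augmented

def toy_candidates(tokens, i):
    """Toy stand-in for the model-based lookup."""
    return [tokens[i].upper(), tokens[i] + "_alt"]

for sentence in augment(["the", "movie", "was", "great"], toy_candidates):
    print(sentence)
```

Passing the lookup in as a function keeps the sampling logic separate from whichever model supplies the candidates.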
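## **5.4. Experimental results**
The point I care about most is how much the data augmentation contributes.
The authors did run this experiment, and the data augmentation turns out to matter a lot.
What I would most like to see is a comparison with PKD under the same model architecture. Unfortunately, the authors do not seem to have run such an experiment (or I did not find it).
The TinyBERT parameters here are:
> the number of layers M=4, the hidden size d'=312, the feedforward/filter size d'_i=1200 and the head number h=12.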
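## **5.5. Brief summary**
First, what I learned from reading the paper.
The first thing is how the loss functions for Transformer-layer distillation are designed:
1. MSE is used for both the attention matrices and the feed-forward layer;
2. the attention matrix used for distillation is the unnormalized one;
3. when the dimensions differ, a weight matrix is used to convert between them.
Second, the dimension mismatch means the student cannot be initialized from the teacher BERT. GD solves this by distilling a model with the student's architecture directly from the teacher; this is not the same as learning a student network from scratch.
Then there is the data augmentation. TinyBERT's data augmentation feels fairly crude and somewhat forced, and it is a method designed for English.
In TD, the different layers are distilled separately: the intermediate layers first, then the output layer. The output layer uses soft labels, not hard labels.
This staged distillation is interesting; I had not noticed this detail before.
I saw this passage in Tencent's article:
> Moreover, in the experiments the softmax cross-entropy loss was prone to non-convergence. Changing softmax cross-entropy to MSE improved convergence but hurt generalization. This is because softmax cross-entropy has to learn the whole probability distribution, which is harder to converge, but since it fits the teacher BERT's probability distribution it generalizes better. MSE is sensitive to extreme values and converges faster, but generalizes less well than the former.
That makes sense; worth noting down.
Some material worth reading: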
比 Bert 体积更小速度更快的 TinyBERT - 腾讯技术工程的文章 - 知乎 https://zhuanlan.zhihu.com/p/94359189
================================================
FILE: 深度学习自然语言处理/论文解读/模型训练需不需要将损失降低为零.md
================================================
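This post is a reading of Su Jianlin's article. It mainly records my own understanding of some details omitted in the formula derivation, in the hope of helping everyone understand it better.
When training a model, we split the data into a training set and a dev set.
The loss usually behaves like this: the training loss keeps decreasing, while the dev loss first decreases and then rises.
We usually pick the "crossover" of the two curves (they do not actually cross), i.e. the point where the dev loss starts to rise, as the final model. This gives the best result and also avoids overfitting.
The idea of the paper is this: once the loss has dropped to a certain (sufficiently small) level, change the loss function to: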
$$
\widetilde{J_{\theta}} =|J_{\theta}-b|+b\tag{1}
$$
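In formula $(1)$, $J_{\theta}$ is the original loss function and $\widetilde{J_{\theta}}$ is the modified loss function.
Looking at the formula, it can be described as follows:
1. When $J_{\theta} \geq b$, the loss function is just $J_{\theta}$;
2. When $J_{\theta} < b$, the loss function is $\widetilde{J_{\theta}}=2b - J_{\theta}$.
Now recall the gradient descent update rule: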
$$
\theta_{n}=\theta_{n-1}- \alpha\nabla J(\theta_{n-1})\tag{2}
$$
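So when $J_{\theta} < b$ and the loss function is $\widetilde{J_{\theta}} = 2b - J_{\theta}$, the sign of the gradient flips, and what we are running is no longer gradient descent but gradient ascent. In other words, with $b$ as the threshold, gradient ascent and gradient descent alternate.
The paper finds that on some tasks, with this method, the loss on the dev set goes through a second descent.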
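The modified loss in formula (1) and the sign flip of its gradient can be written down in a few lines. This is a plain-Python sketch with scalar values; the function names are my own, and the behavior exactly at $J_{\theta} = b$ (where the absolute value has no derivative) is an arbitrary choice.

```python
def flooded_loss(loss, b):
    """Modified loss from formula (1): |J - b| + b."""
    return abs(loss - b) + b

def flooded_grad(grad, loss, b):
    """Gradient of the modified loss, given the gradient of the original loss.

    Above the threshold b the gradient is unchanged (gradient descent);
    below it the sign flips (gradient ascent on the original loss).
    """
    return grad if loss >= b else -grad

b = 0.1
for loss in (0.5, 0.1, 0.05):
    print(loss, flooded_loss(loss, b), flooded_grad(2.0, loss, b))
```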
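To repeat, Su Jianlin gives a mathematical derivation for this point (the reference link is at the end of the post). However, he skips a step involving a Taylor expansion, so I add a short supplement here to help myself and everyone else understand it.
First, if gradient descent and gradient ascent alternate, the parameter updates are: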
$$
\theta_{n}=\theta_{n-1}- \alpha\nabla J(\theta_{n-1})
$$
$$
\theta_{n+1}=\theta_{n}+ \alpha\nabla J(\theta_{n}) \tag{3}
$$
Eliminating $\theta_{n}$ between the two equations gives:
$$
\theta_{n+1}= \theta_{n-1}- \alpha\nabla J(\theta_{n-1})+ \alpha\nabla J( \theta_{n-1}- \alpha\nabla J(\theta_{n-1}))\tag{4}
$$
For formula $(4)$, the key is to analyze and simplify the term **$ J(\theta_{n-1}- \alpha\nabla J(\theta_{n-1}))$**, which is where the Taylor expansion comes in.
Recall Taylor's formula; here is the first-order expansion:
$$
J(\omega) \approx J(\omega_{0}) + (\omega - \omega_{0})*J^{'}(\omega_{0}) + \epsilon \qquad (\omega_{0} \text{ and } \omega \text{ sufficiently close})\tag{5}
$$
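Now look closely at **$ J(\theta_{n-1}- \alpha\nabla J(\theta_{n-1}))$** and formula $(5)$.
First, we know that $\alpha\nabla J(\theta_{n-1})$ is the increment of each parameter update. When the loss is small enough, this increment can be regarded as a very small quantity.
In other words, $\theta_{n-1}- \alpha\nabla J(\theta_{n-1})$ corresponds to $\omega_{0}$ in formula (5), and $\theta_{n-1}$ corresponds to $\omega$ in formula (5).
That is,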
$$
J( \theta_{n-1}) \approx J( \theta_{n-1}- \alpha\nabla J(\theta_{n-1})) +\alpha\nabla J(\theta_{n-1})*\nabla J(\theta_{n-1}) \tag{6}
$$
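Note where the last factor $\nabla J(\theta_{n-1})$ in this formula comes from. Strictly speaking, it should be the derivative at $\omega_{0}$, i.e. $J^{'}(\omega_{0})$. But because $\omega$ and $\omega_{0}$ are very close, we can directly use the derivative at $\omega$ instead. This is an important detail.
Based on this, we can continue the derivation: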
$$
\theta_{n+1}= \theta_{n-1}- \alpha\nabla J(\theta_{n-1})+ \alpha\nabla J( \theta_{n-1}- \alpha\nabla J(\theta_{n-1}))
$$
$$
\approx \theta_{n-1}- \alpha\nabla J(\theta_{n-1})+ \alpha\nabla (J (\theta_{n-1})- \alpha\nabla J(\theta_{n-1})*\nabla J(\theta_{n-1}))
$$
$$
= \theta_{n-1}- \alpha\nabla J(\theta_{n-1})+ \alpha\nabla J (\theta_{n-1})- \alpha^{2}\nabla (\nabla J(\theta_{n-1})*\nabla J(\theta_{n-1}))
$$
$$
= \theta_{n-1}- \alpha^{2}\nabla ||\nabla J(\theta_{n-1})||^{2} \tag{7}
$$
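One more point about $||\nabla J(\theta_{n-1})||^{2}$: it is the product of two gradient terms, so the result is itself a function of the parameters. We compute the gradient, square its norm, and at update time plug in the current parameter values.
Looking at formula (7), we find something curious: the form of the parameter update has not changed; it is still gradient descent (at the start, judging from the loss function alone, we said that gradient descent and gradient ascent alternate; both views are fine).
The only difference is that the current update no longer depends on the previous step, but on the parameters from the step before that.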
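The two-step picture can be checked numerically on a toy problem. The sketch below assumes a quadratic loss $J(\theta) = \frac{1}{2}\theta^{\top}A\theta$ with an arbitrary symmetric matrix `A` and step size `alpha` (both chosen only for illustration), runs one descent step followed by one ascent step, and compares the net move with a step along $-\nabla||\nabla J||^{2}$. On this example the net move equals $-\frac{\alpha^{2}}{2}\nabla||\nabla J(\theta_{n-1})||^{2}$: expanding $\nabla J(\theta_{n})$ directly gives a coefficient of $\alpha^{2}/2$ rather than the $\alpha^{2}$ in formula (7), while the direction of the update is the same.

```python
import numpy as np

# Toy quadratic loss J(theta) = 0.5 * theta^T A theta (A is an assumption).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def grad_J(theta):
    """Gradient of the quadratic loss."""
    return A @ theta

def grad_of_grad_norm_sq(theta):
    """Gradient of ||grad J||^2 = theta^T A^2 theta, i.e. 2 A^2 theta."""
    return 2.0 * A @ A @ theta

alpha = 0.01
theta_prev = np.array([1.0, -2.0])

theta_n = theta_prev - alpha * grad_J(theta_prev)   # descent step
theta_next = theta_n + alpha * grad_J(theta_n)      # ascent step

net_step = theta_next - theta_prev
half_coeff_step = -0.5 * alpha ** 2 * grad_of_grad_norm_sq(theta_prev)

print(net_step)
print(half_coeff_step)
```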
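Here is how I understand this. With the ordinary loss function, it is as if we stand at the previous step and look down the mountain. When the loss is very small, we may well be stuck in a local minimum rather than the global optimum, and the update direction we find is still confined to that local optimum. With the new loss function, the most direct impression from the formula is that we stand at the step before the previous one, outside the current line of sight (even if only by one step). The field of view is wider, so there is a better chance of jumping out of the current local optimum and finding the global one.
That is my understanding; Su Jianlin gives a different explanation, which you can read for yourself. This post mainly spells out the Taylor expansion that his derivation skips, written down so that it is easier for me and everyone else to follow.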
================================================
FILE: 深度学习自然语言处理/词向量/CBOW和skip-gram相较而言,彼此相对适合哪些场景.md
================================================
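CBOW vs. Skip-gram: which scenarios is each better suited to?
One-sentence conclusion first: CBOW trains faster than Skip-gram, but Skip-gram yields better word vector representations.
Why?
The two optimization methods are only approximations of softmax and do not affect the conclusion, so for clarity we ignore them in this discussion.
Take one sentence as an example: “I / forever / love / China / Communist Party”
Start with CBOW. It uses the surrounding words to predict the center word. If “love” is the center word and the others are context words, then the center word “love” is predicted only once.
With Skip-gram, again with “love” as the center word and the others as context words, we make one prediction for every context word, and every prediction updates the word vector once. So compared with CBOW, the word vector is updated 2k times (with window size k, the window including the center word contains 2k+1 words).
Think about whether that is right: Skip-gram is trained more times, so its word vectors become richer.
If the corpus contains many low-frequency words, Skip-gram gives better vectors for those words, at the cost of longer training.
In short, going back to a training window of size k (2k+1 words in total), CBOW trains once while Skip-gram trains 2k times. Naturally Skip-gram's word vectors are somewhat more accurate, and training is correspondingly slower.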
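The counting argument above can be made concrete with a small sketch that enumerates the training examples each model sees. The helper names are my own, and the code ignores negative sampling and hierarchical softmax, as the text does.

```python
def cbow_examples(tokens, k):
    """One (context_words, center_word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        if context:
            examples.append((context, center))
    return examples

def skipgram_examples(tokens, k):
    """One (center_word, context_word) example per pair inside the window."""
    return [(center, ctx)
            for context, center in cbow_examples(tokens, k)
            for ctx in context]

tokens = ["I", "forever", "love", "China", "Party"]
k = 2
cbow = cbow_examples(tokens, k)
skipgram = skipgram_examples(tokens, k)

print(len(cbow), len(skipgram))
print([pair for pair in skipgram if pair[0] == "love"])
```

For the center word "love" with a full window, CBOW produces one example and Skip-gram produces 2k = 4.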
Corrections are welcome.
================================================
FILE: 深度学习自然语言处理/词向量/Fasttext解读(1).md
================================================
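Let me start with a small question that probably puzzles many people.
After reading many articles, some say fasttext is a simple variant of CBOW and others say it is a variant of Skip-gram. Which is right?
With this question in mind, let's talk about Fasttext. Fasttext involves two papers:
1. The first is Bag of Tricks for Efficient Text Classification (201607). It addresses text classification with Fasttext.
2. The second is Enriching Word Vectors with Subword Information (201607). It addresses training word vectors with Fasttext.
This post focuses on Bag of Tricks for Efficient Text Classification, so it is mainly about text classification.
Used for text classification, Fasttext strikes a balance between speed and accuracy: on a standard multicore CPU it can train on more than one billion words in under ten minutes, and it can classify 500K sentences among 312K classes in under a minute.
Put that way it is not very tangible, so let's do a quick calculation. Assume each sentence has 20 words; then one billion words is 50 million sentences. In other words, on a multicore CPU it trains on about 5 million sentences per minute.
Compare with BERT: on a GPU, 8 hours covers roughly 3 million examples. By comparison, Fasttext is really fast.
In this paper, i.e. for Fasttext used as a text classifier, the architecture is CBOW.
To stress it once more: for text classification, Fasttext's architecture is a variant of CBOW. (I do not mean that Skip-gram cannot be used, only that CBOW is easier to understand in the text classification setting.)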
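There are two differences from Word2vec's CBOW:
1. First, the class label replaces the center word.
2. Second, all the words in the sentence are used as input, not just the words inside a sliding window.
Both points are easy to arrive at if we think it through ourselves.
Why? Take the second point first. The goal is to classify a text, so the input has to cover the sentence as a whole in order to be a feature representation of that sentence.
Now the first point. In Word2vec the center word is the output, and a Huffman tree is used as the output layer.
The vectors on the non-leaf nodes serve the binary classifications, and the leaf nodes correspond to the vectors of all words in the vocabulary. Both kinds of vectors are trained along with the model.
For classification, consider how the leaf nodes and non-leaf nodes change.
The leaf nodes now correspond to all the classes. If we have 5000 classes, that is like having a vocabulary of 5000 words in Word2vec. Check for yourself that the correspondence holds.
The non-leaf nodes hardly change, because they carry no real meaning of their own and only serve the binary classifications.
One more remark: in word2vec, the leaf nodes, i.e. the word vectors, are what we ultimately want after training. In fasttext we do not actually use them, because we are classifying text and only need to save the model weights so that we can predict.
A word on initialization: word vectors can be randomly initialized at the start of training, and after training we use those same vectors (updated during training) directly at prediction time.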
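Another interesting point in this paper is the N-gram: the input of the fasttext model is not just the individual words. To bring in word-order information, n-gram features are added as well.
One detail to note is that the n-grams here are at the word level, not the char level. For Chinese, this corresponds to words after segmentation, not characters. Personally I find this mapping not so easy to make sense of.
Character n-grams in Chinese also seem to carry some order information, whereas char-level n-grams in English can hardly provide word-order information for the sentence. Just get the general idea.
Another issue: once n-grams are used, the vocabulary certainly grows. The more values of n you include (n=1, n=2, n=3 and so on), the larger the vocabulary.
This leads to very slow training, which is easy to understand: the more parameters, the slower the training.
How is this solved? With hashing.
A simple, not necessarily accurate example: with “I / love / China / Communist Party”, suppose 'I', 'love', 'China' and 'Communist Party' are all represented by the same parameter (this is unlikely in practice; it is just for illustration). Then updating that single parameter updates all four words, which is of course faster.
But this raises a problem of accuracy. I wonder whether anyone else noticed that this is quite similar to ALBERT; to me, hashing feels a bit like parameter sharing.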
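The hashing trick described above can be sketched as follows: word n-grams are mapped into a fixed number of buckets, and n-grams that land in the same bucket share one parameter vector. This is only an illustration; `zlib.crc32` is a stand-in hash chosen because it is deterministic, not the hash function fastText itself uses, and the bucket count is arbitrary.

```python
import zlib

def word_ngrams(tokens, n):
    """All contiguous word-level n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bucket_id(ngram, num_buckets):
    """Map an n-gram to one of num_buckets shared parameter slots."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets

tokens = ["I", "love", "China", "Party"]
bigrams = word_ngrams(tokens, 2)
num_buckets = 8   # tiny on purpose, so collisions are likely

for gram in bigrams:
    print(gram, "->", bucket_id(gram, num_buckets))
```

The parameter table now has `num_buckets` rows no matter how many distinct n-grams the corpus contains; the price is that colliding n-grams can no longer be told apart, which is the accuracy concern mentioned in the text.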
================================================
FILE: 深度学习自然语言处理/词向量/Fasttext解读(2).md
================================================
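This post is mainly about the paper Enriching Word Vectors with Subword Information.
With the previous post as groundwork ([click here for the previous post](https://mp.weixin.qq.com/s?__biz=MzIyNTY1MDUwNQ==&mid=2247483925&idx=1&sn=9b980a4fb55fd55f92684e403188f024&chksm=e87d3033df0ab925e1ebdc3c89637974645698f1680fd1ad25e23db05903e2061ef8242f16a1&token=509904673&lang=zh_CN#rd)), this paper is fairly easy to understand, so this post is short.
Here is the core of the paper: on top of skip-gram with negative sampling, treat each center word as a set of subwords and learn vectors for the subwords.
The key concept in that sentence is the subword, which is also the core of the paper.
For example, suppose the center word is "where" and the subword size is 3. The subword set then has two parts; note, two parts.
The first part looks like this: “<wh”, “whe”, “her”, “ere”, “re>”. The second part is the special subword, i.e. the whole word “<where>”.
In model terms: originally the input was the vector of “where”; in Fasttext it is the sum of the vectors of all its subwords.
Note that this means all subwords, including the special subword, i.e. the whole word.
For the context words, the whole word is used directly.
In short, the input layer uses subwords (ordinary subwords plus the whole word), and the output layer uses whole words.
What about OOV words? Represent them by the sum of the vectors of their ordinary subwords.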
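The subword construction for "where" can be reproduced with a short sketch. The function name is my own; the `<` and `>` boundary markers follow the example in the text, and only a single n-gram size is used here to match it.

```python
def subwords(word, n=3):
    """Character n-grams of a word with boundary markers, plus the whole word.

    The last element is the special subword (the whole word); the rest are
    the ordinary subwords that can also be summed for an OOV word.
    """
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(subwords("where"))
```

For an out-of-vocabulary word, only the ordinary n-grams (everything except the last element) have trained vectors, which is why their sum is used as the word's representation.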
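The subwords here sound a lot like the n-grams of the previous post, but these are character n-grams, and what they address is not word order; they exploit regularities in word morphology.
For Chinese, the counterpart would be radicals. I recall that Alibaba released a Chinese variant of fasttext that is trained on radicals. Have a look if you are interested.
That's it. To summarize the two posts and answer the question from the beginning: when training word vectors, fasttext generally uses a variant of the Skip-gram model; when used for text classification, it generally uses a variant of CBOW.
I want to stress that the previous paragraph describes the typical case, to make things easier to follow. It does not mean that the CBOW architecture cannot train word vectors or that skip-gram cannot be used for text classification. Keep that in mind.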
================================================
FILE: 深度学习自然语言处理/词向量/README.md
================================================
## Word vectors
### Word2vec
### Fasttext
### Glove
================================================
FILE: 深度学习自然语言处理/词向量/Word2vec为什么需要二次采样?.md
================================================
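Why does Word2vec need subsampling?
When sampling in Word2vec comes up, the first thing that comes to mind is negative sampling, which is an approximate training method for Word2vec.
There is in fact another sampling method involved: subsampling.
In the simplest terms, subsampling means that every word in the text is deleted with some probability. This probability depends on word frequency: the more frequent the word, the more likely it is to be deleted.
The subsampling formula, in its usual form, gives the probability of discarding word $w$:
$$
P(w) = \max\left(1 - \sqrt{\frac{t}{f(w)}},\ 0\right)
$$
Note: $t$ is a hyperparameter, and $f(w)$ in the denominator is the ratio of the count of word $w$ to the total number of words.
First, why do we need to subsample the text data at all?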
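A simple example: “他/是/个/优秀/的/学生” ("he is an excellent student", where 的 is a very frequent particle). Suppose the center word is "学生" (student) and the context word is "的".
This context word does essentially nothing for the center word; it adds no semantic information.
Yet a high-frequency word like “的” shows up very often, so the information in this sentence is redundant.
In other words, within a context window, a word co-occurring with a lower-frequency word is more useful for training the embedding model than the same word co-occurring with a higher-frequency word.
An analogy from everyday life: disciplined, excellent people are rare, while people who have given up and make no effort are common. Naturally, having excellent people around us is more beneficial to our own growth.
So the idea is to reduce how often we meet the latter, keep our distance from them, and raise the probability that excellent people appear in our lives.
What does the text data look like after subsampling?
Take the same sentence, “他/是/个/优秀/的/学生”. It now becomes “他/是/个/优秀/学生”. In other words, the high-frequency word “的” has disappeared from the training data.
Of course, as said above, this disappearance is probabilistic. The word may still be present in another sentence later on, but its frequency in the text is certainly lower than before.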
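Subsampling can be sketched in a few lines of Python. This assumes the discard probability $\max(1 - \sqrt{t / f(w)}, 0)$ with $t = 10^{-4}$ as an illustrative default; the toy corpus, with "the" standing in for a very frequent word, is made up for the example.

```python
import math
import random
from collections import Counter

def discard_prob(word, counts, total, t=1e-4):
    """Probability of dropping one occurrence of `word`."""
    f = counts[word] / total          # relative frequency f(w)
    return max(1.0 - math.sqrt(t / f), 0.0)

def subsample(tokens, t=1e-4, seed=0):
    """Drop each token independently with its discard probability."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() >= discard_prob(w, counts, total, t)]

# Toy corpus: one very frequent word and one rare word.
tokens = ["the"] * 9999 + ["student"]
counts, total = Counter(tokens), len(tokens)

print(discard_prob("the", counts, total))
print(discard_prob("student", counts, total))
print(len(subsample(tokens)))
```

The frequent word is dropped most of the time while the rare word is kept, so the surviving pairs are dominated by more informative co-occurrences.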
================================================
FILE: 深度学习自然语言处理/词向量/Word2vec模型究竟是如何获得词向量的.md
================================================
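How exactly does the Word2vec model obtain word vectors?
A question for everyone: how does the Word2vec model obtain word vectors?
Many articles explain it like this: a word is one-hot encoded and passed through a hidden layer, and after training the input-to-hidden matrix is the table of word vectors.
I would not say that this explanation is wrong, but I think it is imprecise.
As said in earlier posts, Word2vec has no hidden layer. CBOW has a projection layer, which simply sums and averages; Skip-gram has no projection layer, and the center word is connected directly to a Huffman tree.
So the hidden-layer weight matrix that many articles talk about does not really exist.
In that case, where do the word vectors come from?
From the source code's point of view, a word vector is initialized for every word and used as input. This vector is updated during training, and its dimension is whatever we choose, say 200.
Taking Skip-gram as an example, the input for the center word is not really a one-hot encoding but a randomly initialized word vector that is updated as the model trains.
One detail to note: every word has two vectors, one used when it is the center word and one used when it is a context word. People generally use the first.
Note the difference from the center and context vectors in Glove. In Glove the two are equivalent in theory, and only differ slightly in the end because of different initialization.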
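The two descriptions (one-hot times a weight matrix vs. a randomly initialized vector per word) can be reconciled with a tiny NumPy sketch: multiplying a one-hot vector by the embedding table simply selects one row. The vocabulary, dimension and initialization scale below are illustrative assumptions, not word2vec's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["he", "is", "an", "excellent", "student"]
dim = 4

# Two tables: one vector per word as center word, one as context word.
center_vectors = rng.normal(scale=0.1, size=(len(vocab), dim))
context_vectors = rng.normal(scale=0.1, size=(len(vocab), dim))

idx = vocab.index("student")
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0

via_matmul = one_hot @ center_vectors   # the "one-hot through a matrix" view
via_lookup = center_vectors[idx]        # the "look up the word's vector" view

print(via_matmul)
print(via_lookup)
```

In practice only the lookup is performed; the rows of `center_vectors` are the parameters that training updates.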
================================================
FILE: 深度学习自然语言处理/词向量/Word2vec的负采样.md
================================================
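Negative sampling in Word2vec
Characteristics of negative sampling
With negative sampling, we only update the weights of the sampled set, which reduces the amount of training computation. As for quality, a center word is generally related only to its context, so updating the weights of other words matters little. The computation goes down without hurting the results.
Implementation details of negative sampling
My own summary: build two line segments. The first is cut into as many pieces as there are words in the vocabulary, and the length of each piece is proportional to the word's frequency.
The second segment is divided evenly into M cells. Draw a random integer, see which cell of the second segment it falls in, and take the word of the first segment covering that position. If it is the word itself, skip it.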
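The two-segment construction can be sketched as a lookup table. This is an illustration under assumptions: the helper names are my own, `table_size` plays the role of M, and `power=1.0` follows the description above (length proportional to frequency); the original word2vec implementation, as far as I know, raises frequencies to the power 0.75 and uses a much larger table.

```python
import random

def build_table(freqs, table_size=1000, power=1.0):
    """Fill `table_size` equal cells; each word owns a share of cells
    proportional to freq ** power."""
    words = list(freqs)
    weights = [freqs[w] ** power for w in words]
    total = sum(weights)
    table = []
    cumulative = 0.0
    for word, weight in zip(words, weights):
        cumulative += weight / total
        while len(table) < table_size and len(table) / table_size < cumulative:
            table.append(word)
    while len(table) < table_size:      # guard against floating-point rounding
        table.append(words[-1])
    return table

def sample_negatives(table, target, k, rng):
    """Draw k noise words, skipping the target word itself.

    Assumes the table contains at least one word other than `target`.
    """
    negatives = []
    while len(negatives) < k:
        word = table[rng.randrange(len(table))]
        if word != target:
            negatives.append(word)
    return negatives

table = build_table({"a": 5, "b": 3, "c": 2}, table_size=10)
print(table)
print(sample_negatives(table, "a", 5, random.Random(0)))
```

Drawing a random integer and indexing the table is a constant-time operation, which is the point of precomputing the cells.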
Corrections are welcome.
================================================
FILE: 深度学习自然语言处理/词向量/Word2vec训练参数的选定.md
================================================
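How to choose Word2vec training parameters?
First, according to the task, choose a corpus from a similar domain; under that condition, the larger the corpus the better. Then download a recent version of word2vec (updated in September 2014). Use the Skip-gram model when the corpus is small (fewer than 100 million words, roughly a 500MB text file) and the CBOW model when the corpus is large. Finally, remember to set the number of iterations to 30 to 50 and the dimension to at least 50, and that is it. (Quoted from "How to Generate a Good Word Embedding")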
================================================
FILE: 深度学习自然语言处理/词向量/word2vec两种优化方式的联系和区别.md
================================================
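**Writing these summaries takes effort; please give it a like, thanks**
The previous post, [Word2vec-负采样/霍夫曼之后模型是否等价-绝对干货](http://mp.weixin.qq.com/s?__biz=MzIyNTY1MDUwNQ==&mid=2247483837&idx=1&sn=9b334c5db352acc298aa6bb0e5e41ce9&chksm=e87d339bdf0aba8d1e79d7ef88461a6914d0a74330e744b501394d7ad9aa0c1d8a65382ce80b&scene=21#wechat_redirect), covers a real ByteDance interview question. I suggest reading it several times and getting in touch if you have questions.
A few friends were still a bit confused after reading it and asked me about the details in private. So I wrote this short follow-up, hoping to make things clear; it can be read together with the previous post.
Let's look at the question again: after applying Huffman (hierarchical softmax) or negative sampling, is W2V equivalent to the original model, or only similar to it?
First, we need to be clear about what the original model is. It is W2V without any optimization (as we have said, W2V is really a tool rather than a model).
That is, the original version that only uses the Skip-gram or CBOW model without optimization. This original version applies a Softmax in the last layer.
The core of our objective function is the probability of generating the correct context word given the center word, which we want to maximize. In the usual notation, with $v_c$ the center-word vector and $u_o$ the context-word vector, it reads:
$$
P(w_o \mid w_c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{i \in V} \exp(u_i^{\top} v_c)}
$$
Look closely: the denominator involves $V$, which is our vocabulary. In other words, to compute this conditional probability we have to go over the whole vocabulary, so the complexity is O(|V|).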
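So negative sampling and Huffman both optimize this computationally expensive step. To reduce computation, W2V also removes the hidden layer: CBOW sums and averages the input vectors and feeds the result directly to the Huffman tree, and Skip-gram feeds the center word's vector directly to the Huffman tree.
That is not the focus of this post, so I will not go into it.
Negative sampling first. Its essence lies in generating K noise words: given the center word, the probability of generating the correct context word should be 1 and the probability of generating a noise word should be 0. That is the direction we optimize in. In its usual form:
$$
P(w_o \mid w_c) \approx P(D=1 \mid w_c, w_o) \prod_{k=1,\ w_k \sim P(w)}^{K} P(D=0 \mid w_c, w_k)
$$
Look closely at this formula: $V$ has disappeared and is replaced by $K$, the number of noise words. In other words, the complexity is now bounded by $K$ and drops to O(K).
Then hierarchical Softmax. Its essence is to perform binary classification repeatedly along one path; multiplying the probabilities gives the conditional probability. In its usual form, where $n(w_o, j)$ is the $j$-th node on the path from the root to the leaf $w_o$, $L(w_o)$ is the number of nodes on that path, and $[\![x]\!]$ is $1$ if $x$ is true and $-1$ otherwise:
$$
P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \mathrm{leftChild}(n(w_o, j)) ]\!] \cdot u_{n(w_o, j)}^{\top} v_c \right)
$$
Note that $V$ has disappeared from this formula as well; the computation is bounded by the path to the context word in the Huffman tree. This is what the previous post meant when it said the complexity becomes the height of the binary tree: O(log|V|).
Since only part of the nodes are involved, the result is of course an approximation of the original version.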
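The difference in cost between the three forms can be seen in a small NumPy sketch. All names, sizes and the random vectors below are illustrative assumptions: `U` holds one output vector per vocabulary word, `inner_vectors` holds one vector per non-leaf tree node, and a path is given as node indices plus signs (+1 for "go left", -1 for "go right").

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_softmax_prob(v_c, U, o):
    """P(w_o | w_c) with the full softmax: touches all |V| rows of U."""
    scores = U @ v_c
    scores = scores - scores.max()      # numerical stability
    exp = np.exp(scores)
    return float(exp[o] / exp.sum())

def negative_sampling_prob(v_c, U, o, noise_ids):
    """Approximation that touches only 1 + K rows of U."""
    p = sigmoid(U[o] @ v_c)             # P(D=1 | w_c, w_o)
    for k in noise_ids:
        p *= sigmoid(-(U[k] @ v_c))     # P(D=0 | w_c, w_k) = 1 - sigmoid(x)
    return float(p)

def hierarchical_softmax_prob(v_c, inner_vectors, path_nodes, path_signs):
    """Product of binary decisions along one root-to-leaf path."""
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * (inner_vectors[node] @ v_c))
    return float(p)

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
U = rng.normal(scale=0.1, size=(vocab_size, dim))
v_c = rng.normal(scale=0.1, size=dim)
inner_vectors = rng.normal(scale=0.1, size=(3, dim))  # balanced tree, 4 leaves

print(full_softmax_prob(v_c, U, 3))
print(negative_sampling_prob(v_c, U, 3, [10, 20, 30, 40, 50]))
print(hierarchical_softmax_prob(v_c, inner_vectors, [0, 1], [+1, -1]))
```

The full softmax reads every row of `U`, negative sampling reads 1 + K rows, and hierarchical softmax reads one vector per node on the path.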
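A brief summary:
It can be understood like this. Taking the skip-gram model as an example, the conditional probability is the probability of the center word generating a context word, which is the core of our objective. Without optimization, the denominator involves the whole vocabulary and training is expensive. Negative sampling is approximate training that bounds the complexity by the k noise words. Hierarchical softmax is also approximate training: its conditional probability is a sequence of binary classifications that only involves the non-leaf nodes on the path to the context word and no other nodes. In this respect it is similar to negative sampling; both reduce the complexity from the whole vocabulary, except that negative sampling is bounded by k while hierarchical softmax is bounded by the path encoding, something like (0,1,1,1,0).
**You may have noticed that negative sampling and Huffman both turn the Softmax into binary classification problems and thereby reduce the complexity. Negative sampling does binary classification on whether a word is a context word; Huffman does binary classification on whether a node is on the correct path. This is not entirely rigorous, but that is the idea.**
**Writing these summaries takes effort; please give it a like, thanks**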
================================================
FILE: 深度学习自然语言处理/词向量/史上最全词向量面试题梳理.md
================================================
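**WeChat official account: NLP从入门到放弃**
1. Have you trained Word2vec on your own data? Describe the process in detail, including but not limited to: how the corpus was obtained and cleaned, the corpus size, the choice of hyperparameters and the reasons for it, the vocabulary size and dimension, the training time, and other details.
2. How does the Word2vec model obtain word vectors? What is your understanding of word embeddings? How do you understand the distributional hypothesis?
3. How do you evaluate the quality of trained word vectors?
4. How can the Word2vec model be trained incrementally?
5. Give an overview of the details of the word2vec model, including but not limited to: the two models and the two optimization methods (an overview is enough; detailed questions follow).
6. Explain the procedure of hierarchical softmax (CBOW and Skip-gram).
7. Based on 6, follow-up: how does the model obtain its input layer, is there a hidden layer, and what does the output layer look like?
8. Based on 6, follow-up: why is a Huffman tree chosen for the output layer, what are its advantages, and why not some other binary tree?
9. Based on 6, follow-up: what is the complexity of the model, what are the objective functions, and how are the gradients updated (especially the gradient of the input vectors)?
10. Based on 6, follow-up: what are the drawbacks of the hierarchical softmax model?
11. What are the advantages of the negative sampling model (why use negative sampling)?
12. How is negative sampling applied to the input (what are the implementation details of negative sampling)?
13. What are the objective functions of the negative sampling model (CBOW and Skip-gram)?
14. CBOW vs. skip-gram: which scenarios is each better suited to?
15. Have you used Word2vec to compute sentence similarity? How well did it work, and what details can you share?
16. Describe Glove in detail. How is it trained? What are its advantages? In which scenarios is it suitable? How does it differ from Word2vec (for example in the loss function)?
17. Describe Fasttext in detail. What does each layer represent? How does it differ from Word2vec? When is the Fasttext model appropriate?
18. What is the principle behind ELMO? How are its two stages applied (how is the first stage pre-trained, and how is the second stage used in downstream tasks)?
19. What is ELMO's loss function? Is it a bidirectional language model? Why?
20. What are the advantages and disadvantages of ELMO? Why can it handle polysemy?
Resources consulted for these word vector interview questions: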
word2vec、glove、cove、fastext以及elmo对于知识表达有什么优劣? - 霍华德的回答 - 知乎
https://www.zhihu.com/question/292482891/answer/492247284
面试题:Word2Vec中为什么使用负采样? - 七月在线 七仔的文章 - 知乎
https://zhuanlan.zhihu.com/p/66088781
关于word2vec,我有话要说 - 张云的文章 - 知乎
https://zhuanlan.zhihu.com/p/29364112
word2vec(二):面试!考点!都在这里 - 我的土歪客的文章 - 知乎
https://zhuanlan.zhihu.com/p/133025678
nlp中的词向量对比:word2vec/glove/fastText/elmo/GPT/bert - JayLou娄杰的文章 - 知乎
https://zhuanlan.zhihu.com/p/56382372
史上最全词向量讲解(LSA/word2vec/Glove/FastText/ELMo/BERT) - 韦伟的文章 - 知乎
https://zhuanlan.zhihu.com/p/75391062
word2vec详解(CBOW,skip-gram,负采样,分层Softmax) - 孙孙的文章 - 知乎
https://zhuanlan.zhihu.com/p/53425736
Word2Vec详解-公式推导以及代码 - link-web的文章 - 知乎
https://zhuanlan.zhihu.com/p/86445394
关于ELMo,面试官们都怎么问:https://cloud.tencent.com/developer/article/1594557
================================================
FILE: 深度学习自然语言处理/词向量/聊一下Glove.md
================================================
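This post takes about 4.75 minutes to read.
Two questions first; see whether you can answer them:
1. Does Glove represent a word with the center word vector, the context word vector, or something else?
2. Can you summarize the core idea of Glove and of Fasttext in one sentence each?
Glove first. The full name is Global Vectors for Word Representation. What it does can be summarized as: based on the whole corpus, collect word co-occurrence statistics and learn word representations.
From the corpus we obtain the co-occurrence count matrix X. Each element $x_{ij}$ of the matrix is the number of times word $x_{j}$ appears in the context of word $x_{i}$. Note that the co-occurrence matrix is symmetric.
This is very important and also easy to see: the number of times word A appears around word B must equal the number of times word B appears around word A.
By analogy with Word2vec, word $x_{i}$ is the center word and word $x_{j}$ is the context word.
In theory, the vector a word learns as a center word and the vector it learns as a context word should be exactly the same.
In practice, because of different initialization, the two vectors we end up with differ slightly.
To make the model more robust, Glove uses the sum of the two vectors as the representation of a word.
This differs from Word2vec. In Word2vec, the center word vector and the context word vector are not equivalent, and we generally use the center word vector as the final semantic representation of a word.
The derivation in the Glove paper is not very rigorous. Roughly, it starts from a regularity observed in large corpora: ratios of conditional probabilities express the relationship between words quite intuitively.
One can then construct a function of word vectors to fit these ratios. The derivation is extended step by step from there; as I see it, it looks for one possible form under a number of assumptions.
I will not go into the details here and just give Glove's loss function: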
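$\sum_{i \in V} \sum_{j \in V} h(x_{ij})\left(u_{j}^{\top}v_{i}+b_{i}+b_{j}-\log(x_{ij})\right)^2$
Let's look at this loss function in detail. It is a squared loss and well worth thinking about.
I split it into three parts: the weight function $h(x)$, the word vector expression $u_{j}^{\top}v_{i}+b_{i}+b_{j}$, and the logarithm of the co-occurrence count $\log(x_{ij})$.
A side remark on how I see it: GloVe works on unlabeled text, so it is an unsupervised training process. Judging from the loss function, though, I think it can be viewed as supervised.
The label is $\log(x_{ij})$, which is computed from the corpus. In other words, before training we can process the corpus and obtain the corresponding labels, so I personally think of it as a supervised process.
In one sentence, the loss function says: as the model keeps optimizing, the word vector expression keeps fitting the logarithm of the co-occurrence count.
$h(x)$ is the weight function. It expresses how important a word pair is and is monotonically increasing with values in [0,1]. Intuitively, the more often a pair of words co-occurs, the more important it is.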
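The function given in the paper is as follows: when $x < c$ (for example $c = 100$), $h(x) = (x/c)^{\alpha}$ (for example $\alpha = 0.75$); otherwise $h(x) = 1$.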
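The weight function and the loss can be put together in a short NumPy sketch. It assumes the weight function $h(x) = (x/c)^{\alpha}$ for $x < c$ and $1$ otherwise, with $c = 100$ and $\alpha = 0.75$, and it follows the notation of the loss above (a single bias vector `b` used for both $b_i$ and $b_j$). The tiny co-occurrence matrix and the random vectors are made up for the example.

```python
import numpy as np

def weight(x, c=100.0, alpha=0.75):
    """Weight h(x): grows with the co-occurrence count and saturates at 1."""
    return np.where(x < c, (x / c) ** alpha, 1.0)

def glove_loss(X, U, V, b):
    """Weighted squared error between u_j^T v_i + b_i + b_j and log x_ij.

    Only non-zero counts are visited: h(0) = 0, so zero entries contribute
    nothing, and log(0) is avoided.
    """
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        pred = U[j] @ V[i] + b[i] + b[j]
        loss += weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2
    return float(loss)

rng = np.random.default_rng(0)
X = np.array([[0.0, 12.0, 150.0],
              [12.0, 0.0, 3.0],
              [150.0, 3.0, 0.0]])      # symmetric toy co-occurrence counts
dim = 4
U = rng.normal(scale=0.1, size=(3, dim))   # context vectors
V = rng.normal(scale=0.1, size=(3, dim))   # center vectors
b = np.zeros(3)

print(weight(np.array([0.0, 50.0, 100.0, 150.0])))
print(glove_loss(X, U, V, b))
```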