Repository: airsplay/vokenization
Branch: master
Commit: 5601b799184e
Files: 70
Total size: 295.5 KB
Directory structure:
gitextract_6h_zq3l4/
├── LICENSE
├── README.md
├── data/
│ ├── lxmert/
│ │ └── .gitignore
│ ├── mscoco/
│ │ └── .gitignore
│ ├── vg/
│ │ └── .gitignore
│ ├── wiki/
│ │ ├── get_data_cased.bash
│ │ ├── get_data_cased_untokenized.bash
│ │ ├── install-tools.sh
│ │ └── tools/
│ │ ├── remove_accent.py
│ │ ├── segment_th.py
│ │ └── tokenize.sh
│ └── wiki103/
│ ├── get_data_cased.sh
│ └── get_data_uncased.sh
├── requirements.txt
├── scripts/
│ ├── base_vlm_wiki.bash
│ ├── base_vlm_wiki_glue.bash
│ ├── base_wiki.bash
│ ├── base_wiki_glue.bash
│ ├── extract_keys.bash
│ ├── mpvokenize_wiki.bash
│ ├── mpvokenize_wiki103.bash
│ ├── run_glue_at_epoch.bash
│ ├── run_glue_epochs.bash
│ ├── run_xmatching.bash
│ ├── small_vlm_wiki103.bash
│ ├── small_vlm_wiki103_glue.bash
│ ├── small_wiki103.bash
│ ├── small_wiki103_glue.bash
│ └── xmatching_benchmark.bash
├── snap/
│ ├── bert/
│ │ └── .gitkeep
│ ├── vlm/
│ │ └── .gitkeep
│ └── xmatching/
│ └── .gitkeep
├── tokenization/
│ ├── to_hdf5.py
│ ├── tokenize_dataset.py
│ ├── tokenize_wiki103_bert.bash
│ ├── tokenize_wiki103_roberta.bash
│ ├── tokenize_wiki_bert.bash
│ └── tokenize_wiki_roberta.bash
├── vlm/
│ ├── __init__.py
│ ├── configs/
│ │ ├── bert-12L-768H.json
│ │ ├── bert-4L-768H.json
│ │ ├── bert-6L-512H.json
│ │ └── bert_base.json
│ ├── data.py
│ ├── model.py
│ ├── param.py
│ ├── run_glue.py
│ ├── run_glue_epochs.py
│ ├── run_lm_distributed.py
│ ├── run_vlm_distributed.py
│ └── show_glue_results_epochs.py
├── vokenization/
│ ├── __init__.py
│ ├── common.py
│ ├── create_image_ids.py
│ ├── evaluate_diversity.py
│ ├── evaluate_retrieval.py
│ ├── extract_vision_keys.py
│ ├── indexing.py
│ ├── revokenization.py
│ ├── revokenize_corpus_mp.py
│ ├── vokenization.py
│ └── vokenize_corpus_mp.py
└── xmatching/
├── __init__.py
├── data.py
├── frozen_batch_norm.py
├── loss.py
├── main.py
├── metric.py
├── model.py
└── param.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2020 Hao Tan
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Vokenization
PyTorch code for the EMNLP 2020 paper "[Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision](https://arxiv.org/pdf/2010.06775.pdf)" (Hao Tan and Mohit Bansal).
**Outline**
* [Contextualized Cross-Modal Matching](#contextualized-cross-modal-matching-xmatching)
* [Downloading Image and Captioning Data](#download-image-and-captioning-data)
* [Model Training](#training-the-cross-modal-matching-model)
* [Benchmark (Optional)](#benchmarking-cross-modal-matching-models-optional)
* [Vokenization](#vokenization-vokenization)
* [Downloading Pure-Language Data](#downloading-and-pre-processing-pure-language-data)
* [Extracting Image Features](#extracting-image-features)
* [Vokenization Process](#the-vokenization-process)
* [Visually-Supervised Language Model](#visually-supervised-language-model-vlm)
* [VLM Pre-training](#pre-training-with-vlm)
* [GLUE Evaluation](#glue-evaluation)
* [MLM Pre-training (as baselines)](#bert-as-baselines)
> Note: I recommend focusing on "Wiki103" first and
> ignoring the code blocks related to "English Wikipedia".
> "Eng Wiki" might take too long to complete.
## Installation
```shell script
pip install -r requirements.txt
```
Requires Python 3.6+ (to support huggingface [transformers](https://github.com/huggingface/transformers)).
## Contextualized Cross-Modal Matching (xmatching)
In this [module](xmatching) (corresponding to Sec 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)),
we want to learn a token-image matching model from sentence-image aligned data (i.e., image captioning data).
The model "contextually" measures the relevance between tokens (i.e., words) and images.
The terminology "contextual" emphasizes that the sentences (the context) are considered
when measuring the token-image relevance score.
### Download Image and Captioning Data
1. Download MS COCO images:
```shell script
# MS COCO (Train 13G, Valid 6G)
mkdir -p data/mscoco
wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
```
If you already have the COCO images on disk, save them as
```
data
|-- mscoco
|-- images
|-- train2014
|-- COCO_train2014_000000000009.jpg
|-- COCO_train2014_000000000025.jpg
|-- ......
|-- val2014
|-- COCO_val2014_000000000042.jpg
|-- ......
```
2. Download captions (split following the LXMERT project):
```shell script
mkdir -p data/lxmert
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
```
### Training the Cross-Modal Matching Model
The model is trained on MS COCO with pairwise hinge loss (details in Sec. 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)).
Running Commands:
```bash
# Run the cross-modal matching model with single-machine multi-processing distributed training
# "0,1" indicates using the GPUs 0 and 1.
# "bert_resnext" is the name of this snapshot and would be saved at snap/xmatching/bert_resnext
# "--visn resnext101_32x8d" is the vision backbone
# "--lang bert" is the language backbone
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert
```
The options `--visn` and `--lang` specify the architecture of the encoder.
Tested options
```
--visn $VISN_MODEL
VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152,
wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...}
--lang $LANG_MODEL
LANG_MODEL={bert, roberta, xlnet, bert-large, ...}
```
For visual backbones, the models in [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) are mostly supported.
You might need to handle the last FC layer, because it is written differently in different backbones.
The language backbones are initialized from huggingface [transformers](https://github.com/huggingface/transformers).
> We found that the results with XLNet are pretty low but have not identified
> the reason. Results of other backbones are similar.
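As a reference for the pairwise hinge loss mentioned above, here is a minimal sketch in plain Python (the margin value 0.5 is an assumption echoing the `hinge05` name of the released snapshot; see [xmatching/loss.py](xmatching/loss.py) for the exact formulation):

```python
def pairwise_hinge(pos_scores, neg_scores, margin=0.5):
    """Average of max(0, margin - s(pos) + s(neg)) over a batch of
    (positive, negative) relevance-score pairs."""
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)
```

In the actual model, the scores are inner products between contextualized token embeddings and image embeddings, with negatives drawn from other images in the batch.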
## Vokenization (vokenization)
Vokenization is the bridge between the cross-modal (word-and-image) matching model (xmatching) and
the visually-supervised language models (vlm).
The final goal is to convert the language tokens to related images
(we call them **vokens**).
These **vokens** enable visual supervision of the language model.
We mainly provide pre-processing tools (i.e., feature extraction, tokenization, and vokenization) and
evaluation tools for the previously trained cross-modal matching models here.
Here is a diagram of these processes; we next discuss them one-by-one:
```
Extracting Image Features-----> Benchmarking the Matching Models (Optional) --> Vokenization
Downloading Language Data --> Tokenization -->-->--/
```
### Downloading and Pre-Processing Pure-Language Data
We provide scripts to get the datasets "wiki103" and "wiki".
We denote them as "XX-cased" or "XX-uncased", where the suffix "cased" / "uncased" only indicates
the casing of the raw text.
1. **Wiki103**. The [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset
is a selected subset of English Wikipedia, containing around 100M tokens.
```shell script
bash data/wiki103/get_data_cased.sh
```
2. **English Wikipedia**.
The script to download and process wiki data are modified from [XLM](https://github.com/facebookresearch/XLM).
It will download a 17GB file.
The download speed depends on your network, and filtering the data usually takes several hours.
The process ends with around 2.8B tokens.
```shell script
bash data/wiki/get_data_cased.bash en
```
Note: *RoBERTa* requires an untokenized version of wiki (otherwise the results would be much lower),
so please use the following command:
```shell script
bash data/wiki/get_data_cased_untokenized.bash en
```
> Note: I recommend focusing on "Wiki103" first and
> ignoring the code blocks related to "English Wikipedia".
> "Eng Wiki" might take too long to complete.
### Tokenization of Language Data
We next tokenize the language corpus.
It saves three files locally:
"$dataset_name.$tokenizer_name",
"$dataset_name.$tokenizer_name.hdf5",
and "$dataset_name.$tokenizer_name.line".
Taking the wiki103 dataset and BERT tokenizer as an example,
we convert the training file into
```
data
|-- wiki103-cased
|-- wiki.train.raw.bert-base-uncased
|-- wiki.train.raw.bert-base-uncased.hdf5
|-- wiki.train.raw.bert-base-uncased.line
```
The txt file `wiki.train.raw.bert-base-uncased` saves the tokens: each line in this file contains the tokens
of the corresponding line in the original file.
The hdf5 file `wiki.train.raw.bert-base-uncased.hdf5` stores all the tokens contiguously and uses
`wiki.train.raw.bert-base-uncased.line` to index the starting token of each line.
The ".line" file has `L+1` entries, where `L` is the number of lines in the original file:
line `i` of the original file corresponds to the token range `line[i]` to `line[i+1]` in the hdf5 file.
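The indexing scheme can be sketched with a plain list standing in for the hdf5 token dataset (the real files are read with `h5py`, where slicing works the same way):

```python
def tokens_of_line(token_array, line_starts, i):
    # Line i of the original file covers token positions
    # [line_starts[i], line_starts[i + 1]) in the flat token array.
    return token_array[line_starts[i]:line_starts[i + 1]]

# Two lines of 4 and 2 tokens stored back-to-back
# (hypothetical BERT token ids, for illustration only):
tokens = [101, 2054, 3899, 102, 101, 102]
starts = [0, 4, 6]  # the L+1 offsets stored in the ".line" file
```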
Commands:
1. Wiki103 (around 10 min)
```shell script
bash tokenization/tokenize_wiki103_bert.bash
```
2. English Wikipedia (around 3 hours)
```shell script
bash tokenization/tokenize_wiki_bert.bash
```
### Extracting Image Features
The image pre-processing extracts the image features to build the keys in the vokenization retrieval process.
#### Download the Visual Genome (VG) images
Since the MS COCO images are used to train the cross-modal matching model
in [xmatching](#contextualized-cross-modal-matching-xmatching),
we use the [Visual Genome](https://visualgenome.org/) images as
candidate vokens for retrieval.
We first download the images:
```shell script
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/
unzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip
unzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../
```
If you already have the Visual Genome images on disk, save them as
```
data
|-- vg
|-- images
|-- 1000.jpg
|-- 1001.jpg
|-- ......
```
#### Build Universal Image Ids
We first build a list of universal image indexes with
[vokenization/create_image_ids.py](vokenization/create_image_ids.py).
It unifies the image ids across different experiments
so that the feature arrays stored in hdf5 can be indexed consistently.
The image ids are saved under a shared path `LOCAL_DIR` (default to `data/vokenization`)
defined in [vokenization/common.py](vokenization/common.py).
The image ids are saved under `data/vokenization/images` with format `{IMAGE_SET}_ids.txt`.
We make sure that all experiments agree on this meta info,
so that different retrieval experiments never use different indexing.
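The shared indexing boils down to a fixed id-to-row mapping; here is a minimal sketch, assuming one image id per line in the `_ids.txt` file:

```python
def image_index(id_lines):
    """Map each universal image id to its row in the shared hdf5
    feature array, so every experiment indexes features identically."""
    ids = [line.strip() for line in id_lines if line.strip()]
    return {img_id: row for row, img_id in enumerate(ids)}
```

For example, `image_index(open("data/vokenization/images/vg_nococo_ids.txt"))` would map each VG image id to its feature row.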
> Note: The ids created by [create_image_ids.py](vokenization/create_image_ids.py) only fix the order of the images.
> The actual images in the dictionary are provided by `extract_keys.bash` and thus correspond to the
> `_paths.txt` file, because `extract_keys` filters out all broken and non-existing images.
Commands:
```bash
# Step 1, Build image orders.
python vokenization/create_image_ids.py
```
#### Extracting Image Features
Extract image features regarding the list built above, using code
[vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py).
The code will first read the image ids saved in `data/vokenization/images/{IMAGE_SET}_ids.txt` and locate the images.
The features will be saved under `snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5`.
It finishes within 1 hour.
Commands:
```bash
# Step 2, Extract features.
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME
bash scripts/extract_keys.bash 0 bert_resnext
```
### Benchmarking Cross-Modal Matching Models (Optional)
> Before evaluating, please make sure that the image feature extraction and tokenization steps above are completed.
We benchmark the performance of the cross-modal matching models at a large scale.
The evaluation includes two different metrics: diversity and retrieval performance.
Diversity
(in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py))
ensures that the same [token type](https://arxiv.org/pdf/1902.06006.pdf)
is mapped to diverse images regarding its context (i.e., the sentence).
Retrieval
(in [vokenization/evaluate_retrieval.py](vokenization/evaluate_retrieval.py))
measures the correspondence of the token and the retrieved images.
We gather these two utilities into one script; run it with:
```bash
bash scripts/xmatching_benchmark.bash 0 bert_resnext
```
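To illustrate the diversity metric, here is a toy version that counts, per token type, the distinct vokens retrieved across contexts (a sketch of the idea only; the actual metric lives in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py)):

```python
from collections import defaultdict

def mean_voken_diversity(token_voken_pairs):
    """Average, over token types, of the number of distinct vokens
    the type was mapped to in different contexts."""
    vokens_per_type = defaultdict(set)
    for token, voken in token_voken_pairs:
        vokens_per_type[token].add(voken)
    return sum(len(v) for v in vokens_per_type.values()) / len(vokens_per_type)
```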
### The Vokenization Process
After all these steps, we can start to vokenize the language corpus.
The vokenizer loads the tokens saved in `dataset_name.tokenizer_name.hdf5`
and uses the line-split information in `dataset_name.tokenizer_name.line`.
The code is optimized and, if interrupted, can be resumed by simply rerunning it.
The vokens will be saved in `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5` by default.
The file `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids` contains the universal image ids
for each voken,
e.g., the image id `vg_nococo/8` corresponds to 8-th feature
saved in `snap/xmatching/bert_resnext/keys/vg_nococo.hdf5`.
> Note: `--tokenizer-name` must be provided in the script.
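A voken id of this form can be resolved back to its feature row as follows (a sketch based on the `vg_nococo/8` example above):

```python
def parse_voken_id(voken_id):
    # e.g. "vg_nococo/8" -> row 8 of snap/xmatching/<snap>/keys/vg_nococo.hdf5
    image_set, row = voken_id.rsplit("/", 1)
    return image_set, int(row)
```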
Commands:
1. Wiki103 (around 1 hour on 4 Titan V)
```shell script
# Note: mp is the abbreviation for "multi-processing"
# bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
```
2. English Wikipedia (around 1 day on 4 Titan V)
```shell script
# bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext
```
> The script will call
> [vokenization/vokenize_corpus_mp.py](vokenization/vokenize_corpus_mp.py)
> to vokenize a corpus.
> The vokenization happens in [vokenization/vokenization.py](vokenization/vokenization.py), and
> it uses [vokenization/indexing.py](vokenization/indexing.py) to do nearest-neighbor search
> (based on [faiss](https://github.com/facebookresearch/faiss)).
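Conceptually, that nearest-neighbor step picks, for each contextualized token embedding, the image key with the highest inner product. A brute-force pure-Python stand-in (faiss performs the same search, just much faster over the full key set):

```python
def nearest_voken(token_emb, image_keys):
    """Return the index of the image key with the highest
    inner product against one contextualized token embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(range(len(image_keys)), key=lambda i: dot(token_emb, image_keys[i]))
```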
## Visually-Supervised Language Model (vlm)
### Pre-Training with VLM
As discussed in Sec. 2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf),
we use the previously generated vokens to pre-train the model
with visual supervision.
#### Wiki103
After the [vokenization process](#the-vokenization-process) of wiki103,
we could run the model with command:
```shell script
# bash scripts/small_vlm_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small
```
It will call
[vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the support of voken supervisions.
The snapshot will be saved to `snap/vlm/wiki103_bert_small`.
We recommend running this Wiki103 experiment first since it finishes
in a reasonable time (around 20 hours).
The pure BERT pre-training option is also available [later](#bert-as-baselines)
for comparisons.
Note: by default, mixed-precision training is not used.
To support the mixed precision pre-training,
please install the [nvidia/apex](https://github.com/NVIDIA/apex) library with command:
```shell script
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
After that, you can bring back the options `--fp16` and `--fp16_opt_level O2` in
the script `scripts/small_vlm_wiki103.bash`.
I recommend using `--fp16_opt_level O2`.
Although the option O2 might be [unstable](https://github.com/NVIDIA/apex/issues/818#issuecomment-639012282),
it saves a lot of memory:
the max per-GPU batch size is 32 with O1 but 64 with O2.
#### English Wikipedia
After the [vokenization process](#the-vokenization-process) of English Wikipedia,
we could run the model with command:
```shell script
# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base
```
It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia
dataset with the support of voken supervisions.
The snapshot will be saved to `snap/vlm/wiki_bert_base`.
It takes around 3-5 days on 4 Titan V / RTX 2080 cards
and around 5-7 days on 4 Titan Pascal / T4 cards.
(These estimates are accurate, since I inevitably ran experiments on all of these servers...)
Titan V / RTX 2080 / T4 cards natively support mixed-precision training (triggered by the `--fp16` option, which
requires installing [apex](https://github.com/NVIDIA/apex)).
This makes training much faster.
Titan Pascal cards would also save some memory with the `--fp16` option.
### GLUE Evaluation
By default, we use the [GLUE](https://gluebenchmark.com/) benchmark
(e.g., [SST](https://nlp.stanford.edu/sentiment/index.html),
[MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398),
[QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs),
[MNLI](https://cims.nyu.edu/~sbowman/multinli/),
[QNLI](https://rajpurkar.github.io/SQuAD-explorer/))
as downstream tasks.
Other tasks could be evaluated following the setup [here](https://github.com/huggingface/transformers/tree/28d183c90cbf91e94651cf4a655df91a52ea1033/examples)
by changing the option `--model_name_or_path` to the correct snapshot path `snap/bert/wiki103`.
#### Download GLUE dataset
This downloading script is copied from the [huggingface transformers](https://github.com/huggingface/transformers/tree/master/examples/text-classification)
project.
Since [transformers](https://github.com/huggingface/transformers) is still under heavy
development, API changes might affect the code.
I have upgraded the code for compatibility with transformers==3.3.
```shell script
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py
python download_glue_data.py --data_dir data/glue --tasks all
```
#### Finetuning on GLUE Tasks
The pre-trained snapshots are evaluated by fine-tuning them on the [GLUE](https://gluebenchmark.com/)
benchmark.
The code is modified from huggingface [transformers](https://github.com/huggingface/transformers).
Running GLUE evaluation for snapshots from different epochs:
```bash
# bash scripts/run_glue_epochs.bash $GPUS $SNAP_PATH --snaps $NUM_OF_SNAPS
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7
```
It will assess 7 snapshots using GPUs 0,1,2,3.
Setting `--snaps -1` will assess all checkpoints.
If you only want to evaluate the last (usually the best) snapshot, use:
```
bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1
```
#### Showing the results
For all results saved under `snap/` (regardless of the directory names),
running the following command will print all the results.
```bash
python vlm/show_glue_results_epochs.py
```
It will print results like
```
snap/vlm/test_finetune/glueepoch_checkpoint-epoch0019
RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI MNLI-MM GLUE
54.51 84.72 87.18 52.32 90.02 88.36 87.16 81.92 82.57 78.75
snap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029
RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI MNLI-MM GLUE
58.12 82.76 84.45 26.74 89.56 84.40 86.52 77.56 77.99 74.23
```
### BERT (As baselines)
We also provide pure language-model pre-training as baselines.
#### Wiki103
```shell script
# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small
```
It will call
[vlm/run_lm_distributed.py](vlm/run_lm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the masked language model only.
The snapshot will be saved to `snap/bert/wiki103_bert_small`.
Alternatively, use the script `small_wiki103_glue.bash` to
run the GLUE evaluation right after pre-training finishes.
```shell script
bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small
```
#### English Wikipedia
Command:
```shell script
# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki
```
With GLUE evaluation:
```shell script
bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki
```
## Pre-processed Data and Pre-trained Models
### Data
Wiki103 (100M tokens)
```
mkdir -p data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
```
Wiki (2.8B tokens)
```
mkdir -p data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased
```
### Models
- Cross-Modal Matching model: [https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip](https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip)
- BERT (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip)
- BERT + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip)
- RoBERTa + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip)
## Reference
If you find our project useful, please cite this paper:
```
@inproceedings{tan2020vokenization,
title={Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision},
author={Tan, Hao and Bansal, Mohit},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
```
## Acknowledgement
I thank the support from [Bloomberg Data Science Ph.D. Fellowship](https://www.techatbloomberg.com/bloomberg-data-science-ph-d-fellowship/).
We thank the reviewers and [Yixin Nie](https://easonnie.github.io/)
and [Jie Lei](https://www.cs.unc.edu/~jielei/)
for their helpful discussions.
Part of the code is built on huggingface [transformers](https://github.com/huggingface/transformers),
facebook's [XLM](https://github.com/facebookresearch/XLM), and [faiss](https://github.com/facebookresearch/faiss).
================================================
FILE: data/lxmert/.gitignore
================================================
/mscoco_minival.json
/mscoco_nominival.json
/mscoco_train.json
/vgnococo.json
================================================
FILE: data/mscoco/.gitignore
================================================
/images
================================================
FILE: data/vg/.gitignore
================================================
/images
================================================
FILE: data/wiki/get_data_cased.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
#
# Usage: ./get-data-wiki.sh $lg (en)
#
set -e
lg=$1 # input language
# data path
WIKI_PATH=data/wiki-cased
MAIN_PATH=$WIKI_PATH
# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py
# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME
# install tools
data/wiki/install-tools.sh $TOOLS_PATH
# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt
# download Wikipedia dump
echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"
# extract and tokenize Wiki data
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
#python -m $TOOLS_PATH/wikiextractor/wikiextractor/WikiExtractor $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
| sed "/^\s*\$/d" \
| grep -v "^<doc id=" \
| grep -v "</doc>\$" \
| $TOKENIZE $lg $TOOLS_PATH \
| python $REMOVE_ACCENT \
> $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Tokenized ( + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/$lg.all.raw ***"
# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
NLINES=`wc -l $1 | awk -F " " '{print $1}'`;
NTRAIN=$((NLINES - 10000));
NVAL=$((NTRAIN + 5000));
cat $1 | head -$NTRAIN > $2;
cat $1 | head -$NVAL | tail -5000 > $3;
cat $1 | tail -5000 > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw
# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt
================================================
FILE: data/wiki/get_data_cased_untokenized.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
#
# Usage: ./get-data-wiki.sh $lg (en)
#
set -e
lg=$1 # input language
# data path
WIKI_PATH=data/wiki-cased-untokenized
MAIN_PATH=$WIKI_PATH
# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py
# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME
# install tools
data/wiki/install-tools.sh $TOOLS_PATH
# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt
# download Wikipedia dump
if [ ! -f $WIKI_PATH/bz2/enwiki-latest-pages-articles.xml.bz2 ]; then
echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"
fi
# extract and tokenize Wiki data
#cd $MAIN_PATH
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
| sed "/^\s*\$/d" \
| grep -v "^<doc id=" \
| grep -v "</doc>\$" \
| python $REMOVE_ACCENT \
> $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Not tokenized (but accent-removed) $lg Wikipedia dump to $WIKI_PATH/txt/$lg.all.raw ***"
# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
NLINES=`wc -l $1 | awk -F " " '{print $1}'`;
NTRAIN=$((NLINES - 10000));
NVAL=$((NTRAIN + 5000));
cat $1 | head -$NTRAIN > $2;
cat $1 | head -$NVAL | tail -5000 > $3;
cat $1 | tail -5000 > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw
# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt
================================================
FILE: data/wiki/install-tools.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
set -e
# data path
TOOLS_PATH=$1
# tools
MOSES_DIR=mosesdecoder
FASTBPE_DIR=fastBPE
FASTBPE=fast
WMT16_SCRIPTS=wmt16-scripts
# tools path
mkdir -p $TOOLS_PATH
# Copy the scripts to TOOLS_PATH
cp -r data/wiki/tools/* $TOOLS_PATH
#
# Download and install tools
#
old=$(pwd)
cd $TOOLS_PATH
# Download Moses
if [ ! -d "$MOSES_DIR" ]; then
echo "Cloning Moses from GitHub repository..."
git clone https://github.com/moses-smt/mosesdecoder.git
fi
# Download fastBPE
if [ ! -d "$FASTBPE_DIR" ]; then
echo "Cloning fastBPE from GitHub repository..."
git clone https://github.com/glample/fastBPE
fi
# Compile fastBPE
if [ ! -f "$FASTBPE_DIR/$FASTBPE" ]; then
echo "Compiling fastBPE..."
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
cd ..
fi
# Download Sennrich's tools
if [ ! -d "$WMT16_SCRIPTS" ]; then
echo "Cloning WMT16 preprocessing scripts..."
git clone https://github.com/rsennrich/wmt16-scripts.git
fi
# Download WikiExtractor
if [ ! -d wikiextractor ]; then
echo "Cloning WikiExtractor from GitHub repository..."
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
cd ..
fi
cd $old
# # Chinese segmenter
# if ! ls $TOOLS_PATH/stanford-segmenter-* 1> /dev/null 2>&1; then
# echo "Stanford segmenter not found at $TOOLS_PATH/stanford-segmenter-*"
# echo "Please install Stanford segmenter in $TOOLS_PATH"
# exit 1
# fi
#
# # Thai tokenizer
# if ! python -c 'import pkgutil; exit(not pkgutil.find_loader("pythainlp"))'; then
# echo "pythainlp package not found in python"
# echo "Please install pythainlp (pip install pythainlp)"
# exit 1
# fi
#
================================================
FILE: data/wiki/tools/remove_accent.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import sys
import unicodedata
import six
def convert_to_unicode(text):
"""
Converts `text` to Unicode (if it's not already), assuming UTF-8 input.
"""
# six_ensure_text is copied from https://github.com/benjaminp/six
def six_ensure_text(s, encoding='utf-8', errors='strict'):
if isinstance(s, six.binary_type):
return s.decode(encoding, errors)
elif isinstance(s, six.text_type):
return s
else:
raise TypeError("not expecting type '%s'" % type(s))
return six_ensure_text(text, encoding="utf-8", errors="ignore")
def run_strip_accents(text):
"""
Strips accents from a piece of text.
"""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
for line in sys.stdin:
line = convert_to_unicode(line.rstrip())
line = run_strip_accents(line)
print(u'%s' % line)
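The accent-stripping step above can be exercised standalone; a minimal sketch (the function name here is ours, not part of the script):

```python
import unicodedata

def strip_accents(text):
    # NFD-decompose, then drop combining marks (Unicode category "Mn"),
    # mirroring run_strip_accents() above.
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

assert strip_accents("café") == "cafe"
assert strip_accents("naïve") == "naive"
```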
================================================
FILE: data/wiki/tools/segment_th.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import sys
from pythainlp.tokenize import word_tokenize
for line in sys.stdin.readlines():
line = line.rstrip('\n')
print(' '.join(word_tokenize(line)))
================================================
FILE: data/wiki/tools/tokenize.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
# Tokenize text data in various languages
# Usage example: cat wiki.ar | tokenize.sh ar $TOOLS_PATH
set -e
N_THREADS=8
lg=$1
TOOLS_PATH=$2
# moses
MOSES=$TOOLS_PATH/mosesdecoder
REPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl
TOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl
# Chinese
if [ "$lg" = "zh" ]; then
$TOOLS_PATH/stanford-segmenter-*/segment.sh pku /dev/stdin UTF-8 0 | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR
# Thai
elif [ "$lg" = "th" ]; then
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $TOOLS_PATH/segment_th.py
# Japanese
elif [ "$lg" = "ja" ]; then
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | kytea -notags
# other languages
else
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
fi
================================================
FILE: data/wiki103/get_data_cased.sh
================================================
OUTPUT=data/wiki103-cased
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-raw-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103-raw/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-raw-v1.zip $OUTPUT/wikitext-103-raw
================================================
FILE: data/wiki103/get_data_uncased.sh
================================================
OUTPUT=data/wiki103
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-v1.zip $OUTPUT/wikitext-103
================================================
FILE: requirements.txt
================================================
torch
#==1.4.0
torchvision
#==0.5.0
transformers==3.3.0
tensorboardX
# For GLUE evaluation
scikit-learn
# Faiss supports fast indexing.
# The code also has a torch-implemented GPU indexing, so do not worry if you cannot install faiss.
faiss-gpu>=1.6.3
# spaCy is used for sentence segmentation; the sentences are the input to the cross-modality matching model.
spacy
# h5py >= 2.10.0 is required to support h5py.VirtualLayout.
h5py>=2.10.0
================================================
FILE: scripts/base_vlm_wiki.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_vlm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-12L-768H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=32 \
--per_gpu_eval_batch_size=32 \
--gradient_accumulation_steps=2 \
--max_steps=200000 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=5000 \
--mlm_probability 0.15 \
--mlm_ratio 1.0 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--do_voken_cls \
--voken_labels all \
--voken_dir snap/xmatching/bert_resnext/vokens \
--voken_suffix vg_nococo \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
================================================
FILE: scripts/base_vlm_wiki_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_vlm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-12L-768H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=32 \
--per_gpu_eval_batch_size=32 \
--gradient_accumulation_steps=2 \
--max_steps=200000 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=5000 \
--mlm_probability 0.15 \
--mlm_ratio 1.0 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--do_voken_cls \
--voken_labels all \
--voken_dir snap/xmatching/bert_resnext/vokens \
--voken_suffix vg_nococo \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4
================================================
FILE: scripts/base_wiki.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-12L-768H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=64 \
--per_gpu_eval_batch_size=64 \
--gradient_accumulation_steps=1 \
--max_steps 220000 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=5000 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
================================================
FILE: scripts/base_wiki_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-12L-768H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=64 \
--per_gpu_eval_batch_size=64 \
--gradient_accumulation_steps=1 \
--max_steps 220000 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=5000 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
#--shuffle \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps -1
================================================
FILE: scripts/extract_keys.bash
================================================
CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \
--image-sets vg_nococo,coco_minival,coco_nominival,coco_train,cc_valid \
--load-dir snap/xmatching/$2
================================================
FILE: scripts/mpvokenize_wiki.bash
================================================
GPU=$1
LOAD=snap/xmatching/$2
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased
for DATA_NAME in en.valid.raw en.test.raw en.train.raw
do
CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
--load $LOAD \
--corpus=$DATA_DIR/$DATA_NAME \
--tokenizer-name $TOKENIZER \
--image-sets vg_nococo \
--max-img-num 50000
done
================================================
FILE: scripts/mpvokenize_wiki103.bash
================================================
GPU=$1
LOAD=snap/xmatching/$2
WIKI_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased
for DATA_NAME in wiki.valid.raw wiki.test.raw wiki.train.raw
do
CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
--load $LOAD \
--corpus=$WIKI_DIR/$DATA_NAME \
--tokenizer-name $TOKENIZER \
--image-sets vg_nococo \
--max-img-num 50000
done
================================================
FILE: scripts/run_glue_at_epoch.bash
================================================
export GLUE_DIR=data/glue/
EPOCHS=$2
MODEL=$3
CKPT=$4
for TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI
do
CUDA_VISIBLE_DEVICES=$1 python vlm/run_glue.py \
--model_type bert \
--tokenizer_name=bert-base-uncased \
--model_name_or_path $MODEL/$CKPT \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME \
--save_steps -1 \
--max_seq_length 126 \
--per_gpu_eval_batch_size=32 \
--per_gpu_train_batch_size=32 \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--num_train_epochs $EPOCHS.0 \
--output_dir $MODEL/glueepoch_$CKPT/$TASK_NAME
done
#--overwrite_output_dir \
================================================
FILE: scripts/run_glue_epochs.bash
================================================
GPUS=$1
MODEL=$2
python vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \
${@:3}
================================================
FILE: scripts/run_xmatching.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/xmatching/$NAME
mkdir -p $output/src/
cp -r xmatching $output/src/
cp $0 $output/run.bash
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python xmatching/main.py \
--train-imgs mscoco_train,mscoco_nominival --valid-imgs mscoco_minival \
--train-langs mscoco --valid-langs mscoco \
--max-len 20 --dim 64 \
--lang-layers 4,3,2,1 \
--lang-pretrained --visn-pretrained \
--num-workers 8 --batchSize 256 --optim adam --lr 1e-3 --epochs 20 \
--nodes 1 --nr 0 \
--output $output ${@:3} | tee $output/log.log
#--visn resnext101_32x8d --lang bert \
================================================
FILE: scripts/small_vlm_wiki103.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_vlm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-6L-512H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=32 \
--per_gpu_eval_batch_size=32 \
--gradient_accumulation_steps=2 \
--num_train_epochs=40 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=10000 \
--mlm_probability 0.15 \
--mlm_ratio 1.0 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--do_voken_cls \
--voken_labels all \
--voken_dir snap/xmatching/bert_resnext/vokens \
--voken_suffix vg_nococo \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
================================================
FILE: scripts/small_vlm_wiki103_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_vlm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-6L-512H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=32 \
--per_gpu_eval_batch_size=32 \
--gradient_accumulation_steps=2 \
--num_train_epochs=40 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=10000 \
--mlm_probability 0.15 \
--mlm_ratio 1.0 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--do_voken_cls \
--voken_labels all \
--voken_dir snap/xmatching/bert_resnext/vokens \
--voken_suffix vg_nococo \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4
================================================
FILE: scripts/small_wiki103.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-6L-512H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=64 \
--per_gpu_eval_batch_size=64 \
--gradient_accumulation_steps=1 \
--num_train_epochs=44 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=10000 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--shuffle \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
================================================
FILE: scripts/small_wiki103_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2
# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw
# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
--output_dir=$output \
--overwrite_output_dir \
--config_name=vlm/configs/bert-6L-512H.json \
--tokenizer_name=bert-base-uncased \
--model_type=bert \
--block_size=126 \
--per_gpu_train_batch_size=64 \
--per_gpu_eval_batch_size=64 \
--gradient_accumulation_steps=1 \
--num_train_epochs=44 \
--learning_rate=2e-4 \
--weight_decay=0.01 \
--warmup_steps=10000 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--col_data \
--split_sent \
--shuffle \
--mlm ${@:3} | tee $output/log.log
#--fp16 \
#--fp16_opt_level O2 \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4
================================================
FILE: scripts/xmatching_benchmark.bash
================================================
# Benchmarking the cross-modal matching model with
# 1. Retrieval scores.
# 2. Voken diversity w.r.t. words in a specific language corpus.
# Please run this after image key extraction and tokenization,
# i.e., step 1 and step 2 in README.md.
MODEL=$2
MODELPATH=snap/xmatching/$MODEL
rm -rf $MODELPATH/analysis.log
# Retrieval scores
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_retrieval.py \
--load $MODELPATH \
--image-sets coco_minival,cc_valid \
| tee -a $MODELPATH/analysis.log
# Diversity
# Test diversity of vision-and-language (captioning) datasets
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
--load $MODELPATH \
--image-sets vg_nococo \
--corpus coco_minival,cc_valid \
| tee -a $MODELPATH/analysis.log
# Test diversity of pure-language corpus
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
--load $MODELPATH \
--image-sets vg_nococo \
--corpus data/wiki103-cased/wiki.valid.raw \
--maxsents 95000 \
| tee -a $MODELPATH/analysis.log
================================================
FILE: snap/bert/.gitkeep
================================================
================================================
FILE: snap/vlm/.gitkeep
================================================
================================================
FILE: snap/xmatching/.gitkeep
================================================
/*
================================================
FILE: tokenization/to_hdf5.py
================================================
import h5py
import numpy as np
import tqdm
from transformers import AutoTokenizer
def validate_hdf5(fname, tokenizer_name):
print("--------------------------------------------")
print("Start to validate the hdf5 file", fname + '.' + tokenizer_name + '.hdf5')
with open(fname) as f:
lines = []
for line in f:
if 'wiki' in fname:
# Wiki103: remove document title
if line.startswith(' = '):
continue
# Full Wiki: Remove the too short lines.
if len(line.strip().split(' ')) < 5:
continue
if len(line.strip()) == 0:
# Always drop empty line
continue
lines.append(line)
# Use the slow tokenizer to validate the results of the fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'r')
tokens = h5_file['tokens']
print("Start to check the first 10 lines:")
ids = []
for line in lines[:10]:
ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
ids = np.array(ids)
first_tokens = np.array(tokens[:len(ids)])
if np.array_equal(ids, first_tokens):
print("PASS")
else:
print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
print()
print(' '.join(tokenizer.convert_ids_to_tokens(first_tokens)))
assert False, "FAIL"
print("Start to check the last 10 lines:")
ids = []
for line in lines[-10:]:
ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
ids = np.array(ids)
last_tokens = np.array(tokens[-len(ids):])
if np.array_equal(ids, last_tokens):
print("PASS")
else:
print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
print(' '.join(tokenizer.convert_ids_to_tokens(last_tokens)))
assert False, "FAIL"
print("--------------------------------------------")
def to_hdf5(fname, tokenizer_name, validate=True):
print("Process %s" % fname)
h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'w')
dset = h5_file.create_dataset("tokens",
(0,),
maxshape=(None,),
dtype='int32')
dump_interval = 1000000
dump_iter = 0
with open('%s.%s' % (fname, tokenizer_name)) as f:
lines = 0
tokens = []
for line in tqdm.tqdm(f):
for token in map(int, line.split(' ')):
tokens.append(token)
if len(tokens) >= dump_interval:
dset.resize((dump_iter + len(tokens),))
dset[dump_iter: dump_iter + len(tokens)] = tokens
dump_iter += len(tokens)
tokens = []
lines += 1
dset.resize((dump_iter + len(tokens),))
dset[dump_iter: dump_iter + len(tokens)] = tokens
dump_iter += len(tokens)
assert len(dset) == dump_iter
h5_file.close()
if validate:
validate_hdf5(fname, tokenizer_name)
print()
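The buffered dump loop in to_hdf5() can be sketched without HDF5; this standalone illustration (names are ours) shows the accumulate-then-flush pattern, including the final flush of the partial buffer:

```python
def buffered_append(token_stream, dump_interval=3):
    dset, tokens = [], []
    for token in token_stream:
        tokens.append(token)
        if len(tokens) >= dump_interval:
            dset.extend(tokens)   # stands in for dset.resize + slice assignment
            tokens = []
    dset.extend(tokens)           # flush the last partial buffer
    return dset

assert buffered_append(range(7)) == [0, 1, 2, 3, 4, 5, 6]
```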
================================================
FILE: tokenization/tokenize_dataset.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.
import argparse
from pathlib import Path
from transformers import AutoTokenizer
import time
from to_hdf5 import to_hdf5
def tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=False):
data_path = Path(data_dir)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)
f = open(data_path / fname)
g = open((data_path / ('%s.%s' % (fname, tokenizer_name))), 'w')
# Statistics
dcmt_cnt = 0
token_cnt = 0
line_cnt = 0
line_starts = []
# Logging and dumping hyper-parameters
cache = ''
log_interval = log_iter = 1000000
dump_interval = dump_iter = 100000
start_time = time.time()
for i, line in enumerate(f):
# Identify the start of a document and skip it.
if 'wiki103' in data_dir:
if line.startswith(' = '):
dcmt_cnt += 1
continue
elif 'wiki' in data_dir:
if len(line.strip().split(' ')) == 1:
dcmt_cnt += 1
continue
if 'wiki' in data_dir:
# Remove too short lines. Book corpus does not need this.
if len(line.strip().split(' ')) < 5:
continue
# Drop empty line (1)
if len(line.strip()) == 0:
continue
tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
# tokenized_line = tokenizer.encode(line, add_special_tokens=False)
if len(tokenized_line) == 0: # Drop empty line (2)
continue
line_cnt += 1
line_starts.append(token_cnt)
if i < 5:
print()
print('Line:', line)
print('Tokens:', ' '.join(tokenizer.convert_ids_to_tokens(tokenized_line)))
token_cnt += len(tokenized_line)
cache += ' '.join(map(str, tokenized_line)) + '\n'
if (token_cnt + 1) > dump_iter:
g.write(cache)
cache = ''
dump_iter += dump_interval
if (token_cnt + 1) > log_iter:
used_time = time.time() - start_time
print("Process %d tokens in %d seconds, %0.4f tokens per second." % (
token_cnt, used_time, token_cnt / used_time))
log_iter += log_interval
# Deal with the last remaining tokens.
line_starts.append(token_cnt)
g.write(cache)
# Dump Line starts
identifier = 'sent' if lines_are_sents else 'line'
with open(data_path / ('%s.%s.%s' % (fname, tokenizer_name, identifier)), 'w') as idx_f:
for line_start in line_starts:
idx_f.write(str(line_start) + "\n")
f.close()
g.close()
print(f"Documents: {dcmt_cnt}, Lines: {line_cnt}, Words: {token_cnt} in dataset {fname}")
to_hdf5(str(data_path / fname), tokenizer_name)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"datadir", default=None, type=str, help="Directory containing the input data file."
)
parser.add_argument(
"fname", default=None, type=str, help="Name of the input data file (a text file)."
)
parser.add_argument(
"tokenizer_name", default=None, type=str, help="Name of the tokenizer, e.g., bert-base-uncased."
)
parser.add_argument(
"--lines-are-sents", action='store_true',
help="Add this if the lines are already segmented into sentences instead of paragraphs."
)
param = parser.parse_args()
tokenize_dataset(
param.datadir,
param.fname,
param.tokenizer_name,
param.lines_are_sents,
)
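The `.line`/`.sent` index written above records, for each kept line, the running token offset at which it starts, plus a final sentinel holding the total token count; a standalone sketch (names are ours):

```python
def line_start_offsets(tokenized_lines):
    starts, token_cnt = [], 0
    for toks in tokenized_lines:
        starts.append(token_cnt)  # offset of this line's first token
        token_cnt += len(toks)
    starts.append(token_cnt)      # sentinel: total token count
    return starts

assert line_start_offsets([[1, 2], [3], [4, 5, 6]]) == [0, 2, 3, 6]
```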
================================================
FILE: tokenization/tokenize_wiki103_bert.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER
================================================
FILE: tokenization/tokenize_wiki103_roberta.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER
================================================
FILE: tokenization/tokenize_wiki_bert.bash
================================================
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER
================================================
FILE: tokenization/tokenize_wiki_roberta.bash
================================================
DATA_DIR=data/wiki-cased-untokenized/
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER
================================================
FILE: vlm/__init__.py
================================================
import data
================================================
FILE: vlm/configs/bert-12L-768H.json
================================================
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
================================================
FILE: vlm/configs/bert-4L-768H.json
================================================
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 4,
"type_vocab_size": 2,
"vocab_size": 30522
}
================================================
FILE: vlm/configs/bert-6L-512H.json
================================================
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"max_position_embeddings": 512,
"num_attention_heads": 8,
"num_hidden_layers": 6,
"type_vocab_size": 2,
"vocab_size": 30522
}
================================================
FILE: vlm/configs/bert_base.json
================================================
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
================================================
FILE: vlm/data.py
================================================
import copy
import os
import random
import h5py
import torch
from torch.utils.data import DataLoader, Dataset
import tqdm
class CoLDataset(Dataset):
IGNORE_ID = -100
sent_strategy = 'first'
def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512,
split_sent=False, voken_dir=None, suffix=None, verbose=False,
voken_ablation=None):
# Open token's hdf5
token_path = file_path + '.' + tokenizer_name + '.hdf5'
assert os.path.isfile(token_path)
if verbose:
print("-------- Load Data -------")
print("Load tokens from", token_path)
self.token_hdf5 = h5py.File(token_path, 'r')
self.tokenizer = tokenizer
self.tokens = self.token_hdf5['tokens']
self.verbose = verbose
self.voken_ablation = voken_ablation
self._iter_cnt = 0
# Open voken's hdf5 and load voken ids
if voken_dir is not None:
assert suffix is not None, 'Please provide suffix of the voken, e.g., vg_nococo.5000.'
self.sent_level = 'sent' in voken_dir
dset_fname = os.path.split(file_path)[-1]
voken_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.hdf5")
voken_ids_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.ids")
if verbose:
print("Load vokens from", voken_path)
self.voken_hdf5 = h5py.File(voken_path, 'r')
self.vokens = self.voken_hdf5['vokens']
assert len(self.vokens) == len(self.tokens)
self._voken_ids = list(
map(lambda x: x.strip(),
open(voken_ids_path).readlines())
)
if verbose:
print("\t with voken size", self.voken_size)
print("\t top 5 voken ids are:", self._voken_ids[:5])
else:
self.vokens = None
# Split for every block_size tokens
# The last block without full length will be dropped.
num_tokens = len(self.tokens)
self.starts = list(range(0, num_tokens, block_size))
self.batches = list(zip(self.starts[:-1], self.starts[1:]))
manual_filtered = False
if "en.train.raw" in file_path and tokenizer_name == "bert-base-uncased":
self.batches = manual_filter(self.batches)
if verbose:
print("Data: Manually filter the range for counties.")
manual_filtered = True
# batch_info
if verbose:
print("Split sent with block size", block_size)
print(f"Total batches: {len(self.batches)}")
print(f"Total tokens: {len(self.tokens)}")
if voken_dir is not None:
print(f"Total vokens: {len(self.vokens)}")
if voken_ablation is not None:
print("The model will process voken ablation strategy:", voken_ablation)
print()
block_check(self.batches, block_size, fixed_size=True, manual_filtered=manual_filtered)
if self.voken_ablation == 'token':
self._voken_ids = list(range(30522))
@property
def voken_size(self):
return len(self._voken_ids)
@property
def voken_ids(self):
return copy.copy(self._voken_ids)
def assert_equal_vokens(self, dataset):
assert self.voken_size == dataset.voken_size
for vid, vid1 in zip(self.voken_ids, dataset.voken_ids):
assert vid == vid1
def __len__(self):
return len(self.batches) - 1
def __getitem__(self, item):
token_start, token_end = self.batches[item]
if self._iter_cnt < 5 and self.verbose:
print(f"Data Loader: data iteration {self._iter_cnt}, with range {token_start} to {token_end}.")
self._iter_cnt += 1
tokens = list(self.tokens[token_start: token_end])
token_tensor = torch.tensor(
self.tokenizer.build_inputs_with_special_tokens(tokens),
dtype=torch.long)
if self.vokens is not None:
vokens = list(self.vokens[token_start: token_end])
vokens = self.maybe_do_sent_level(vokens)
vokens = self.maybe_do_ablation_study(vokens, tokens)
voken_tensor = torch.tensor(
[self.IGNORE_ID] + vokens + [self.IGNORE_ID],
dtype=torch.long
)
return token_tensor, voken_tensor
else:
return token_tensor
def maybe_do_sent_level(self, vokens):
if not self.sent_level:
return vokens
else:
if self.sent_strategy == 'all':
vokens = [
(-voken-1 if voken < 0 else voken)
for voken in vokens
]
elif self.sent_strategy == 'first':
vokens = [
(self.IGNORE_ID if voken < 0 else voken)
for voken in vokens
]
return vokens
def maybe_do_ablation_study(self, vokens, tokens):
if self.voken_ablation is None:
return vokens
else:
if self._iter_cnt < 5 and self.verbose:
print("Before voken ablation: ", vokens)
if self.voken_ablation == 'random':
vokens = [random.randint(0, self.voken_size - 1)
for _ in range(len(vokens))]
elif self.voken_ablation == 'shuffle':
random.shuffle(vokens)
elif self.voken_ablation == 'reverse':
vokens = vokens[::-1]
elif self.voken_ablation == 'token':
vokens = tokens
if self._iter_cnt < 5 and self.verbose:
print("After voken ablation: ", vokens)
return vokens
def get_item_info(self, item):
token_start, token_end = self.batches[item]
return token_start, token_end
def __del__(self):
self.token_hdf5.close()
if self.vokens is not None:
self.voken_hdf5.close()
FORBIDDEN_RANGE = (
119314944, # Start of iter 3700
187053048 # End of iter 5800
)
def intersect(x, y):
x1, x2 = x
y1, y2 = y
if x2 <= y1 or x1 >= y2:
# Case 1: [ x )[ y )
# Case 2: [ y )[ x )
return False
return True
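The half-open interval test above is easy to get wrong at the boundaries; here is a minimal standalone sketch (re-implemented so it does not depend on the module) of the disjointness check for `[x1, x2)` against `[y1, y2)`:

```python
def intervals_overlap(x, y):
    """True iff half-open intervals [x1, x2) and [y1, y2) share any point."""
    (x1, x2), (y1, y2) = x, y
    # Disjoint iff x ends before y starts, or x starts at/after y's end.
    return not (x2 <= y1 or x1 >= y2)

# Touching endpoints do NOT overlap for half-open intervals.
assert not intervals_overlap((0, 5), (5, 10))
# A batch straddling a forbidden-range boundary does overlap.
assert intervals_overlap((4, 6), (5, 10))
# Containment overlaps.
assert intervals_overlap((6, 7), (5, 10))
```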
def manual_filter(batches):
batches = list(filter(
lambda x: not intersect(x, FORBIDDEN_RANGE),
batches
))
return batches
def block_check(batches, block_size, fixed_size=False, manual_filtered=False):
"""
Check whether the batches satisfy the following requirements:
1. Monotonic
2. Mutually exclusive
3. Range <= block_size
"""
last_end = 0
for start_token, end_token in batches:
assert last_end <= start_token
if fixed_size:
assert (end_token - start_token) == block_size, 'len([%d, %d)) != %d' % (start_token, end_token, block_size)
else:
assert (end_token - start_token) <= block_size, 'len([%d, %d)) > %d' % (start_token, end_token, block_size)
if manual_filtered:
assert not intersect((start_token, end_token), FORBIDDEN_RANGE)
last_end = end_token
def get_voken_feats(dataset: CoLDataset, feat_dir: str):
"""
Load the pre-extracted visual features for the img_ids of the vokens.
"""
set2id2feat = {}
voken_feats = []
for voken_id in dataset.voken_ids:
voken_img_set, voken_img_id = voken_id.split('/')
if voken_img_set not in set2id2feat:
img_ids = list(map(
lambda x: x.rstrip(),
open(os.path.join(feat_dir, f"{voken_img_set}.ids"))
))
img_feats = h5py.File(
os.path.join(feat_dir, f"{voken_img_set}.hdf5"), 'r'
)['keys'][:]
id2feat = {}
assert len(img_ids) == len(img_feats)
for img_id, img_feat in zip(img_ids, img_feats):
id2feat[img_id] = img_feat
set2id2feat[voken_img_set] = id2feat
voken_feats.append(set2id2feat[voken_img_set][voken_img_id])
return voken_feats
================================================
FILE: vlm/model.py
================================================
import math
import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss
from torch import nn
from transformers import (
BertConfig,
BertForMaskedLM,
)
from transformers.modeling_bert import BertOnlyMLMHead
BertLayerNorm = torch.nn.LayerNorm
# The gelu function below is copied from huggingface transformers:
# https://github.com/huggingface/transformers/blob/c6acd246ec90857b70f449dcbcb1543f150821fc/src/transformers/activations.py
def _gelu_python(x):
""" Original Implementation of the gelu activation function in Google Bert repo when initially created.
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
Also see https://arxiv.org/abs/1606.08415
"""
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
# Compare version components numerically; a plain string comparison would
# misorder versions such as "1.10.0" vs "1.4.0".
if tuple(int(v) for v in torch.__version__.split(".")[:2]) < (1, 4):
gelu = _gelu_python
else:
gelu = F.gelu
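The docstring of `_gelu_python` mentions that OpenAI GPT's tanh-based gelu is a close approximation of the exact erf-based form; a pure-Python sketch (using `math.erf`, no torch needed) shows the two stay within about 1e-3 of each other:

```python
import math

def gelu_exact(x):
    # Exact gelu: x * Phi(x), with Phi the standard normal CDF via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation quoted in the docstring above.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(gelu_exact(v) - gelu_tanh(v)) < 1e-3
assert gelu_exact(0.0) == 0.0
```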
class CoLBertConfig(BertConfig):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.voken_size = None
self.voken_dim = None
self.do_voken_cls = False
self.do_voken_reg = False
self.do_voken_ctr = False
self.shared_head = False
self.verbose = False
class BertSharedHead(BertOnlyMLMHead):
"""Bert Head for masked language modeling."""
def __init__(self, config):
super().__init__(config)
self.do_voken_cls = config.do_voken_cls
self.do_voken_ctr = config.do_voken_ctr
assert int(self.do_voken_cls) + int(self.do_voken_ctr) == 1
if self.do_voken_cls:
self.visn_decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
if self.do_voken_ctr:
self.visn_decoder = nn.Linear(config.voken_dim, config.hidden_size, bias=True)
def forward(self, features, **kwargs):
"""
:param features: [batch, length, dim]
:return: lang_scores [batch, length, vocab_size],
visn_scores [batch, length, voken_size]
"""
x = self.predictions.transform(features) # batch_size, length, dim
lang_scores = self.predictions.decoder(x) + self.predictions.bias
if self.do_voken_cls:
visn_scores = self.visn_decoder(x)
elif self.do_voken_ctr:
voken_feats = kwargs['voken_feats']
y = self.visn_decoder(voken_feats) # voken_size, dim
visn_scores = torch.einsum('bik,jk->bij', x, y)
else:
assert False
return lang_scores, visn_scores
class BertVLMClassificationHead(nn.Module):
"""Bert Head for masked language modeling."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
# self.decoder = nn.Sequential(
# nn.Linear(config.hidden_size, 256, bias=True),
# nn.Linear(256, config.voken_size, bias=True),
# )
if config.verbose:
print(f"VLM Classification Head: Build model with voken_size {config.voken_size}")
def forward(self, features, **kwargs):
x = self.dense(features)
x = gelu(x)
x = self.layer_norm(x)
x = self.decoder(x)
return x
class BertVLMContrastiveHeadNew(nn.Module):
"""Bert Head for masked language modeling."""
def __init__(self, config):
super().__init__()
self.joint_dim = 512
print(f"Contrastive Head: Using joint dim {self.joint_dim}")
self.voken_size = config.voken_size
self.dense = nn.Linear(config.hidden_size, self.joint_dim)
self.layer_norm_x = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)
self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)
self.layer_norm_y = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)
def forward(self, bert_output, voken_feats, **kwargs):
# Process the bert output
x = self.dense(bert_output)
x = gelu(x)
x = self.layer_norm_x(x)
# Process the pre-trained voken feats.
y = self.decoder_voken_feat(voken_feats) # [v, f] --> [v, 64]
y = self.layer_norm_y(y)
score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
assert score.dim() == 3 and score.shape[2] == self.voken_size
return score
class BertVLMContrastiveHead(nn.Module):
"""Bert Head for masked language modeling."""
def __init__(self, config):
super().__init__()
self.voken_size = config.voken_size
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.joint_dim = 64
self.decoder_bert_output = nn.Linear(config.hidden_size, self.joint_dim, bias=False)
self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)
def forward(self, bert_output, voken_feats, **kwargs):
# Process the bert output
x = self.dense(bert_output)
x = gelu(x)
x = self.layer_norm(x)
x = self.decoder_bert_output(x) # [b, l, f] --> [b, l, 64]
# Process the pre-trained voken feats.
y = self.decoder_voken_feat(voken_feats) # [v, f] --> [v, 64]
score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
assert score.dim() == 3 and score.shape[2] == self.voken_size
return score
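The contraction `'ijf,vf->ijv'` in both contrastive heads scores every token representation against every voken embedding by a dot product in the joint space, scaled by the square root of the joint dimension (as in scaled dot-product attention). A minimal pure-Python sketch of that contraction with toy dimensions (illustrative only, no torch dependency):

```python
import math

def contrastive_scores(x, y, d):
    """x: [batch][length][d] token vectors; y: [voken][d] voken vectors.
    Returns [batch][length][voken] scaled dot-product scores,
    i.e. the einsum 'ijf,vf->ijv' divided by sqrt(d)."""
    return [
        [
            [sum(xf * yf for xf, yf in zip(tok, vok)) / math.sqrt(d) for vok in y]
            for tok in seq
        ]
        for seq in x
    ]

d = 4
x = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]]   # 1 sequence, 2 tokens
y = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]     # 2 vokens
scores = contrastive_scores(x, y, d)
assert scores[0][0][0] == 1.0 / math.sqrt(4)   # token 0 aligns with voken 0
assert scores[0][1][1] == 2.0 / math.sqrt(4)   # token 1 aligns with voken 1
assert scores[0][0][1] == 0.0                  # orthogonal pairs score zero
```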
class BertVLMRegressionHead(nn.Module):
"""Bert Head for masked language modeling."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.decoder = nn.Linear(config.hidden_size, config.voken_dim, bias=True)
def forward(self, features, **kwargs):
x = self.dense(features)
x = gelu(x)
x = self.layer_norm(x)
# Project to the voken feature dimension (with bias)
x = self.decoder(x)
return x
class CoLwithBert(BertForMaskedLM):
config_class = CoLBertConfig
def __init__(self, config):
super().__init__(config)
self.do_voken_cls = config.do_voken_cls
self.do_voken_reg = config.do_voken_reg
self.do_voken_ctr = config.do_voken_ctr
self.shared_head = config.shared_head
self.verbose = config.verbose
if self.verbose:
print(f"Model: do voken cls -- {self.do_voken_cls}, do_voken_reg -- {self.do_voken_reg},"
f" do voken ctr -- {self.do_voken_ctr}")
self.token_cls_loss_fct = CrossEntropyLoss()
if self.shared_head:
if self.verbose:
print("Model: Using shared head for Voken and Token predictions.")
self.cls = BertSharedHead(config)
# Reinit the weight of the new head.
self.init_weights()
else:
# Voken Classification
if config.do_voken_cls:
self.visual_cls_head = BertVLMClassificationHead(config)
# Voken Regression
if config.do_voken_reg:
assert config.voken_dim is not None, "you need to set voken dim in the config."
self.visual_reg_head = BertVLMRegressionHead(config)
# Voken Contrastive
if config.do_voken_ctr:
assert config.voken_dim is not None, "you need to set voken dim in the config."
self.visual_ctr_head = BertVLMContrastiveHeadNew(config)
# Build voken features embeddings if needed.
if self.do_voken_ctr or self.do_voken_reg:
# The voken emb will be preloaded by func "init_voken_feat_emb"
self.voken_feat_emb = nn.Embedding(
config.voken_size,
config.voken_dim
)
# Freeze this embedding
for p in self.voken_feat_emb.parameters():
p.requires_grad = False
# Build Loss functions
if config.do_voken_cls:
# Voken Classification
self.voken_cls_loss_fct = CrossEntropyLoss()
if config.do_voken_reg:
# Voken Regression
self.voken_reg_loss_fct = SmoothL1Loss(reduction='none')
# self.voken_reg_loss_fct = torch.nn.L1Loss(reduction='none')
if config.do_voken_ctr:
# Voken Contrastive
self.voken_ctr_loss_fct = CrossEntropyLoss()
def init_voken_feat_emb(self, feats):
if self.verbose:
print(f"Model: load the voken features with shape {feats.shape}")
print("\tBefore Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
assert feats.shape == (self.config.voken_size, self.config.voken_dim)
self.voken_feat_emb.weight.data[:] = torch.Tensor(feats)
self.original_voken_feats = torch.Tensor(feats).clone()
self.original_voken_feats = self.original_voken_feats.half()
if self.verbose:
print("\tAfter Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
print("\tThe 1st, 2nd, and last voken feats are: ")
print("\t", self.voken_feat_emb.weight[0])
print("\t", self.voken_feat_emb.weight[1])
print("\t", self.voken_feat_emb.weight[-1])
assert not self.voken_feat_emb.weight.requires_grad
# print(self.voken_feat_emb.weight.dtype)
# assert torch.all(torch.eq(self.voken_feat_emb.weight.cuda(),
# self.original_voken_feats)), "The voken feats have been updated during training."
def to(self, *args):
if self.do_voken_ctr or self.do_voken_reg:
self.original_voken_feats = self.original_voken_feats.to(*args)
return super().to(*args)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
masked_lm_labels=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
lm_labels=None,
voken_labels=None,
):
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
)
sequence_output = outputs[0]
if not self.shared_head:
voken_loss = 0.
if self.do_voken_cls:
assert voken_labels is not None
voken_scores = self.visual_cls_head(sequence_output)
voken_cls_loss = self.voken_cls_loss_fct(voken_scores.view(-1, self.config.voken_size), voken_labels.view(-1))
voken_loss += voken_cls_loss
if self.do_voken_reg:
assert voken_labels is not None
voken_prediction = self.visual_reg_head(sequence_output)
# Get the mask and pre-trained features
voken_label_mask = (voken_labels != -100) # Get a mask of [0, 1, 1, ...., 1, 0], [b, len]
safe_voken_labels = voken_labels.clone()
safe_voken_labels[~voken_label_mask] = 0
voken_feats = self.voken_feat_emb(safe_voken_labels) # [b, len] --> [b, len, f]
# Loss
voken_reg_loss = self.voken_reg_loss_fct(voken_prediction, voken_feats) # [b, len, f]
# [b, l, f] * ([b,l] --> [b, l, 1]) = [b, l, f]
voken_reg_loss = (voken_reg_loss * voken_label_mask.float().unsqueeze(-1))
# [b, l, f] --sum-> [b, l] --mean-> [1,]
voken_reg_loss = voken_reg_loss.sum(-1).mean()
voken_loss += voken_reg_loss
if self.do_voken_ctr:
assert torch.all(torch.eq(self.voken_feat_emb.weight,
self.original_voken_feats)), "The voken feats have been updated during training."
voken_scores = self.visual_ctr_head(
sequence_output, self.voken_feat_emb.weight
)
voken_ctr_loss = self.voken_ctr_loss_fct(
voken_scores.view(-1, self.config.voken_size),
voken_labels.view(-1)
)
voken_loss += voken_ctr_loss
if masked_lm_labels is not None:
prediction_scores = self.cls(sequence_output)
token_loss = self.token_cls_loss_fct(
prediction_scores.view(-1, self.config.vocab_size),
masked_lm_labels.view(-1))
else:
token_loss = torch.tensor(0.)
else:
voken_loss, token_loss = self.calculate_shared_loss(
sequence_output,
masked_lm_labels,
voken_labels,
)
return voken_loss, token_loss
def calculate_shared_loss(self, sequence_output, masked_lm_labels, voken_labels):
if self.do_voken_cls:
lang_scores, visn_scores = self.cls(sequence_output)
else:
lang_scores, visn_scores = self.cls(
sequence_output,
voken_feats=self.voken_feat_emb.weight
)
assert voken_labels is not None
voken_loss_func = self.voken_cls_loss_fct if self.do_voken_cls else self.voken_ctr_loss_fct
voken_loss = voken_loss_func(
visn_scores.view(-1, self.config.voken_size),
voken_labels.view(-1)
)
if masked_lm_labels is not None:
token_loss = self.token_cls_loss_fct(
lang_scores.view(-1, self.config.vocab_size),
masked_lm_labels.view(-1)
)
else:
token_loss = torch.tensor(0.)
return voken_loss, token_loss
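Both the token and voken losses above rely on `CrossEntropyLoss`'s convention of skipping positions labeled -100 (the `IGNORE_ID` the dataset attaches to special-token positions). A minimal pure-Python sketch of that masked mean cross-entropy (illustrative only, not the torch implementation; assumes at least one non-ignored position):

```python
import math

IGNORE_ID = -100

def masked_cross_entropy(logits, labels):
    """logits: [n][num_classes]; labels: [n], with IGNORE_ID marking skipped
    positions. Returns the mean negative log-softmax over scored positions."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_ID:
            continue
        log_z = math.log(sum(math.exp(v) for v in row))  # log partition
        losses.append(log_z - row[label])                # -log softmax[label]
    return sum(losses) / len(losses)

logits = [[2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
labels = [0, 1, IGNORE_ID]                 # last position is ignored
loss = masked_cross_entropy(logits, labels)
# Both scored positions have the correct class 2 logits ahead,
# so the mean loss equals log(1 + e^-2).
assert abs(loss - math.log(1 + math.exp(-2.0))) < 1e-9
```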
class SimpleBertForMaskedLM(BertForMaskedLM):
def __init__(self, config):
super().__init__(config)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
masked_lm_labels=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
lm_labels=None,
):
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
)
sequence_output = outputs[0]
prediction_scores = self.cls(sequence_output)
loss_fct = CrossEntropyLoss()
token_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
return token_loss,
================================================
FILE: vlm/param.py
================================================
import argparse
def process_args():
parser = argparse.ArgumentParser()
# Datasets
parser.add_argument(
"--train_data_file", default=None, type=str,
help="The input training data file (a text file).")
parser.add_argument(
"--eval_data_file", default=None, type=str,
help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
# Data loader
parser.add_argument("--col_data", action="store_true", help="Using the specific dataset object in data.py")
parser.add_argument("--split_sent", action="store_true", help="Overwrite the cached training and evaluation sets")
parser.add_argument("--shuffle", action="store_true", help="Shuffle the training dataset")
parser.add_argument(
"--block_size", default=-1, type=int,
help="Optional input sequence length after tokenization. "
"The training dataset will be truncated in blocks of this size for training. "
"Defaults to the model max input length for single-sentence inputs (taking special tokens into account).",
)
# Logging and Saving
parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
parser.add_argument(
"--output_dir", type=str,
help="The output directory where the model predictions and checkpoints will be written.",)
parser.add_argument(
"--overwrite_output_dir", action="store_true",
help="Overwrite the content of the output directory")
# Model types
parser.add_argument(
"--model_type", type=str, help="The model architecture to be trained or fine-tuned.",)
parser.add_argument(
"--should_continue", action="store_true", help="Whether to continue from latest checkpoint in output_dir")
parser.add_argument(
"--model_name_or_path", default=None, type=str,
help="The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.",)
parser.add_argument(
"--config_name", default=None, type=str,
help="Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.",)
parser.add_argument(
"--tokenizer_name", default=None, type=str,
help="Optional pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.",)
parser.add_argument(
"--cache_dir", default=None, type=str,
help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)",)
parser.add_argument(
"--overwrite_cache", action="store_true",
help="Overwrite the cached training and evaluation sets")
# MLM tasks
parser.add_argument(
"--mlm", action="store_true", help="Train with masked-language modeling loss instead of language modeling.")
parser.add_argument(
"--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss")
parser.add_argument(
"--mlm_ratio", type=float, default=1., help="The ratio of mlm loss in the total loss.")
# VLM related params
parser.add_argument("--voken_dir", type=str, default='snap1/coco_hinge05_dim64_resxt101_robertal4/vokens',
help='Where the vokens are saved')
parser.add_argument("--voken_suffix", type=str, default='vg_nococo.10000',
help='The suffix after the voken file, e.g., en.train.raw.{suffix} where suffix==vgcoco.1000')
parser.add_argument("--voken_labels", type=str, default='all',
help='all: Calculate voken loss for all tokens;'
'mask: Calculate voken loss for masked tokens.'
'nonmask: Calculate voken loss for non-masked tokens.')
parser.add_argument("--voken_feat_dir", type=str, default=None,
help='Where the vokens are saved')
parser.add_argument("--do_voken_cls", action='store_true', help='Will do voken classification task')
parser.add_argument("--do_voken_reg", action='store_true', help='Will do voken regression task (not used in this paper)')
parser.add_argument("--do_voken_ctr", action='store_true', help='Will do voken contrastive task (not used in this paper)')
parser.add_argument("--shared_head", action='store_true', help='Share the head if more than one tasks (e.g., cls, reg, ctr) are used (not used in this paper)')
# Batch Size and Training Steps
parser.add_argument("--seed", type=int, default=95, help="random seed for initialization")
parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, help="Batch size per GPU/CPU for evaluation.")
parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",)
parser.add_argument("--num_train_epochs", default=1.0, type=float, help="Total number of training epochs to perform.")
parser.add_argument("--max_steps", default=-1, type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",)
# Optimizer
parser.add_argument("--lamb", action="store_true", help='Use the LAMB optimizer in apex')
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--warmup_ratio", default=0., type=float, help="Linear warmup over warmup_steps.")
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
# Distributed Training
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--nodes", type=int, default=1)
parser.add_argument("--nr", type=int, default=0)
# Half Precision
parser.add_argument(
"--fp16", action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",)
parser.add_argument(
"--fp16_opt_level", type=str, default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",)
# Ablation Study
parser.add_argument("--voken_ablation", default=None,
help="random, shuffle, reverse, token")
args = parser.parse_args()
return args
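The parser above mixes boolean flags (`action="store_true"`, defaulting to False) with typed defaults. A minimal sketch of that pattern, parsed against an explicit argv list rather than `sys.argv` (reusing a few of the flag names defined above):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--do_voken_cls", action="store_true")   # boolean flag, default False
parser.add_argument("--mlm_ratio", type=float, default=1.0)  # typed default
parser.add_argument("--voken_ablation", default=None)        # optional string

args = parser.parse_args(["--do_voken_cls", "--mlm_ratio", "0.5"])
assert args.do_voken_cls is True
assert args.mlm_ratio == 0.5
assert args.voken_ablation is None   # untouched flags keep their defaults
```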
================================================
FILE: vlm/run_glue.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa)."""
import argparse
import glob
import json
import logging
import os
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
WEIGHTS_NAME,
AdamW,
AutoConfig,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
glue_compute_metrics as compute_metrics,
glue_convert_examples_to_features as convert_examples_to_features,
glue_output_modes as output_modes,
glue_processors as processors,
)
# from transformers import glue_compute_metrics as compute_metrics
# from transformers import glue_convert_examples_to_features as convert_examples_to_features
# from transformers import glue_output_modes as output_modes
# from transformers import glue_processors as processors
try:
from torch.utils.tensorboard import SummaryWriter
except ImportError:
from tensorboardX import SummaryWriter
logger = logging.getLogger(__name__)
#MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys())
#MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
#ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
def train(args, train_dataset, model, tokenizer):
""" Train the model """
# if args.local_rank in [-1, 0]:
# tb_writer = SummaryWriter()
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
num_warmup_steps = int(t_total * args.warmup_steps)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total
)
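`get_linear_schedule_with_warmup` ramps the learning-rate multiplier linearly from 0 to 1 over the warmup steps, then decays it linearly back to 0 at `t_total`. A pure-Python sketch of that multiplier (mirroring the transformers schedule; illustrative, not the library code):

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp to 1.0 over the warmup steps,
    then linear decay to 0.0 at num_training_steps."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(
        0.0,
        (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps),
    )

assert linear_warmup_decay(0, 10, 100) == 0.0    # start of warmup
assert linear_warmup_decay(5, 10, 100) == 0.5    # halfway through warmup
assert linear_warmup_decay(10, 10, 100) == 1.0   # warmup complete
assert linear_warmup_decay(55, 10, 100) == 0.5   # halfway through decay
assert linear_warmup_decay(100, 10, 100) == 0.0  # fully decayed
```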
# Check if saved optimizer or scheduler states exist
#if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
#os.path.join(args.model_name_or_path, "scheduler.pt")
#):
## Load in optimizer and scheduler states
#optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
#scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
epochs_trained = 0
steps_trained_in_current_epoch = 0
# Check if continuing training from a checkpoint
#if os.path.exists(args.model_name_or_path):
# set global_step to global_step of last saved checkpoint from model path
#try:
#global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
#except ValueError:
#global_step = 0
#epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
#steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
#logger.info(" Continuing training from checkpoint, will skip to saved global_step")
#logger.info(" Continuing training from epoch %d", epochs_trained)
#logger.info(" Continuing training from global step %d", global_step)
#logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
)
set_seed(args)  # Added here for reproducibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
# Skip past any already trained steps if resuming training
if steps_trained_in_current_epoch > 0:
steps_trained_in_current_epoch -= 1
continue
model.train()
batch = tuple(t.to(args.device) for t in batch)
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0 or (
# last step in epoch but step is always smaller than gradient_accumulation_steps
len(epoch_iterator) <= args.gradient_accumulation_steps
and (step + 1) == len(epoch_iterator)
):
if args.fp16:
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
logs = {}
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
eval_key = "eval_{}".format(key)
logs[eval_key] = value
loss_scalar = (tr_loss - logging_loss) / args.logging_steps
learning_rate_scalar = scheduler.get_lr()[0]
logs["learning_rate"] = learning_rate_scalar
logs["loss"] = loss_scalar
logging_loss = tr_loss
#for key, value in logs.items():
#tb_writer.add_scalar(key, value, global_step)
print(json.dumps({**logs, **{"step": global_step}}))
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
logger.info("Saving optimizer and scheduler states to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
#if args.local_rank in [-1, 0]:
#tb_writer.close()
return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, prefix=""):
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)
results = {}
for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
os.makedirs(eval_output_dir)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
# multi-gpu eval
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
model = torch.nn.DataParallel(model)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
for batch in tqdm(eval_dataloader, desc="Evaluating"):
model.eval()
batch = tuple(t.to(args.device) for t in batch)
with torch.no_grad():
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
outputs = model(**inputs)
tmp_eval_loss, logits = outputs[:2]
eval_loss += tmp_eval_loss.mean().item()
nb_eval_steps += 1
if preds is None:
preds = logits.detach().cpu().numpy()
out_label_ids = inputs["labels"].detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
if args.output_mode == "classification":
preds = np.argmax(preds, axis=1)
elif args.output_mode == "regression":
preds = np.squeeze(preds)
result = compute_metrics(eval_task, preds, out_label_ids)
results.update(result)
print(eval_output_dir, prefix)
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
return results
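The `np.append` accumulation in `evaluate` above re-copies the growing array on every batch; collecting batch outputs in a list and concatenating once is equivalent and linear-time. A minimal standalone sketch (toy logits, not the real model outputs):

```python
import numpy as np

# Two "batches" of classifier logits, as evaluate() would see them.
batches = [np.array([[0.1, 0.9]]), np.array([[0.8, 0.2], [0.3, 0.7]])]
# One concatenation instead of repeated np.append calls.
preds = np.concatenate(batches, axis=0)
print(np.argmax(preds, axis=1).tolist())  # [1, 0, 1]
```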
def load_and_cache_examples(args, task, tokenizer, evaluate=False):
if args.local_rank not in [-1, 0] and not evaluate:
torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
processor = processors[task]()
output_mode = output_modes[task]
# Load data features from cache or dataset file
cached_features_file = os.path.join(
args.data_dir,
"cached_{}_{}_{}_{}".format(
"dev" if evaluate else "train",
#list(filter(None, args.model_name_or_path.split("/"))).pop(),
args.tokenizer_name,
str(args.max_seq_length),
str(task),
),
)
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
logger.info("Creating features from dataset file at %s", args.data_dir)
label_list = processor.get_labels()
if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
examples = (
processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
)
features = convert_examples_to_features(
examples,
tokenizer,
label_list=label_list,
max_length=args.max_seq_length,
output_mode=output_mode,
# pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
# pad_token=tokenizer.pad_token_id,
# pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(features, cached_features_file)
for i in range(3):
print('ids:', features[i].input_ids)
print('tokens:', tokenizer.convert_ids_to_tokens(features[i].input_ids))
print('att:', features[i].attention_mask)
if args.local_rank == 0 and not evaluate:
torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
if output_mode == "classification":
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
elif output_mode == "regression":
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
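`load_and_cache_examples` above follows a cache-or-build pattern: load the features file if it exists, otherwise build and save it. A minimal sketch of the same pattern, using `pickle` in place of `torch.save`/`torch.load` (the names here are illustrative, not from the repo):

```python
import os
import pickle
import tempfile

def load_or_build(cache_file, build):
    # Hit the cache when present, otherwise build and persist.
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    features = build()
    with open(cache_file, "wb") as f:
        pickle.dump(features, f)
    return features

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "cached_train")
    first = load_or_build(path, lambda: [1, 2, 3])   # builds and caches
    second = load_or_build(path, lambda: [9, 9, 9])  # hits the cache instead
print(first, second)  # [1, 2, 3] [1, 2, 3]
```

As in the original, a stale cache must be invalidated explicitly (the repo exposes `--overwrite_cache` for this).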
def main():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
#help="Model type selected in the list: " + ", ".join(MODEL_TYPES),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
#help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
)
parser.add_argument(
"--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
)
parser.add_argument(
"--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=float, help="Linear warmup over warmup_steps.")
parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name and ending with a step number",
)
parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
parser.add_argument("--from_scratch", action="store_true", help="Train from scratch instead of loading pretrained weights")
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
)
parser.add_argument(
"--nopooler", action="store_true", help="Do not load the pooler",
)
parser.add_argument("--seed", type=int, default=9595, help="random seed for initialization")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
args = parser.parse_args()
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir
)
)
# Setup distant debugging if needed
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend="nccl")
args.n_gpu = 1
args.device = device
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
# Prepare GLUE task
args.task_name = args.task_name.lower()
if args.task_name not in processors:
raise ValueError("Task not found: %s" % (args.task_name))
processor = processors[args.task_name]()
args.output_mode = output_modes[args.task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
args.model_type = args.model_type.lower()
config = AutoConfig.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None,
)
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
)
model = AutoModelForSequenceClassification.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None,
)
if args.nopooler:
model.bert.pooler.apply(model._init_weights)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
# Create output directory if needed
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
os.makedirs(args.output_dir)
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
prefix = prefix if 'checkpoint' in prefix else ''
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, prefix=prefix)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results
if __name__ == "__main__":
main()
================================================
FILE: vlm/run_glue_epochs.py
================================================
import argparse
import math
import os
from pathlib import Path
from pprint import pprint
import subprocess
import threading
import time
import torch
parser = argparse.ArgumentParser()
parser.add_argument(
"--load", default=None, type=str,
help="The directory of the model to load, e.g., snap/vlm/wiki103_small"
)
parser.add_argument(
"--gpus", default=None, type=str,
help="The list of GPU ids, separated by comma, e.g., '2,3'"
)
parser.add_argument(
"--snaps", default=1, type=int,
help="The number of snaps evaluated with GLUE benchmark. "
"-1 means all."
)
parser.add_argument(
"--start-from", default=0, type=int
)
args = parser.parse_args()
if args.gpus is None:
# Get all gpus available in this server.
num_gpus = torch.cuda.device_count()
# The device ids are labeled from 0 to num_gpus-1.
available_gpus = list(range(num_gpus))
else:
available_gpus = [int(gpu_id) for gpu_id in args.gpus.split(",")]
num_gpus = len(available_gpus)
resource = threading.Semaphore(num_gpus)
def get_snap_paths(load):
load_path = Path(load)
paths = []
for dir_path in load_path.iterdir():
if dir_path.name.startswith("checkpoint-"):
paths.append(dir_path)
return paths
def sorted_paths(paths):
pathXkey = []
for path in paths:
name = path.name
identifier = name[len("checkpoint-"):]
if identifier == 'last':
continue
if 'epoch' in identifier:
key = identifier
else:
key = int(identifier)
pathXkey.append((path, key))
pathXkey = sorted(pathXkey, key=lambda x: x[1])
paths = list(map(lambda x: x[0], pathXkey))
return paths
def get_test_paths(paths, snaps):
"""
Return $snaps paths to be tested on GLUE
"""
if snaps == -1:
return paths
interval = len(paths) * 1. / snaps
test_paths = []
for i in range(1, snaps+1):
idx = int(math.ceil(interval * i)) - 1
test_paths.append(paths[idx])
return test_paths
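`get_test_paths` above selects `snaps` evenly spaced checkpoints, always including the last one. A standalone sketch of the same index arithmetic (the function name `pick_snaps` is illustrative):

```python
import math

def pick_snaps(paths, snaps):
    # Evenly spaced selection; the last element is always chosen
    # because ceil(interval * snaps) - 1 == len(paths) - 1.
    if snaps == -1:
        return paths
    interval = len(paths) / snaps
    return [paths[int(math.ceil(interval * i)) - 1] for i in range(1, snaps + 1)]

# e.g., 10 checkpoints, 5 GLUE evaluations -> every second checkpoint
print(pick_snaps(list(range(10)), 5))  # [1, 3, 5, 7, 9]
```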
# Get all paths that need to be processed
paths = get_snap_paths(args.load)
paths = sorted_paths(paths)
paths = paths[args.start_from:]
paths = get_test_paths(paths, args.snaps)
paths = paths[::-1] # Run the last epochs first.
path_lock = threading.Lock()
def run_glue():
while True:
# Only one atomic operation (list.pop) is involved here, so no lock is needed;
# the semaphore is enough to control the resources.
resource.acquire()
gpu_id = available_gpus.pop(0)
# Multiple atomic operations (list.__len__, list.pop) are involved,
# so a lock is introduced here.
path_lock.acquire()
if len(paths) > 0:
path = paths.pop(0)
else:
path_lock.release()
break
path_lock.release()
model = path.parent
ckpt = path.name
print(gpu_id, model, ckpt)
process = subprocess.Popen(
['bash',
'scripts/run_glue_at_epoch.bash',
str(gpu_id), # Use GPU
'3', # Number of epochs
model,
ckpt
],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
available_gpus.append(gpu_id)
resource.release()
# Sleeping here allows the script (run_glue_at_epoch.bash) to finish,
# so that all GPU memory is released.
time.sleep(5)
return
# Allocate one thread per GPU
threads = []
for _ in range(num_gpus):
threads.append(
threading.Thread(target=run_glue)
)
for thread in threads:
thread.start()
# Join so that the main thread waits for all worker threads to finish.
for thread in threads:
thread.join()
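The semaphore-plus-lock scheduling above can equivalently be expressed with thread-safe `queue.Queue` objects, which bundle the blocking and the mutual exclusion. A minimal sketch under that assumption (`run_all` and `work` are illustrative names, not repo APIs):

```python
import queue
import threading

def run_all(jobs, gpu_ids, work):
    gpus = queue.Queue()            # pool of free GPU ids
    for g in gpu_ids:
        gpus.put(g)
    job_q = queue.Queue()           # pending jobs
    for j in jobs:
        job_q.put(j)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                job = job_q.get_nowait()
            except queue.Empty:
                return              # no jobs left; thread exits
            gpu = gpus.get()        # blocks until a GPU is free
            try:
                r = work(gpu, job)
            finally:
                gpus.put(gpu)       # always release the GPU
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in gpu_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(sorted(run_all([1, 2, 3], [0, 1], lambda g, j: j * 10)))  # [10, 20, 30]
```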
================================================
FILE: vlm/run_lm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""
import argparse
import glob
import json
import logging
import os
import pickle
import random
import re
import shutil
import sys
from typing import Dict, List, Tuple
from datetime import datetime
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForMaskedLM,
BertTokenizer,
CamembertConfig,
CamembertForMaskedLM,
CamembertTokenizer,
DistilBertConfig,
DistilBertForMaskedLM,
DistilBertTokenizer,
GPT2Config,
GPT2LMHeadModel,
GPT2Tokenizer,
OpenAIGPTConfig,
OpenAIGPTLMHeadModel,
OpenAIGPTTokenizer,
PreTrainedModel,
PreTrainedTokenizer,
RobertaConfig,
RobertaForMaskedLM,
RobertaTokenizer,
get_linear_schedule_with_warmup,
)
sys.path.append(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
from vlm.data import CoLDataset
from vlm.param import process_args
from vlm.model import SimpleBertForMaskedLM
try:
from torch.utils.tensorboard import SummaryWriter
except ImportError:
from tensorboardX import SummaryWriter
logger = logging.getLogger(__name__)
MODEL_CLASSES = {
"gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
"openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
"bert": (BertConfig, SimpleBertForMaskedLM, BertTokenizer),
"roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
"camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}
class TextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
assert os.path.isfile(file_path)
block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
directory, filename = os.path.split(file_path)
cached_features_file = os.path.join(
directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename
)
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
with open(cached_features_file, "rb") as handle:
self.examples = pickle.load(handle)
else:
logger.info("Creating features from dataset file at %s", directory)
self.examples = []
with open(file_path, encoding="utf-8") as f:
text = f.read()
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
for i in range(0, len(tokenized_text) - block_size + 1, block_size): # Truncate in block of block_size
self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))
# Note that we are losing the last truncated example here for the sake of simplicity (no padding).
# If your dataset is small, first you should look for a bigger one :-) and second you
# can change this behavior by adding (model-specific) padding.
logger.info("Saving features into cached file %s", cached_features_file)
with open(cached_features_file, "wb") as handle:
pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
def __len__(self):
return len(self.examples)
def __getitem__(self, item):
return torch.tensor(self.examples[item], dtype=torch.long)
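The chunking loop in `TextDataset` above splits the token stream into non-overlapping `block_size` pieces and drops the trailing remainder. The same logic in isolation (`chunk` is an illustrative name):

```python
def chunk(ids, block_size):
    # Stop at len(ids) - block_size + 1 so only full blocks are emitted;
    # a short trailing remainder is dropped, exactly as in TextDataset.
    return [ids[i:i + block_size]
            for i in range(0, len(ids) - block_size + 1, block_size)]

print(chunk(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]  (8, 9 dropped)
```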
class LineByLineTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
assert os.path.isfile(file_path)
# Here, we do not cache the features, operating under the assumption
# that we will soon use fast multithreaded tokenizers from the
# `tokenizers` repo everywhere =)
logger.info("Creating features from dataset file at %s", file_path)
with open(file_path, encoding="utf-8") as f:
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"]
def __len__(self):
return len(self.examples)
def __getitem__(self, i):
return torch.tensor(self.examples[i], dtype=torch.long)
def load_and_cache_examples(args, tokenizer, evaluate=False):
file_path = args.eval_data_file if evaluate else args.train_data_file
if args.col_data:
return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
split_sent=args.split_sent,
verbose=(args.gpu == 0))
elif args.line_by_line:
return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
else:
return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:
""" Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
if tokenizer.mask_token is None:
raise ValueError(
"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
)
labels = inputs.clone()
# We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
probability_matrix = torch.full(labels.shape, args.mlm_probability)
special_tokens_mask = [
tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
]
probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
if tokenizer._pad_token is not None:
padding_mask = labels.eq(tokenizer.pad_token_id)
probability_matrix.masked_fill_(padding_mask, value=0.0)
masked_indices = torch.bernoulli(probability_matrix).bool()
labels[~masked_indices] = -100 # We only compute loss on masked tokens
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
# 10% of the time, we replace masked input tokens with random word
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
inputs[indices_random] = random_words[indices_random]
# The rest of the time (10% of the time) we keep the masked input tokens unchanged
return inputs, labels
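`mask_tokens` above implements the standard BERT corruption scheme: of the positions selected for masking, 80% become `[MASK]`, 10% become a random token (0.5 of the remaining 20%), and 10% are left unchanged. A statistically equivalent stdlib-only check of those proportions (the sampling here is conditional rather than intersected masks, which gives the same distribution since the Bernoulli draws are independent):

```python
import random

random.seed(0)
n = 200_000
counts = {"mask": 0, "random": 0, "keep": 0}
for _ in range(n):
    # Each trial is a position already selected for masking.
    if random.random() < 0.8:
        counts["mask"] += 1      # replaced with [MASK]
    elif random.random() < 0.5:  # 20% * 50% = 10%
        counts["random"] += 1    # replaced with a random token
    else:
        counts["keep"] += 1      # left unchanged
print({k: round(v / n, 2) for k, v in counts.items()})
```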
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
""" Train the model """
set_seed(args)  # Added here for reproducibility
if args.gpu == 0:
current_time = datetime.now().strftime('%b%d_%H-%M-%S')
tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)
args.train_batch_size = args.per_gpu_train_batch_size
def collate(examples: List[torch.Tensor]):
if tokenizer._pad_token is None:
return pad_sequence(examples, batch_first=True)
return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
if args.shuffle:
logger.info(f"Shuffle the dataset in training,"
f"GPU: {args.gpu},"
f"Rank: {args.rank},"
f"Total: {args.world_size}")
train_sampler = DistributedSampler(
train_dataset,
num_replicas=args.world_size,
rank=args.rank,
shuffle=args.shuffle,
)
train_dataloader = DataLoader(
train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
batch_size=args.train_batch_size, collate_fn=collate, pin_memory=True
)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters,
# betas=(0.9, 0.98),
lr=args.learning_rate,
eps=args.adam_epsilon)
if args.warmup_ratio > 0.:
assert args.warmup_steps == 0
args.warmup_steps = int(t_total * args.warmup_ratio)
if args.gpu == 0:
print("Optimized with lr %f, steps %d, warmup steps %d, and use beta, epsilon %0.8f." % (
args.learning_rate, t_total, args.warmup_steps, optimizer.defaults['eps']
), optimizer.defaults['betas'])
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
# Check if saved optimizer or scheduler states exist
if (
args.model_name_or_path
and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
):
# Load in optimizer and scheduler states
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level,
verbosity=0)
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
else:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.gpu], find_unused_parameters=True
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* args.world_size
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
epochs_trained = 0
# Check if continuing training from a checkpoint
# if args.model_name_or_path and os.path.exists(args.model_name_or_path):
# try:
# # set global_step to gobal_step of last saved checkpoint from model path
# checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
# epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
# steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
# logger.info(" Continuing training from checkpoint, will skip to saved global_step")
# logger.info(" Continuing training from epoch %d", epochs_trained)
# except ValueError:
# logger.info(" Do not load model from %s, restart training" % args.model_name_or_path)
# model_to_resize = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
# model_to_resize.resize_token_embeddings(len(tokenizer))
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
)
for epoch in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad() # Support of accumulating gradients
for step, batch in enumerate(epoch_iterator):
inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
inputs = inputs.to(args.device)
labels = labels.to(args.device)
# If some of the input is padded, then the attention mask is needed
attention_mask = (inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0
if attention_mask.all():
attention_mask = None
if epoch == 0 and step < 3 and args.gpu == 0:
print(inputs.shape)
print(inputs[0])
print(tokenizer.convert_ids_to_tokens(inputs[0].cpu().numpy()))
print(labels[0])
print(attention_mask)
model.train()
outputs = model(inputs,
attention_mask=attention_mask,
masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.max_grad_norm > 0.:
if args.fp16:
total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
# Log metrics
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
if args.fp16:
try:
from apex.amp import _amp_state
tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state; "
                                           "the loss_scale will not be logged.")
if args.max_grad_norm > 0.: # Only clip the grad when it is valid
tb_writer.add_scalar("grad_norm", total_norm, global_step)
tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
logging_loss = tr_loss
if args.max_steps > 0 and global_step >= args.max_steps:
break
# Save it each epoch
if args.gpu == 0:
# Save checkpoints
checkpoint_name = "checkpoint-epoch%04d" % epoch
save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)
last_path = os.path.join(args.output_dir, 'checkpoint-last')
# if os.path.exists(last_path):
# print(last_path)
# os.remove(last_path)
# os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)
# Evaluate the model
        logger.info(" Training loss of Epoch %d: %0.4f" % (epoch, tr_loss / (step + 1)))
logger.info(" Evaluation Results of Epoch %d: " % epoch)
results = evaluate(args, model, tokenizer)
for key, value in results.items():
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
logger.info("\t %s: %0.4f" % (key, value))
output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)
if args.max_steps > 0 and global_step >= args.max_steps:
epoch_iterator.close()
train_iterator.close()
break
if args.gpu == 0:
tb_writer.close()
def save_model(args, name, model, tokenizer, optimizer, scheduler):
# Save model checkpoint
output_dir = os.path.join(args.output_dir, name)
os.makedirs(output_dir, exist_ok=True)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
# torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
# torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
# logger.info("Saving optimizer and scheduler states to %s", output_dir)
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
args.eval_batch_size = args.per_gpu_eval_batch_size
    # SequentialSampler iterates over the eval set in a fixed order
def collate(examples: List[torch.Tensor]):
if tokenizer._pad_token is None:
return pad_sequence(examples, batch_first=True)
return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate
)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating"):
inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
inputs = inputs.to(args.device)
labels = labels.to(args.device)
# If some of the input is padded, then the attention mask is needed
attention_mask = (inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0
if attention_mask.all():
attention_mask = None
with torch.no_grad():
outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
lm_loss = outputs[0]
eval_loss += lm_loss.mean().item()
nb_eval_steps += 1
eval_loss = eval_loss / nb_eval_steps
perplexity = torch.exp(torch.tensor(eval_loss)).item()
result = {"perplexity": perplexity}
return result
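The perplexity returned by `evaluate` above is just the exponential of the mean per-step LM loss. A minimal stand-alone sketch of that relationship (pure Python, not used by the script):

```python
import math

def perplexity_from_losses(step_losses):
    # Mirrors evaluate(): average the per-batch LM losses, then exponentiate.
    mean_loss = sum(step_losses) / len(step_losses)
    return math.exp(mean_loss)

# A model with an average loss of ln(2) nats/token has perplexity 2.
demo_ppl = perplexity_from_losses([math.log(2.0), math.log(2.0)])
```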
def is_port_in_use(port):
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(('localhost', port)) == 0
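`is_port_in_use` probes a port with `connect_ex`, and `main()` below walks upward from 9595 until the probe fails. A small self-contained check of the probe itself (using `127.0.0.1` and an OS-assigned port, so it does not depend on 9595 being free):

```python
import socket

def probe(port):
    # Same probe as is_port_in_use: connect_ex returns 0 when something listens.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('127.0.0.1', port)) == 0

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))    # port 0: let the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
demo_busy = probe(demo_port)     # a listening socket makes the probe succeed
server.close()
```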
def main():
args = process_args()
os.environ['MASTER_ADDR'] = '127.0.0.1'
port = 9595
while is_port_in_use(port):
port += 1
print("Use port", port)
os.environ['MASTER_PORT'] = str(port)
# Using all available gpus for multi-processing distributed
args.gpus = torch.cuda.device_count()
print("Use gpus ", list(range(args.gpus)))
args.world_size = args.gpus * args.nodes
mp.spawn(setup, nprocs=args.gpus, args=(args,))
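`setup()` below derives each process's global rank as `args.nr * args.gpus + gpu`. A tiny sketch of that arithmetic (the names mirror the script's, but this block is illustrative only):

```python
def global_rank(nr, gpus_per_node, gpu):
    # Node `nr` hosting local GPU index `gpu` owns this global rank.
    return nr * gpus_per_node + gpu

# 2 nodes x 4 GPUs: the 8 (node, gpu) pairs cover ranks 0..7 exactly once.
demo_ranks = [global_rank(nr, 4, g) for nr in range(2) for g in range(4)]
```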
def setup(gpu, args):
if args.should_continue:
args.model_name_or_path = 'checkpoint-last'
# Setup CUDA, GPU & distributed training
torch.cuda.set_device(gpu)
device = torch.device("cuda", gpu)
args.gpu = gpu # Local device id.
args.device = device # Local device object.
args.rank = args.nr * args.gpus + gpu # The gpu id in the world.
torch.distributed.init_process_group(
backend="nccl",
init_method='env://',
world_size=args.world_size,
rank=args.rank
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.gpu == 0 else logging.WARN,
)
logger.warning(
"Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
args.gpu, args.gpus, args.fp16,
)
# Set seed
set_seed(args)
    # Load pretrained model and tokenizer
    # Barrier to make sure that only the first process in distributed training
    # downloads the model & vocab
if gpu != 0:
torch.distributed.barrier()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
# Get Config
if args.config_name:
config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir)
elif args.model_name_or_path:
config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
        raise ValueError(
            "No default config is supported; please specify --config_name or --model_name_or_path"
        )
# Get Tokenizer
if args.tokenizer_name:
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
# BERT always needs lower cased tokens.
assert tokenizer.init_kwargs.get("do_lower_case", False)
elif args.model_name_or_path:
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
raise ValueError(
"You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it, "
"and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
)
assert args.block_size <= tokenizer.max_len
if args.model_name_or_path:
model = model_class.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir,
)
else:
logger.info("Training new model from scratch")
model = model_class(config=config)
model.to(args.device)
    # End of barrier: the first process releases the waiting processes after loading
if gpu == 0:
torch.distributed.barrier()
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
# Barrier to make sure only the first process in distributed training process the dataset,
# and the others will use the cache
if gpu != 0:
torch.distributed.barrier()
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
if gpu == 0:
torch.distributed.barrier()
train(args, train_dataset, model, tokenizer)
# Evaluation
if args.do_eval and gpu == 0:
result = evaluate(args, model, tokenizer)
if __name__ == "__main__":
main()
================================================
FILE: vlm/run_vlm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""
from datetime import datetime
import json
import logging
import os
import random
import sys
import time
from typing import Dict, List, Tuple
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
MODEL_WITH_LM_HEAD_MAPPING,
WEIGHTS_NAME,
AdamW,
AutoConfig,
AutoModelWithLMHead,
AutoTokenizer,
BertConfig,
BertForMaskedLM,
BertTokenizer,
CamembertConfig,
CamembertForMaskedLM,
CamembertTokenizer,
DistilBertConfig,
DistilBertForMaskedLM,
DistilBertTokenizer,
GPT2Config,
GPT2LMHeadModel,
GPT2Tokenizer,
OpenAIGPTConfig,
OpenAIGPTLMHeadModel,
OpenAIGPTTokenizer,
PreTrainedModel,
PreTrainedTokenizer,
RobertaConfig,
RobertaForMaskedLM,
RobertaTokenizer,
get_linear_schedule_with_warmup,
)
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from vlm.data import CoLDataset, get_voken_feats
from vlm.param import process_args
from vlm.model import CoLBertConfig, CoLwithBert
try:
from torch.utils.tensorboard import SummaryWriter
except ImportError:
from tensorboardX import SummaryWriter
logger = logging.getLogger(__name__)
MODEL_CLASSES = {
"gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
"openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
"bert": (CoLBertConfig, CoLwithBert, BertTokenizer),
"roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
"camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}
def load_and_cache_examples(args, tokenizer, evaluate=False):
file_path = args.eval_data_file if evaluate else args.train_data_file
return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
split_sent=args.split_sent, voken_dir=args.voken_dir,
suffix=args.voken_suffix,
verbose=(args.gpu == 0),
voken_ablation=args.voken_ablation)
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: PreTrainedTokenizer, args) \
-> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """ Notice that this function has the side effect of modifying the input Tensor `tokens` in place.
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
if tokenizer.mask_token is None:
raise ValueError(
"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
)
labels = tokens.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, which defaults to 0.15 as in BERT/RoBERTa)
probability_matrix = torch.full(labels.shape, args.mlm_probability)
special_tokens_mask = [
tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
]
probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
if tokenizer._pad_token is not None:
padding_mask = labels.eq(tokenizer.pad_token_id)
probability_matrix.masked_fill_(padding_mask, value=0.0)
masked_indices = torch.bernoulli(probability_matrix).bool()
labels[~masked_indices] = -100 # We only compute loss on masked tokens
if args.voken_labels == 'mask':
vokens[~masked_indices] = -100
elif args.voken_labels == 'nonmask':
vokens[masked_indices] = -100
elif args.voken_labels == 'all':
pass
else:
        raise ValueError("Do not support the voken loss of type %s" % args.voken_labels)
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
tokens[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
# 10% of the time, we replace masked input tokens with random word
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
tokens[indices_random] = random_words[indices_random]
# The rest of the time (10% of the time) we keep the masked input tokens unchanged
return tokens, labels, vokens
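The two Bernoulli draws in `mask_tokens` realize the standard BERT 80/10/10 split over masked positions: the first draw (p=0.8) selects the [MASK] replacements, and the second (p=0.5) splits the remaining 20% evenly between random tokens and unchanged tokens. The arithmetic, as a quick sanity check:

```python
p_mask = 0.8                      # first bernoulli: replace with [MASK]
p_random = (1 - p_mask) * 0.5     # second bernoulli over the remainder
p_keep = 1 - p_mask - p_random    # whatever is left stays unchanged
```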
def train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset,
model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    set_seed(args)  # Added here for reproducibility
if args.gpu == 0:
current_time = datetime.now().strftime('%b%d_%H-%M-%S')
tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)
args.train_batch_size = args.per_gpu_train_batch_size
def col_collate(examples):
tokens, vokens = zip(*examples)
if tokenizer._pad_token is None:
tokens = pad_sequence(tokens, batch_first=True)
else:
tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
return tokens, vokens
if args.shuffle:
        logger.info(f"Shuffle the dataset in training, "
                    f"GPU: {args.gpu}, "
                    f"Rank: {args.rank}, "
                    f"Total: {args.world_size}")
train_sampler = DistributedSampler(
train_dataset,
num_replicas=args.world_size,
rank=args.rank,
shuffle=args.shuffle,
)
train_dataloader = DataLoader(
train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
batch_size=args.train_batch_size, collate_fn=col_collate, pin_memory=True
)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
# args.num_train_epochs = 9595
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
if args.lamb:
no_decay = ['bias', 'gamma', 'beta', 'LayerNorm']
else:
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{
"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
if args.lamb:
logger.info(f"Using LAMB Optimizer with max grad norm {args.max_grad_norm}")
import apex
optimizer = apex.optimizers.FusedLAMB(
optimizer_grouped_parameters,
lr=args.learning_rate,
eps=args.adam_epsilon,
max_grad_norm=args.max_grad_norm
)
else:
optimizer = AdamW(optimizer_grouped_parameters,
lr=args.learning_rate,
#betas=(0.9, 0.98),
eps=args.adam_epsilon)
if args.gpu == 0:
print(f"Optimized with lr: {optimizer.defaults['lr']}, total steps: {t_total},"
f" warmup steps: {args.warmup_steps}, epsilon {optimizer.defaults['eps']},"
f" beta: {optimizer.defaults['betas']}, weight decay {args.weight_decay}.")
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
# Check if saved optimizer or scheduler states exist
if (
args.model_name_or_path
and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
):
# Load in optimizer and scheduler states
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
else:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.gpu], find_unused_parameters=True
)
# Allow not calculating the lm heads.
if args.mlm_ratio == 0.:
model.lm_head = None
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* args.world_size
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
epochs_trained = 0
# Check if continuing training from a checkpoint
# if args.model_name_or_path and os.path.exists(args.model_name_or_path):
# try:
# # set global_step to gobal_step of last saved checkpoint from model path
# checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
# epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
# steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
# logger.info(" Continuing training from checkpoint, will skip to saved global_step")
# logger.info(" Continuing training from epoch %d", epochs_trained)
# except ValueError:
# logger.info(" Do not load model from %s, restart training" % args.model_name_or_path)
model_to_resize = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
assert model_to_resize.config.vocab_size == len(tokenizer)
# model_to_resize.resize_token_embeddings(len(tokenizer))
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
)
set_seed(args) # Added here for reproducibility
LOSS_NAMES = ['token_loss', 'voken_loss', 'total_loss']
for epoch in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
tr_loss, logging_loss = np.zeros(len(LOSS_NAMES)), 0.0
model.zero_grad()
for step, (tokens, vokens) in enumerate(epoch_iterator):
token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
token_inputs = token_inputs.to(args.device)
token_labels = token_labels.to(args.device) if args.mlm_ratio != 0. else None
voken_labels = voken_labels.to(args.device)
# If some of the input is padded, then the attention mask is needed
attention_mask = (token_inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0
if attention_mask.all():
attention_mask = None
if epoch == 0 and step < 3 and args.gpu == 0:
print()
print("Token inputs:", token_inputs.shape, token_inputs[0])
print("Token inputs (in str): ", tokenizer.convert_ids_to_tokens(token_inputs[0].cpu().numpy()))
print("Attention Mask:", attention_mask)
print("Token Labels: ", token_labels[0] if token_labels is not None else token_labels)
print("Token Labels (in str): ", tokenizer.convert_ids_to_tokens(token_labels[0].cpu().numpy()) if token_labels is not None else token_labels)
print("Voken Labels: ", voken_labels[0])
print()
model.train()
outputs = model(token_inputs,
attention_mask=attention_mask,
masked_lm_labels=token_labels,
voken_labels=voken_labels)
voken_loss = outputs[0]
token_loss = outputs[1]
if args.mlm_ratio == 0.:
loss = voken_loss
else:
loss = voken_loss + args.mlm_ratio * token_loss
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
# print(f"GPU: {args.gpu}, Global Step: {global_step + 1}, "
# f"Step: {step}, "
# f"Range: {train_dataset.get_item_info(step * args.world_size + args.gpu)}, "
# f"Loss: {loss.item()}, "
# f"Scaled Loss: {scaled_loss.item()}")
tr_loss += np.array((token_loss.item() / args.gradient_accumulation_steps,
voken_loss.item() / args.gradient_accumulation_steps,
loss.item()))
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.max_grad_norm > 0. and not args.lamb:
# Only clip the grad when it is valid and not using LAMB optimizer,
# because the LAMB optimizer already apply grad clipping
if args.fp16:
total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
elif args.max_grad_norm <= 0. and step <= args.gradient_accumulation_steps:
                    logger.warning("Gradient clipping is skipped because "
                                   "max_grad_norm is set to %0.2f" % args.max_grad_norm)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
# Log metrics
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
if args.fp16:
try:
from apex.amp import _amp_state
tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state; "
                                           "the loss_scale will not be logged.")
if args.max_grad_norm > 0. and not args.lamb: # Only clip the grad when it is valid
tb_writer.add_scalar("grad_norm", total_norm, global_step)
interval_loss = (tr_loss - logging_loss) / args.logging_steps
for loss_idx, loss_name in enumerate(LOSS_NAMES):
tb_writer.add_scalar(loss_name, interval_loss[loss_idx], global_step)
logging_loss = tr_loss.copy()
if args.max_steps > 0 and global_step >= args.max_steps:
break
# if step == 200:
# break
#
# Save it each epoch
if args.gpu == 0:
# Save checkpoints
checkpoint_name = "checkpoint-epoch%04d" % epoch
save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)
# last_path = os.path.join(args.output_dir, 'checkpoint-last')
# if os.path.exists(last_path):
# os.remove(last_path)
# os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)
# Evaluate the model
for loss_idx, loss_name in enumerate(LOSS_NAMES):
logger.info(" Training %s of Epoch %d: %0.4f" % (
loss_name, epoch, tr_loss[loss_idx] / len(train_dataloader)))
if args.do_eval:
logger.info(" Evaluation Results of Epoch %d: " % epoch)
old_eval_batch_size = args.per_gpu_eval_batch_size
while args.per_gpu_eval_batch_size > 0:
try:
results = evaluate(args, valid_dataset, model, tokenizer)
break
except RuntimeError as e:
args.per_gpu_eval_batch_size = int(args.per_gpu_eval_batch_size / 2)
                    print("Halving the eval batch size after a RuntimeError (likely out of memory).")
if args.per_gpu_eval_batch_size == 0:
raise e
time.sleep(5)
args.per_gpu_eval_batch_size = old_eval_batch_size
for key, value in results.items():
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
logger.info("\t %s: %0.4f" % (key, value))
tb_writer.add_scalar("epoch", epoch, global_step)
output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)
# Currently, only GPU 0 is responsible for the evaluation.
# torch.cuda.empty_cache()
# torch.distributed.barrier()
else:
pass
# torch.cuda.empty_cache()
# torch.distributed.barrier()
if args.max_steps > 0 and global_step >= args.max_steps:
epoch_iterator.close()
train_iterator.close()
break
if args.gpu == 0:
tb_writer.close()
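In `train()` above, the joint objective is `voken_loss + mlm_ratio * token_loss`, and the LM head is skipped entirely when `mlm_ratio == 0`. A minimal sketch of that mixing rule (illustrative only; the real losses are torch tensors):

```python
def combined_loss(voken_loss, token_loss, mlm_ratio):
    # mlm_ratio == 0. means the token (MLM) objective is disabled.
    if mlm_ratio == 0.0:
        return voken_loss
    return voken_loss + mlm_ratio * token_loss

demo_loss = combined_loss(2.0, 3.0, 0.5)   # 2.0 + 0.5 * 3.0
```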
def save_model(args, name, model, tokenizer, optimizer, scheduler):
# Save model checkpoint
output_dir = os.path.join(args.output_dir, name)
os.makedirs(output_dir, exist_ok=True)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
# torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
# torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
# logger.info("Saving optimizer and scheduler states to %s", output_dir)
def evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
torch.cuda.empty_cache()
# # Loop to handle MNLI double evaluation (matched, mis-matched)
# eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
args.eval_batch_size = args.per_gpu_eval_batch_size
    # SequentialSampler iterates over the eval set in a fixed order
def col_collate(examples):
tokens, vokens = zip(*examples)
if tokenizer._pad_token is None:
tokens = pad_sequence(tokens, batch_first=True)
else:
tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
return tokens, vokens
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=col_collate
)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
total_token_loss = 0.0
total_voken_loss = 0.0
nb_eval_steps = 0
model.eval()
for tokens, vokens in tqdm(eval_dataloader, desc="Evaluating"):
token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
token_inputs = token_inputs.to(args.device)
token_labels = token_labels.to(args.device) if args.mlm_ratio != 0 else None
voken_labels = voken_labels.to(args.device)
# If some of the input is padded, then the attention mask is needed
attention_mask = (token_inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0
if attention_mask.all():
attention_mask = None
with torch.no_grad():
outputs = model(token_inputs,
attention_mask=attention_mask,
masked_lm_labels=token_labels,
voken_labels=voken_labels)
voken_loss = outputs[0]
token_loss = outputs[1]
total_voken_loss += voken_loss.item()
total_token_loss += token_loss.item()
nb_eval_steps += 1
total_token_loss = total_token_loss / nb_eval_steps
perplexity = torch.exp(torch.tensor(total_token_loss)).item()
result = {"perplexity": perplexity,
"voken_loss": total_voken_loss / nb_eval_steps}
torch.cuda.empty_cache()
return result
def is_port_in_use(port):
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(('localhost', port)) == 0
def main():
args = process_args()
os.environ['MASTER_ADDR'] = '127.0.0.1'
port = 9595
while is_port_in_use(port):
port += 1
print("Use port", port)
os.environ['MASTER_PORT'] = str(port)
# Using all available gpus for multi-processing distributed
args.gpus = torch.cuda.device_count()
print("Use gpus ", list(range(args.gpus)))
args.world_size = args.gpus * args.nodes
mp.spawn(setup, nprocs=args.gpus, args=(args,))
def setup(gpu, args):
if args.should_continue:
args.model_name_or_path = 'checkpoint-last'
# Setup CUDA, GPU & distributed training
torch.cuda.set_device(gpu)
device = torch.device("cuda", gpu)
args.gpu = gpu # Local device id.
args.device = device # Local device object.
args.rank = args.nr * args.gpus + gpu # The gpu id in the world.
torch.distributed.init_process_group(
backend="nccl",
init_method='env://',
world_size=args.world_size,
rank=args.rank
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.gpu == 0 else logging.WARN,
)
logger.warning(
"Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
args.gpu, args.gpus, args.fp16,
)
# Set seed
set_seed(args)
    # Load pretrained model and tokenizer
    # Barrier to make sure that only the first process in distributed training
    # downloads the model & vocab
if gpu != 0:
torch.distributed.barrier()
    # Use the self-defined model classes here, thus avoiding the Auto* classes.
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
    # Next, we initialize the training components in the following order:
    #   1. tokenizer --> 2. dataset --> 3. config --> 4. model,
    # because A) the dataset relies on tokenizer.special_tokens, and
    #         B) the config relies on dataset.voken_size.
# Get Tokenizer
if args.tokenizer_name:
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
elif args.model_name_or_path:
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
raise ValueError(
"You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it, "
"and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
)
assert args.block_size <= tokenizer.max_len
# Barrier to make sure only the first process in distributed training process the dataset,
# and the others will use the cache
if gpu != 0:
torch.distributed.barrier()
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
valid_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
if gpu == 0:
torch.distributed.barrier()
    # Assert that the train and valid datasets share the same voken set.
valid_dataset.assert_equal_vokens(train_dataset)
config_kwargs = {}
if args.do_voken_reg or args.do_voken_ctr:
assert args.voken_feat_dir is not None
voken_feats = get_voken_feats(train_dataset, args.voken_feat_dir)
config_kwargs['voken_dim'] = len(voken_feats[0])
if gpu == 0:
            logger.info(f"Load voken feats from {args.voken_feat_dir} "
                        f"with {len(voken_feats)} features and dimension {len(voken_feats[0])}")
# Get Config
if args.config_name:
config = config_class.from_pretrained(
args.config_name,
cache_dir=args.cache_dir,
voken_size=train_dataset.voken_size,
do_voken_cls=args.do_voken_cls,
do_voken_reg=args.do_voken_reg,
do_voken_ctr=args.do_voken_ctr,
shared_head=args.shared_head,
verbose=(args.gpu == 0),
**config_kwargs
)
elif args.model_name_or_path:
config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
        raise ValueError(
            "No default config is supported; please specify --config_name or --model_name_or_path"
        )
if args.model_name_or_path:
logger.info(f"Training model from the weight {args.model_name_or_path}.")
model = model_class.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir,
)
else:
logger.info("Training new model from scratch")
model = model_class(config=config)
if args.do_voken_reg or args.do_voken_ctr:
voken_feats = torch.tensor(voken_feats)
model.init_voken_feat_emb(voken_feats)
model.to(args.device)
    # End of barrier: the first process releases the waiting processes after loading
if gpu == 0:
torch.distributed.barrier()
if args.model_name_or_path:
if gpu == 0:
logger.info("Evaluate the performance of the loaded model.")
results = evaluate(args, valid_dataset, model, tokenizer)
for key, value in results.items():
logger.info("\t %s: %0.4f" % (key, value))
torch.distributed.barrier()
else:
torch.distributed.barrier()
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train(args, train_dataset, valid_dataset, model, tokenizer)
# Evaluation
if args.do_eval and gpu == 0:
results = evaluate(args, valid_dataset, model, tokenizer)
for key, value in results.items():
logger.info("\t %s: %0.4f" % (key, value))
if __name__ == "__main__":
main()
================================================
FILE: vlm/show_glue_results_epochs.py
================================================
import os
from pathlib import Path
root = Path(
'snap'
)
task2major = {
'QQP': 'acc_and_f1',
'STS-B': 'corr',
'MRPC': 'acc_and_f1',
}
# The tasks sorted by the amount of data
all_tasks = [
# 'WNLI',
'RTE',
'MRPC',
'STS-B',
'CoLA',
'SST-2',
'QNLI',
'QQP',
'MNLI',
'MNLI-MM',
]
def print_result(glue_dir):
print(glue_dir)
results = {}
for task in glue_dir.iterdir():
if task.is_dir():
eval_fpath = task / 'eval_results.txt'
task_name = task.name
if eval_fpath.exists():
with eval_fpath.open() as f:
for line in f:
metric, value = line.split('=')
metric = metric.strip()
value = float(value.strip())
if task_name in task2major:
if metric == task2major[task_name]:
results[task_name] = value
else:
results[task_name] = value
if len(results) > 0:
# sorted_keys = sorted(list(results.keys()))
# for key in sorted_keys:
# print("%8s" % key, end='')
# print("%8s" % 'GLUE', end='')
# print()
# for key in sorted_keys:
# print("%8.2f" % (results[key] * 100.), end='')
# print("%8.2f" % (sum(results.values()) * 100. / len(results)), end='')
# print()
for task in all_tasks:
print("%8s" % task, end='')
print("%8s" % 'GLUE', end='')
print()
for task in all_tasks:
if task in results:
result = results[task]
print("%8.2f" % (result * 100), end='')
else:
print(" " * 8, end='')
mean = lambda x: sum(x) / max(len(x), 1)
avg_result = mean([value for key, value in results.items() if key in all_tasks])
print("%8.2f" % (avg_result * 100.), end='')
print()
def search(path):
def sorted_key(path):
try:
return path.stat().st_mtime
except Exception:
return 0.
path_list = sorted(
path.iterdir(),
key=sorted_key
# x.name
)
for subdir in path_list:
if subdir.is_dir():
if 'glueepoch_' in subdir.name:
print_result(subdir)
else:
search(subdir)
search(root)
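The parsing in `print_result` above assumes each line of `eval_results.txt` has the form `metric = value`, keeping only the "major" metric for multi-metric tasks. A minimal, self-contained sketch of that logic (the sample lines below are made up; real files are written by the GLUE runner):

```python
# Minimal sketch of the "metric = value" parsing in print_result above.
# The sample lines are illustrative, not taken from a real eval_results.txt.
task2major = {'QQP': 'acc_and_f1', 'STS-B': 'corr', 'MRPC': 'acc_and_f1'}

def parse_eval_results(lines, task_name):
    """Keep the major metric for multi-metric tasks, else the last metric seen."""
    result = None
    for line in lines:
        metric, value = line.split('=')
        metric, value = metric.strip(), float(value.strip())
        if task_name in task2major:
            if metric == task2major[task_name]:
                result = value
        else:
            result = value
    return result

assert parse_eval_results(['acc = 0.88', 'f1 = 0.84', 'acc_and_f1 = 0.86'], 'MRPC') == 0.86
```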
================================================
FILE: vokenization/__init__.py
================================================
================================================
FILE: vokenization/common.py
================================================
import os
# Name of image sets
IMAGE_SETS = [
'coco_train',
'coco_nominival',
'coco_minival',
'vg_nococo',
'cc_train',
'cc_valid',
]
# Root of each dataset
# CC_ROOT, COCO_ROOT, VG_ROOT should contain the `images` folder
# CC_ROOT -- images
# |-- training
# |-- training_00009486 # JPEG files without the file extension.
# |-- ....
# |-- validation
# |-- validation_00009486
# |-- ...
# CC_ROOT = os.getenv('CC_ROOT', 'data/cc')
# COCO_ROOT = os.getenv('COCO_ROOT', 'data/mscoco')
# VG_ROOT = os.getenv('VG_ROOT', 'data/vg')
# LXRT_ROOT = os.getenv('LXRT_ROOT', 'data/lxrt')
CC_ROOT = 'data/cc'
COCO_ROOT = 'data/mscoco'
VG_ROOT = 'data/vg'
LXRT_ROOT = 'data/lxmert'
# The local directory to save essential image info
# (e.g., image ids for the vokenizer, image paths on this server)
# LOCAL_DIR
# |- images
# |- coco_train_ids.txt
# |- coco_train_paths.txt
# |- cc_train_ids.txt
# |- cc_train_paths.txt
# |- ..............
# Running create_image_ids.py will build *_ids.txt
# Running extract_vision_keys.py will build *_paths.txt
LOCAL_DIR = 'data/vokenization'
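The commented-out `os.getenv` lines above suggest the dataset roots were once configurable through the environment. A sketch of that fallback pattern (the `dataset_root` helper and the `/ssd/vg` override are hypothetical, shown only to illustrate the behavior):

```python
import os

# Resolve a dataset root from the environment, falling back to the in-repo
# default -- mirroring the commented-out os.getenv lines in common.py.
def dataset_root(env_var, default):
    return os.getenv(env_var, default)

os.environ['VG_ROOT'] = '/ssd/vg'              # hypothetical override
assert dataset_root('VG_ROOT', 'data/vg') == '/ssd/vg'
os.environ.pop('VG_ROOT')
assert dataset_root('VG_ROOT', 'data/vg') == 'data/vg'
```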
================================================
FILE: vokenization/create_image_ids.py
================================================
import json
import os
from pathlib import Path
import sys
# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import common
imgset2lxrtfname = {
'coco_train': 'mscoco_train.json',
'coco_nominival': 'mscoco_nominival.json',
'coco_minival': 'mscoco_minival.json',
'vg_nococo': 'vgnococo.json',
}
imgset2ccfname = {
'cc_train': 'training.tsv',
'cc_valid': 'validation.tsv'
}
def write_ids(img_set, img_ids):
"""
Write the indexed image ids 'img_ids' for image set 'img_set' to
the local file.
"""
info_dir = os.path.join(common.LOCAL_DIR, 'images')
os.makedirs(info_dir, exist_ok=True)
print("Write %d image ids for image set %s to %s." % (
len(img_ids), img_set, os.path.join(info_dir, img_set + '.ids')))
ids_path = os.path.join(info_dir, img_set + '.ids')
if os.path.exists(ids_path):
# If there is an existing ids_path, make sure that they are the same.
        print(f"Image ids for image set {img_set} already exist at path {ids_path}.")
        print("Now, make sure that they are equal:")
with open(ids_path, 'r') as f:
exist_img_ids = list(map(lambda x: x.strip(), f.readlines()))
success = True
        for i, (exist_img_id, img_id) in enumerate(zip(exist_img_ids, img_ids)):
if exist_img_id != img_id:
print(f"The image id at line {i} is different:")
print(f"\tIn the file: {exist_img_id}, In this script: {img_id}")
success = False
if success:
print("PASS!")
else:
with open(ids_path, 'w') as f:
for img_id in img_ids:
f.write(img_id + '\n')
for img_set in common.IMAGE_SETS:
if img_set in imgset2lxrtfname:
lxrt_path = Path(common.LXRT_ROOT)
img_ids = []
fname = imgset2lxrtfname[img_set]
for datum in json.load((lxrt_path / fname).open()):
img_id = datum['img_id']
img_ids.append(img_id)
write_ids(img_set, img_ids)
if img_set in imgset2ccfname:
cc_path = Path(common.CC_ROOT)
img_ids = []
fname = imgset2ccfname[img_set]
if not (cc_path / fname).exists():
print("No such file", cc_path / fname)
continue
for i, line in enumerate((cc_path / fname).open()):
sent, img_id = line.split('\t')
img_ids.append(img_id.strip())
write_ids(img_set, img_ids)
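The consistency check inside `write_ids` compares an existing ids file line-by-line against the freshly built id list. The same comparison, isolated so it runs without any files (in-memory lists stand in for the ids file; `ids_match` is a name introduced here for illustration):

```python
# The line-by-line comparison from write_ids, isolated from file I/O.
def ids_match(exist_img_ids, img_ids):
    """Compare two id lists line by line; report and fail on any mismatch."""
    success = True
    for i, (exist_img_id, img_id) in enumerate(zip(exist_img_ids, img_ids)):
        if exist_img_id != img_id:
            print(f"The image id at line {i} is different: "
                  f"{exist_img_id} vs. {img_id}")
            success = False
    return success

assert ids_match(['a', 'b'], ['a', 'b'])
assert not ids_match(['a', 'b'], ['a', 'c'])
```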
================================================
FILE: vokenization/evaluate_diversity.py
================================================
import argparse
from collections import defaultdict
import json
import os
import sys
import numpy as np
import tqdm
from vokenization import Vokenizer, load_model_and_tokenizer
import common
imgset2fname = {
'coco_train': 'mscoco_train.json',
'coco_nominival': 'mscoco_nominival.json',
'coco_minival': 'mscoco_minival.json',
'vg_nococo': 'vgnococo.json',
'cc_train': 'training.tsv',
'cc_valid': 'validation.tsv',
}
tokenizer_name = 'bert-base-uncased'
def load_lang_data(corpus_name, topk=10000):
"""
Load {topk} sentences from the corpus named by {corpus_name}.
"""
fpath = corpus_name + '.' + tokenizer_name
tokens = []
with open(fpath) as f:
for i, line in enumerate(f):
tokens.append(list(map(int, line.split(' '))))
if (i + 1) == topk:
break
print("Read %d sentences from the corpus %s located at %s." % (
len(tokens), corpus_name, fpath
))
return tokens
def load_cc_data(img_set):
fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
sents = []
with open(fname) as f:
for line in f:
sent, _ = line.split('\t')
sents.append(sent)
print("Load the %d sentences for image set %s from %s" % (
len(sents), img_set, fname))
return sents
def load_lxrt_data(img_set):
fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
sents = []
with open(fname) as f:
data = json.load(f)
for datum in data:
sents.extend(datum['sentf']['mscoco'])
print("Load the %d sentences for image set %s from %s" % (
len(sents), img_set, fname))
return sents
def analyze(token2info):
"""
:param token2info: token2info: token --> (img_id --> cnt)
:return:
"""
names = ['Num Images', 'Max Cnt', 'Avg Cnt', 'Std Cnt']
results = np.zeros(4)
num_tokens = 0
for token in token2info:
img2cnt = token2info[token]
cnts = np.array(list(img2cnt.values()))
num_imgs = len(cnts)
max_cnt = cnts.max()
avg_cnt = cnts.mean()
std_cnt = cnts.std()
results += (num_imgs, max_cnt, avg_cnt, std_cnt)
num_tokens += 1
print("With %d tokens, " % num_tokens)
results /= num_tokens
for name, result in zip(names, results):
print("Average of %s is %0.2f" % (name, result))
corpus_info = defaultdict(lambda: 0)
for info in token2info.values():
for img, cnt in info.items():
corpus_info[img] += cnt
print("Cover %d images" % len(corpus_info))
# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory that saved the model (containing '
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
help='The splits of images to be extracted')
parser.add_argument('--corpus', type=str, default='wiki103',
help='Evaluated corpus')
parser.add_argument('--maxsents', type=int, default=10000,
help='The maximum sentences to be evaluated in the corpus')
args = parser.parse_args()
keys_path = os.path.join(args.load, 'keys')
print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')
vokenizer = Vokenizer(model, tokenizer, keys_path, img_sets)
corpus_list = args.corpus.split(',')
for corpus in corpus_list:
corpus = corpus.strip()
print("\nProcessing corpus %s for diversity test:" % corpus)
# token2info: token --> (img_id --> cnt)
token2info = defaultdict(lambda: defaultdict(lambda: 0))
if corpus in imgset2fname:
if 'cc' in corpus:
sents = load_cc_data(corpus)
else:
sents = load_lxrt_data(corpus)
batch_size = 32
for start_id in tqdm.tqdm(range(0, len(sents), batch_size)):
batch_sents = sents[start_id: start_id + batch_size]
scores, ids, tokens, paths = vokenizer.vokenize_sents(batch_sents, topk=None)
for i in range(len(paths)):
for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
token2info[token][path] += 1
else:
tokens_list = load_lang_data(corpus, args.maxsents)
batch_size = 16
for start_id in tqdm.tqdm(range(0, len(tokens_list), batch_size)):
batch_tokens = tokens_list[start_id: start_id + batch_size]
scores, ids, tokens, paths = vokenizer.vokenize_ids(batch_tokens, topk=None)
for i in range(len(paths)):
for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
token2info[token][path] += 1
analyze(token2info)
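`analyze` above averages per-token image statistics and then counts corpus-level image coverage. The nested `token2info` structure can be exercised on toy counts (the tokens and image ids below are made up for illustration):

```python
from collections import defaultdict

# token --> (img_id --> cnt), the nesting built by the vokenization loops above.
# The tokens and image ids here are made up.
token2info = defaultdict(lambda: defaultdict(lambda: 0))
for token, img in [('dog', 'img1'), ('dog', 'img1'),
                   ('dog', 'img2'), ('cat', 'img3')]:
    token2info[token][img] += 1

# Per-token statistics, mirroring the body of analyze().
cnts = list(token2info['dog'].values())   # [2, 1]
assert len(cnts) == 2                     # 'dog' maps to two distinct images
assert max(cnts) == 2 and sum(cnts) == 3

# Corpus-level coverage, as computed at the end of analyze().
corpus_info = defaultdict(lambda: 0)
for info in token2info.values():
    for img, cnt in info.items():
        corpus_info[img] += cnt
assert len(corpus_info) == 3              # three distinct images covered
```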
================================================
FILE: vokenization/evaluate_retrieval.py
================================================
import argparse
from collections import defaultdict
import json
import os
import tqdm
from vokenization import Vokenizer, load_model_and_tokenizer
import common
imgset2fname = {
'coco_train': 'mscoco_train.json',
'coco_nominival': 'mscoco_nominival.json',
'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
'cc_train': 'training.tsv',
'cc_valid': 'validation.tsv',
}
def load_cc_data(img_set):
fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
sentXimgname = []
with open(fname) as f:
for line in f:
sent, gt_img_name = line.split('\t')
gt_img_name = gt_img_name.strip()
sentXimgname.append((sent, gt_img_name))
print("Load the %d (img, sent) pairs for image set %s from %s" % (
len(sentXimgname), img_set, fname))
return sentXimgname
def load_lxrt_data(img_set):
fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
sentXimgname = []
with open(fname) as f:
data = json.load(f)
for datum in data:
gt_img_name = datum['img_id'] + '.jpg'
sents = datum['sentf']['mscoco']
for sent in sents:
sentXimgname.append((sent, gt_img_name))
print("Load the %d (img, sent) pairs for image set %s from %s" % (
len(sentXimgname), img_set, fname))
return sentXimgname
# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory that saved the model (containing '
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
help='The splits of images to be extracted')
args = parser.parse_args()
keys_path = os.path.join(args.load, 'keys')
print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')
sent_level = 'sent' in args.load
for img_set in img_sets:
vokenizer = Vokenizer(model, tokenizer, keys_path, [img_set],
sent_level=sent_level)
if 'cc' in img_set:
sentXimgname = load_cc_data(img_set)
else:
sentXimgname = load_lxrt_data(img_set)
topks = [1, 5, 10]
print("\nEvaluate image set", img_set, "for topk retrieval:", topks)
total = 0
arg_topk = None if max(topks) == 1 else max(topks)
results = defaultdict(lambda: 0)
batch_size = 32
for start_id in tqdm.tqdm(range(0, len(sentXimgname), batch_size)):
batch_sentXimg = sentXimgname[start_id: start_id + batch_size]
sents, gt_img_names = zip(*batch_sentXimg)
sents = list(sents)
scores, ids, tokens, paths_list = vokenizer.vokenize_sents(sents, topk=arg_topk)
if sent_level:
paths_list = [x[:3] for x in paths_list] # Only eval the first vokens.
if arg_topk is None:
paths_list = [[[img_id] for img_id in sent] for sent in paths_list]
for paths, gt_img_name in zip(paths_list, gt_img_names): # for each sent in batch
for topk_paths in paths[1:-1]: # for each token in sent
for k, kth_path in enumerate(topk_paths): # for each img_path in topk image paths of a token
img_name = os.path.split(kth_path)[-1]
if img_name == gt_img_name:
results[k + 1] += 1
total += sum(map(lambda x: len(x) - 2, paths_list))
accumulate = 0
for i in range(1, max(topks)+1):
accumulate += results[i]
if i in topks:
print("R%d: %0.2f%%, (Random: %0.4f%%)" % (
i,
accumulate / total * 100.,
i / vokenizer.img_num * 100.
))
del vokenizer
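The R@k numbers printed above are cumulative hit counts at each rank, normalized by the number of (token, image) queries. The bookkeeping can be sketched on made-up retrieval ranks (`gt_ranks` below is illustrative data, not output of the script):

```python
from collections import defaultdict

# results[k] counts queries whose ground-truth image appeared at rank k
# (1-based), mirroring the loop over topk_paths above.  Ranks are made up.
topks = [1, 5]
gt_ranks = [1, 3, 2, 1, None]             # None: ground truth not retrieved
results = defaultdict(lambda: 0)
for rank in gt_ranks:
    if rank is not None:
        results[rank] += 1
total = len(gt_ranks)

# R@k is the cumulative hit count up to rank k over the number of queries.
accumulate = 0
recalls = {}
for i in range(1, max(topks) + 1):
    accumulate += results[i]
    if i in topks:
        recalls[i] = accumulate / total

assert recalls == {1: 0.4, 5: 0.8}
```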
================================================
FILE: vokenization/extract_vision_keys.py
================================================
# In this file, we extract the vision features as the keys in retrieval.
import argparse
import os
import pickle
import shutil
import sys
import h5py
import torch
from torchvision import transforms
from torchvision.datasets.folder import default_loader
import tqdm
from transformers import BertTokenizer
from PIL import Image
import common
# Load all images
Image.MAX_IMAGE_PIXELS = None
def get_img_path(img_set, img_id):
"""
Get the paths regarding the img_set and img_id.
THIS FUNCTION MIGHT NEED TO BE MODIFIED.
"""
source, tag = img_set.split('_')
if source == 'cc':
split_tag, _ = img_id.split('_')
return "%s/images/%s/%s" % (common.CC_ROOT, split_tag, img_id)
elif 'COCO' in img_id:
_, split_tag, _ = img_id.split('_')
return "%s/images/%s/%s" % (common.COCO_ROOT, split_tag, img_id + '.jpg')
else: # VG images
return "%s/images/%s.jpg" % (common.VG_ROOT, img_id)
def get_img_paths_and_ids(img_set):
"""
Return a list of images paths and image ids in this 'img_set'.
"""
    # Load the image ids from the common local dir
    # to make sure that the order of the images is the same.
info_dir = os.path.join(common.LOCAL_DIR, 'images')
img_paths = []
with open(os.path.join(info_dir, img_set + '.ids')) as f:
img_ids = list(map(lambda x: x.strip(), f.readlines()))
for img_id in img_ids:
img_paths.append(get_img_path(img_set, img_id))
return img_paths, img_ids
def save_img_paths_and_ids(img_set, img_paths, img_ids, output):
info_dir = os.path.join(common.LOCAL_DIR, 'images')
# Save
"chars": 1173,
"preview": "GPUS=$1\n# The name of experiment\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -"
},
{
"path": "scripts/extract_keys.bash",
"chars": 179,
"preview": "CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \\\n --image-sets vg_nococo,coco_minival,coco_nomini"
},
{
"path": "scripts/mpvokenize_wiki.bash",
"chars": 387,
"preview": "GPU=$1\n\nLOAD=snap/xmatching/$2\nDATA_DIR=data/wiki-cased\nTOKENIZER=bert-base-uncased\n\nfor DATA_NAME in en.valid.raw en.te"
},
{
"path": "scripts/mpvokenize_wiki103.bash",
"chars": 395,
"preview": "GPU=$1\n\nLOAD=snap/xmatching/$2\nWIKI_DIR=data/wiki103-cased\nTOKENIZER=bert-base-uncased\n\nfor DATA_NAME in wiki.valid.raw "
},
{
"path": "scripts/run_glue_at_epoch.bash",
"chars": 766,
"preview": "export GLUE_DIR=data/glue/\nEPOCHS=$2\nMODEL=$3\nCKPT=$4\n\nfor TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI\ndo\n"
},
{
"path": "scripts/run_glue_epochs.bash",
"chars": 90,
"preview": "GPUS=$1\nMODEL=$2\n \npython vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \\\n ${@:3}\n\n"
},
{
"path": "scripts/run_xmatching.bash",
"chars": 662,
"preview": "GPUS=$1\n# The name of experiment\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/xmatching/$NAME\nmkdir -p $output/src"
},
{
"path": "scripts/small_vlm_wiki103.bash",
"chars": 1257,
"preview": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/vlm/$NAME\nmkdir -p $output/src\ncp -r"
},
{
"path": "scripts/small_vlm_wiki103_glue.bash",
"chars": 1359,
"preview": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/vlm/$NAME\nmkdir -p $output/src\ncp -r"
},
{
"path": "scripts/small_wiki103.bash",
"chars": 974,
"preview": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -"
},
{
"path": "scripts/small_wiki103_glue.bash",
"chars": 1075,
"preview": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -"
},
{
"path": "scripts/xmatching_benchmark.bash",
"chars": 1079,
"preview": "# Benchmarking the cross-modal matching model with\n# 1. Retrieval scores.\n# 2. Voken Diversity w.r.t words in sp"
},
{
"path": "snap/bert/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "snap/vlm/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "snap/xmatching/.gitkeep",
"chars": 3,
"preview": "/*\n"
},
{
"path": "tokenization/to_hdf5.py",
"chars": 3162,
"preview": "import h5py\nimport numpy as np\nimport tqdm\n\nfrom transformers import AutoTokenizer\n\n\ndef validate_hdf5(fname, tokenizer_"
},
{
"path": "tokenization/tokenize_dataset.py",
"chars": 3636,
"preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nfrom pathlib import Path\n\nfrom transformers import AutoToke"
},
{
"path": "tokenization/tokenize_wiki103_bert.bash",
"chars": 283,
"preview": "DATA_DIR=data/wiki103-cased\nTOKENIZER=bert-base-uncased\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw"
},
{
"path": "tokenization/tokenize_wiki103_roberta.bash",
"chars": 278,
"preview": "DATA_DIR=data/wiki103-cased\nTOKENIZER=roberta-base\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOK"
},
{
"path": "tokenization/tokenize_wiki_bert.bash",
"chars": 274,
"preview": "DATA_DIR=data/wiki-cased\nTOKENIZER=bert-base-uncased\npython tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOK"
},
{
"path": "tokenization/tokenize_wiki_roberta.bash",
"chars": 282,
"preview": "DATA_DIR=data/wiki-cased-untokenized/\nTOKENIZER=roberta-base\npython tokenization/tokenize_dataset.py $DATA_DIR en.valid."
},
{
"path": "vlm/__init__.py",
"chars": 12,
"preview": "import data\n"
},
{
"path": "vlm/configs/bert-12L-768H.json",
"chars": 361,
"preview": "{\n \"architectures\": [\n \"BertForMaskedLM\"\n ],\n \"attention_probs_dropout_prob\": 0.1,\n \"hidden_act\": \"gelu\",\n \"hidd"
},
{
"path": "vlm/configs/bert-4L-768H.json",
"chars": 360,
"preview": "{\n \"architectures\": [\n \"BertForMaskedLM\"\n ],\n \"attention_probs_dropout_prob\": 0.1,\n \"hidden_act\": \"gelu\",\n \"hidd"
},
{
"path": "vlm/configs/bert-6L-512H.json",
"chars": 359,
"preview": "{\n \"architectures\": [\n \"BertForMaskedLM\"\n ],\n \"attention_probs_dropout_prob\": 0.1,\n \"hidden_act\": \"gelu\",\n \"hidd"
},
{
"path": "vlm/configs/bert_base.json",
"chars": 361,
"preview": "{\n \"architectures\": [\n \"BertForMaskedLM\"\n ],\n \"attention_probs_dropout_prob\": 0.1,\n \"hidden_act\": \"gelu\",\n \"hidd"
},
{
"path": "vlm/data.py",
"chars": 8248,
"preview": "import copy\nimport os\nimport random\n\nimport h5py\nimport torch\nfrom torch.utils.data import DataLoader, Dataset\nimport tq"
},
{
"path": "vlm/model.py",
"chars": 15704,
"preview": "import math\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss\nf"
},
{
"path": "vlm/param.py",
"chars": 7174,
"preview": "import argparse\n\n\ndef process_args():\n parser = argparse.ArgumentParser()\n\n # Datasets\n parser.add_argument(\n "
},
{
"path": "vlm/run_glue.py",
"chars": 30027,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "vlm/run_glue_epochs.py",
"chars": 3812,
"preview": "import argparse\nimport math\nimport os\nfrom pathlib import Path\nfrom pprint import pprint\nimport subprocess\nimport thread"
},
{
"path": "vlm/run_lm_distributed.py",
"chars": 25616,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "vlm/run_vlm_distributed.py",
"chars": 29458,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "vlm/show_glue_results_epochs.py",
"chars": 2476,
"preview": "import os\nfrom pathlib import Path\n\nroot = Path(\n 'snap'\n)\n\ntask2major = {\n 'QQP': 'acc_and_f1',\n 'STS-B': 'cor"
},
{
"path": "vokenization/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vokenization/common.py",
"chars": 1258,
"preview": "import os\n\n# Name of image sets\nIMAGE_SETS = [\n 'coco_train',\n 'coco_nominival',\n 'coco_minival',\n 'vg_nococ"
},
{
"path": "vokenization/create_image_ids.py",
"chars": 2483,
"preview": "import json\nimport os\nfrom pathlib import Path\nimport sys\n\n# sys.path.append(os.path.dirname(os.path.dirname(os.path.abs"
},
{
"path": "vokenization/evaluate_diversity.py",
"chars": 5008,
"preview": "import argparse\nfrom collections import defaultdict\nimport json\nimport os\nimport sys\n\nimport numpy as np\nimport tqdm\n\nfr"
},
{
"path": "vokenization/evaluate_retrieval.py",
"chars": 4016,
"preview": "import argparse\nfrom collections import defaultdict\nimport json\nimport os\n\nimport tqdm\n\nfrom vokenization import Vokeniz"
},
{
"path": "vokenization/extract_vision_keys.py",
"chars": 10853,
"preview": "# In this file, we extract the vision features as the keys in retrieval.\nimport argparse\nimport os\nimport pickle\nimport "
},
{
"path": "vokenization/indexing.py",
"chars": 4159,
"preview": "import numpy as np\nimport torch\nimport tqdm\n\n\nclass GPUIndexer(object):\n def __init__(self, keys, gpus=(0,), fp16=Fal"
},
{
"path": "vokenization/revokenization.py",
"chars": 13460,
"preview": "# Copyleft 2020 project COL.\n\nfrom transformers import AutoTokenizer\n\n\nclass ReVokenizer:\n \"\"\"\n Convert a\n \"\"\"\n"
},
{
"path": "vokenization/revokenize_corpus_mp.py",
"chars": 13486,
"preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimpo"
},
{
"path": "vokenization/vokenization.py",
"chars": 15438,
"preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nfrom collections import defaultdict\nimport math\nimport pickle\nimport os\nimp"
},
{
"path": "vokenization/vokenize_corpus_mp.py",
"chars": 13348,
"preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimpo"
},
{
"path": "xmatching/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "xmatching/data.py",
"chars": 6069,
"preview": "# coding=utf-8\nimport json\nfrom pathlib import Path\nimport random\n\nfrom torch.utils.data import Dataset\nfrom torchvision"
},
{
"path": "xmatching/frozen_batch_norm.py",
"chars": 3880,
"preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\n# Note: This file is copied from https://github."
},
{
"path": "xmatching/loss.py",
"chars": 4453,
"preview": "import torch\n\n\ndef hinge(x):\n return torch.clamp(x, min=0.)\n\n\ndef paired_hinge_rank_loss(\n lang_output: torch."
},
{
"path": "xmatching/main.py",
"chars": 12349,
"preview": "import collections\nimport os\nimport pickle\nimport sys\n\nimport torch\nimport torch.multiprocessing as mp\nimport torchvisio"
},
{
"path": "xmatching/metric.py",
"chars": 3289,
"preview": "import torch\n\n\ndef batchwise_accuracy(lang_output, visn_output, lang_mask):\n \"\"\"\n Calculate the accuracy of contex"
},
{
"path": "xmatching/model.py",
"chars": 6639,
"preview": "import torch\nfrom torch import nn\nimport torchvision.models as models\nfrom transformers import *\n\nfrom .frozen_batch_nor"
},
{
"path": "xmatching/param.py",
"chars": 4417,
"preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n# Copyleft 2019 project LXRT.\n\nimport argparse\nimport random\n\nimport numpy a"
}
]
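Among the previews above, `xmatching/loss.py` shows a `hinge` helper (quoted verbatim below) and the opening of a `paired_hinge_rank_loss`. A sentence-level sketch of one common form of such a loss — push each matched (sentence, image) score above every in-batch negative by a margin. Everything past the `hinge` definition is an assumption; the real function's signature also includes a token-level language mask, which this sketch omits:

```python
import torch
import torch.nn.functional as F


def hinge(x):
    # Shown verbatim in the xmatching/loss.py preview.
    return torch.clamp(x, min=0.)


def paired_hinge_rank_loss(lang_output, visn_output, margin=0.5):
    """Assumed sentence-level form: row i of each tensor is a matched pair;
    every off-diagonal score is treated as a negative."""
    # Cosine similarities via normalized dot products -> (B, B) score matrix.
    lang = F.normalize(lang_output, dim=-1)
    visn = F.normalize(visn_output, dim=-1)
    scores = lang @ visn.t()
    pos = scores.diag().unsqueeze(1)                    # (B, 1) positive scores
    off_diag = ~torch.eye(len(scores), dtype=torch.bool)
    # Each negative should trail its positive by at least `margin`.
    return hinge(margin + scores - pos)[off_diag].mean()
```

With perfectly separated embeddings (e.g. identity one-hot rows for both towers), every negative violates the margin by nothing and the loss is zero; random embeddings yield a strictly non-negative penalty.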