Repository: airsplay/vokenization
Branch: master
Commit: 5601b799184e
Files: 70
Total size: 295.5 KB

Directory structure:
gitextract_6h_zq3l4/

├── LICENSE
├── README.md
├── data/
│   ├── lxmert/
│   │   └── .gitignore
│   ├── mscoco/
│   │   └── .gitignore
│   ├── vg/
│   │   └── .gitignore
│   ├── wiki/
│   │   ├── get_data_cased.bash
│   │   ├── get_data_cased_untokenized.bash
│   │   ├── install-tools.sh
│   │   └── tools/
│   │       ├── remove_accent.py
│   │       ├── segment_th.py
│   │       └── tokenize.sh
│   └── wiki103/
│       ├── get_data_cased.sh
│       └── get_data_uncased.sh
├── requirements.txt
├── scripts/
│   ├── base_vlm_wiki.bash
│   ├── base_vlm_wiki_glue.bash
│   ├── base_wiki.bash
│   ├── base_wiki_glue.bash
│   ├── extract_keys.bash
│   ├── mpvokenize_wiki.bash
│   ├── mpvokenize_wiki103.bash
│   ├── run_glue_at_epoch.bash
│   ├── run_glue_epochs.bash
│   ├── run_xmatching.bash
│   ├── small_vlm_wiki103.bash
│   ├── small_vlm_wiki103_glue.bash
│   ├── small_wiki103.bash
│   ├── small_wiki103_glue.bash
│   └── xmatching_benchmark.bash
├── snap/
│   ├── bert/
│   │   └── .gitkeep
│   ├── vlm/
│   │   └── .gitkeep
│   └── xmatching/
│       └── .gitkeep
├── tokenization/
│   ├── to_hdf5.py
│   ├── tokenize_dataset.py
│   ├── tokenize_wiki103_bert.bash
│   ├── tokenize_wiki103_roberta.bash
│   ├── tokenize_wiki_bert.bash
│   └── tokenize_wiki_roberta.bash
├── vlm/
│   ├── __init__.py
│   ├── configs/
│   │   ├── bert-12L-768H.json
│   │   ├── bert-4L-768H.json
│   │   ├── bert-6L-512H.json
│   │   └── bert_base.json
│   ├── data.py
│   ├── model.py
│   ├── param.py
│   ├── run_glue.py
│   ├── run_glue_epochs.py
│   ├── run_lm_distributed.py
│   ├── run_vlm_distributed.py
│   └── show_glue_results_epochs.py
├── vokenization/
│   ├── __init__.py
│   ├── common.py
│   ├── create_image_ids.py
│   ├── evaluate_diversity.py
│   ├── evaluate_retrieval.py
│   ├── extract_vision_keys.py
│   ├── indexing.py
│   ├── revokenization.py
│   ├── revokenize_corpus_mp.py
│   ├── vokenization.py
│   └── vokenize_corpus_mp.py
└── xmatching/
    ├── __init__.py
    ├── data.py
    ├── frozen_batch_norm.py
    ├── loss.py
    ├── main.py
    ├── metric.py
    ├── model.py
    └── param.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Hao Tan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Vokenization

PyTorch code for the EMNLP 2020 paper "[Vokenization: Improving Language Understanding with Contextualized, 
Visual-Grounded Supervision](https://arxiv.org/pdf/2010.06775.pdf)" (Hao Tan and Mohit Bansal).

**Outline**
* [Contextualized Cross-Modal Matching](#contextualized-cross-modal-matching-xmatching)
    * [Downloading Image and Captioning Data](#download-image-and-captioning-data)
    * [Model Training](#training-the-cross-modal-matching-model)
    * [Benchmark (Optional)](#benchmarking-cross-modal-matching-models-optional)
* [Vokenization](#vokenization-vokenization)
    * [Downloading Pure-Language Data](#downloading-and-pre-processing-pure-language-data)
    * [Extracting Visual Feature](#extracting-image-features)
    * [Vokenization Process](#the-vokenization-process)
* [Visually-Supervised Language Model](#visually-supervised-language-model-vlm)
    * [VLM Pre-training](#pre-training-with-vlm)
    * [GLUE Evaluation](#glue-evaluation)
    * [MLM Pre-training (as baselines)](#bert-as-baselines)
    
> Note: I recommend to focus on "Wiki103" first and 
> ingore the code blocks related to "English Wikipedia".
> "Eng Wiki" might take too long to complete.

## Installation
```shell script
pip install -r requirements.txt
```

Require python 3.6 + (to support huggingface [transformers](https://github.com/huggingface/transformers)).

## Contextualized Cross-Modal Matching (xmatching)
In this [module](xmatching) (corresponding to Sec 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)), 
we want to learn a token-image matching model from sentence-image aligned data (i.e., image captioning data).
The model "contextually" measures the relevance between tokens (i.e., words) and images.
The terminology "contextual" emphasize the nature that 
the sentences (the context) are considered
when measuring the token-image relevance score.


### Download Image and Captioning Data
1. Download MS COCO images:
    ```shell script
    # MS COCO (Train 13G, Valid 6G)
    mkdir -p data/mscoco
    wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
    wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
    unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
    unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
    ```
   If you already have COCO image on disk. Save them as 
    ```
    data
      |-- mscoco
            |-- images
                 |-- train2014
                         |-- COCO_train2014_000000000009.jpg
                         |-- COCO_train2014_000000000025.jpg
                         |-- ......
                 |-- val2014
                         |-- COCO_val2014_000000000042.jpg
                         |-- ......
    ```

2. Download captions (split following the LXMERT project):
    ```shell script
    mkdir -p data/lxmert
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
    ```

### Training the Cross-Modal Matching Model
The model is trained on MS COCO with pairwise hinge loss (details in Sec. 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)).

Running Commands:
```bash
# Run the cross-modal matching model with single-machine multi-processing distributed training
# "0,1" indicates using the GPUs 0 and 1.
# "bert_resnext" is the name of this snapshot and would be saved at snap/xmatching/bert_resnext
# "--visn resnext101_32x8d" is the vision backbone
# "--lang bert" is the langaugae backbone
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert
```
The options `--visn` and `--lang` specify the architecture of the encoder.
Tested options 
```
--visn $VISN_MODEL
VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152, 
            wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...} 
--lang $LANG_MODEL
LANG_MODEL={bert, roberta, xlnet, bert-large, ...}
```
For visual backbones, the models in [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) are mostly supported.
You might need to handle the last FC layer, because it is written differently in different backbones.
The language backbones are initialized from huggingface [transformers](https://github.com/huggingface/transformers).

> We found that the results with XLNet is pretty low but have not identified 
> the reason. Results of other backbones are similar.

## Vokenization (vokenization)
The vokenization is a bridge between the cross-modality (words-and-image) matching models (xmatching) and 
visually-supervised lagnauge models (vlm).
The final goal is to convert the language tokens to related images 
(we called them **vokens**).
These **vokens** enable the visual supervision of the language model.
We mainly provide pr-eprocessing tools (i.e., feature extraction, tokenization, and vokenization) and
evaluation tools of previous cross-modal matching models here.
Here is a diagram of these processes and we next discuss them one-by-one:
```
Extracting Image Features-----> Benchmakring the Matching Models (Optional) --> Vokenization
Downloading Language Data --> Tokenization -->-->--/
```

### Downloading and Pre-Processing Pure-Language Data 
We provide scripts to get the datasets "wiki103" and "wiki".
We would note them as "XX-cased" or "XX-uncased" where the suffix "cased" / "uncased" only indicates
the property of the raw text.
1. **Wiki103**. The [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset
is a seleted subset of English Wikipedia, containing around 100M tokens.
    ```shell script
    bash data/wiki103/get_data_cased.sh
    ```
2. **English Wikipedia**. 
The script to download and process wiki data are modified from [XLM](https://github.com/facebookresearch/XLM).
It will download a 17G file. 
The speed depends on the networking and it usually takes several hours to filter the data.
The process ends with around 2.8B tokens.
    ```shell script
    bash data/wiki/get_data_cased.bash en
    ```
    Note: For *RoBERTa*, it requires an untokenized version of wiki (o.w. the results would be much lower), 
    so please use the following command:
    ```shell script
    bash data/wiki/get_data_cased_untokenized.bash en
    ```
   
> Note: I recommend to focus on "Wiki103" first and 
> ingore the code blocks related to "English Wikipedia".
> "Eng Wiki" might take too long to complete.
   
### Tokenization of Language Data
We next tokenize the language corpus.
It would locally save three files: 
"$dataset_name.$tokenizer_name", 
"$dataset_name.$tokenizer_name.hdf5",
and "$dataset_name.$tokenizer_name.line".
Taking the wiki103 dataset and BERT tokenizer as an example, 
we convert the training file into
```
data 
 |-- wiki103-cased 
        |-- wiki.train.raw.bert-base-uncased
        |-- wiki.train.raw.bert-base-uncased.hdf5
        |-- wiki.train.raw.bert-base-uncased.line
```
The txt file `wiki.train.raw.bert-base-uncased` saves the tokens and each line in this file is the tokens of a line 
in the original file,
The hdf5 file `wiki.train.raw.bert-base-uncased.hdf5` stores all the tokens continuously and use
`wiki.train.raw.bert-base-uncased.line` to index the starting token index of each line.
The ".line" file has `L+1` lines where `L` is the number of lines in the original files.
Each line has a range "line[i]" to "line[i+1]" in the hdf5 file.

Commands:
1. Wiki103 (around 10 min)
    ```shell script
    bash tokenization/tokenize_wiki103_bert.bash 
    ```
2. English Wikipedia (around 3 hours)
    ```shell script
    bash tokenization/tokenize_wiki_bert.bash 
    ```

### Extracting Image Features
The image pre-processing extracts the image features to build the keys in the vokenization retrieval process.

#### Download the Visual Genome (VG) images
Since MS COCO images are used in training the cross-modal matching model
as in [xmatching](#contextualized-cross-modal-matching-xmatching).
We will use the [Visual Genome](https://visualgenome.org/) images as 
candidate vokens for retrievel.
We here download the images first.
```shell script
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/
unzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip
unzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../
```
If you already have Visual Genome image on disk. Save them as 
```
data
|-- vg
    |-- images
         |-- 1000.jpg
         |-- 1001.jpg
         |-- ......
```
    
#### Build Universal Image Ids
We first build a list of universal image indexes with 
[vokenization/create_image_ids.py](vokenization/create_image_ids.py). 
It is used to unify the image ids in different experiments 
thus the feature array stored in hdf5 could be universally indexed.
The image ids are saved under a shared path `LOCAL_DIR` (default to `data/vokenization`)
 defined in [vokenization/common.py](vokenization/common.py).
The image ids are saved under `data/vokenization/images` with format `{IMAGE_SET}_ids.txt`.
We will make sure that all the experiments agree with this meta info,
so that we would not get different indexing in different retrieval experiments.

> Note: The ids created by [create_image_ids.py](vokenization/create_image_ids.py) are only the order of the images.
> The actual images in the dictionary are provided by `extract_keys.bash`, thus is corresponding to the 
> `_paths.txt`, because the `extract_keys` will filter all broken images and non-existing images.

Commands:
```bash
# Step 1, Build image orders.
python vokenization/create_image_ids.py  
```

#### Extracting Image Features

Extract image features regarding the list built above, using code 
[vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py). 
The code will first read the image ids saved in `data/vokenization/images/{IMAGE_SET}_ids.txt` and locate the images.
The features will be saved under `snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5`.
It finishes within 1 hour.

Commands:
```bash
# Step 2, Extract features. 
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME 
bash scripts/extract_keys.bash 0 bert_resnext 
```


### Benchmarking Cross-Modal Matching Models (Optional)
> Before evaluating, please make sure that `extracting_image_features` and `tokenization` are completed.

We benchmark the performance of cross-modal matching models from large scale.
The evaluation includes two different metrics: diversity and the retrieval performance.

Diversity 
(in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py))
ensures that the same [token type](https://arxiv.org/pdf/1902.06006.pdf)
is mapped to diverse images regarding its context (i.e., the sentence).
Retrieval 
(in [vokenization/evaluate_retrieval.py](vokenization/evaluate_retrieval.py)) 
measures the correspondence of the token and the retrieved images.

We gather these two utils into one script and the command here:
```bash
bash scripts/xmatching_benchmark.bash 0 bert_resnext
```

### The Vokenization Process
After all these steps, we could start to vokenize the language corpus.
It would load the tokens saved in `dataset_name.tokenizer_name.hdf5` 
and uses the line-split information in `dataset_name.tokenzier_name.line`.

The code is optimized and could be continued by just rerunning it.
The vokens will be saved in `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5` by default.
The file `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids` contains the universal image ids 
for each voken, 
e.g., the image id `vg_nococo/8` corresponds to 8-th feature
saved in `snap/xmatching/bert_resnext/keys/vg_nococo.hdf5`.


> Note: `--tokenizer-name` must be provided in the script.


Commands
1. Wiki103 (around 1 hour on 4 Titan V)
    ```shell script
    # Note: mp is the abbreviation for "multi-processing"
    # bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
    ```
2. English Wikipedia (around 1 day on 4 Titan V)
    ```shell script
    # bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext
    ```

> The script will call
> [vokenization/vokenize_corpus_mp.py](vokenization/vokenize_corpus_mp.py)
> to vokenize a corpus. 
> The vokenziation happens in [vokenization/vokenization.py](vokenization/vokenization.py) and
> it use [vokenization/indexing.py](vokenization/indexing.py) to do nearest neighbor search
> (based on [faiss](https://github.com/facebookresearch/faiss)).


## Visually-Supervised Language Model (vlm)

### Pre-Training with VLM
As discussed in Sec. 2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf),
we use previous generated vokens to pre-train the model 
with visual supervision.

#### Wiki103 
After the [vokenization process](#the-vokenization-process) of wiki103,
we could run the model with command:
```shell script
# bash scripts/small_vlm_wiki103_glue.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small
```
It will call 
[vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the support of voken supervisions.
The snapshot will be saved to `snap/vlm/wiki103_bert_small`.
We recommend to run this Wiki103 experiment first since it will finish 
in a reasonable time (20 hours).
The pure BERT pre-training option is also available [later](#bert-as-baselines)
for comparisons.

Note: defautly, the mixed-precision training is not used.
To support the mixed precision pre-training, 
please install the [nvidia/apex](https://github.com/NVIDIA/apex) library with command:
```shell script
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
After that, you could bring back the option `--fp16` and `--fp16_opt_level O2` in 
the script `scripts/small_vlm_wiki103.bash`.
I recommend to use `--fp16_opt_level O2`.
Although the option O2 might be [unstable](https://github.com/NVIDIA/apex/issues/818#issuecomment-639012282),
it saves a lot memory:
the max per-gpu-batch-size is 32 with O1 but 64 with O2.

#### English Wikipedia
After the [vokenization process](#the-vokenization-process) of wiki103,
we could run the model with command:
```shell script
# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base
```
It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia
dataset with the support of voken supervisions.
The snapshot will be saved to `snap/vlm/wiki_bert_base`.

It takes around 3-5 days on 4 Titan V / GTX 2080
and around 5-7 days to finish in 4 Titan Pascal/T4 cards.
(This estimation is accurate since I inevitably run experiments on all these servers...).
Titan V / 2080 / T4 have native support of mixed precision training (triggered by `--fp16` option and need
installing [apex](https://github.com/NVIDIA/apex)).
The speed would be much faster.
Titan Pascal would also save some memory with the `--fp16` option.


### GLUE Evaluation
We defautly use the [GLUE](https://gluebenchmark.com/) benchmark
(e.g., [SST](https://nlp.stanford.edu/sentiment/index.html),
[MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398),
[QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs),
[MNLI](https://cims.nyu.edu/~sbowman/multinli/),
[QNLI](https://rajpurkar.github.io/SQuAD-explorer/),)
 as downstreaming tasks.
Other tasks could be evaluated following the setup [here](https://github.com/huggingface/transformers/tree/28d183c90cbf91e94651cf4a655df91a52ea1033/examples)
by changing the option `--model_name_or_path` to the correct snapshot path `snap/bert/wiki103`.

#### Download GLUE dataset
This downloaindg scrip is copied from [huggingface transformers](https://github.com/huggingface/transformers/tree/master/examples/text-classification)
project.
Since the [transformers](https://github.com/huggingface/transformers) is still under dense
development, the change of APIs might affect the code. 
I have upgraded the code compaticability to transformers==3.3.
```shell script
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py
python download_glue_data.py --data_dir data/glue --tasks all
```

#### Finetuning on GLUE Tasks
The pre-trained snapshots are evaluated by fine-tuning them on the [GLUE](https://gluebenchmark.com/) 
benchmark.
The code are modified from the huggingface [transformers](https://github.com/huggingface/transformers).

Running GLUE evaluation for snapshots from different epochs:
```bash
# bash scripts/run_glue_epochs.bash $GPUS #SNAP_PATH --snaps $NUM_OF_SNAPS                            
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7                            
```
It will assess 7 snaps using all 0,1,2,3 GPUs. 
Setting `snaps=-1` will assess all checkpoints.
If you just want to evaluate the last (usually the best) snapshot, please use:
```
bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1
```

#### Showing the results
For all results saved under `snap/` (whatever the dir names),
running the folloing command will print out all the results.
```bash
python vlm/show_glue_results_epochs.py 
```

It will print results like
```
snap/vlm/test_finetune/glueepoch_checkpoint-epoch0019
     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   54.51   84.72   87.18   52.32   90.02   88.36   87.16   81.92   82.57   78.75
snap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029
     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   58.12   82.76   84.45   26.74   89.56   84.40   86.52   77.56   77.99   74.23
```

### BERT (As baselines)
We also provide pure language-model pre-training as baselines.

#### Wiki103
```shell script
# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small
```
It will call 
[vlm/run_lm_distributed.py](vlm/run_lm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the masked language model only.
The snapshot will be saved to `snap/bert/wiki103_bert_small`.

Or you could directly using the script `small_wiki103_glue.bash` to 
enable GLUE evaluation after finishing pre-training.
```shell script
bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small
```

#### English Wikipedia
Command:
```shell script
# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki
```

With GLUE evaluation:
```shell script
bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki
```

## Pre-processed Data and Pre-trained Models
### Data

Wiki103 (100M tokens)
```
mkdir -p data/wiki103-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
```

Wiki (2800 M tokens)
```
mkdir -p data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased
```

### Models
- Cross-Modal Matching model: [https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip](https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip)
- BERT (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip)
- BERT + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip)
- RoBERTa + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip)

## Reference
If you find our project useful, please cite this paper:
```
@inproceedings{tan2020vokenization,
  title={Vokenization: Improving Language Understanding with Contextualized, 
Visual-Grounded Supervision},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}
```

## Acknowledgement
I thank the support from [Bloomberg Data Science Ph.D. Fellowship](https://www.techatbloomberg.com/bloomberg-data-science-ph-d-fellowship/).
We thank the reviewers and [Yixin Nie](https://easonnie.github.io/) 
and [Jie Lei](https://www.cs.unc.edu/~jielei/)
for their helpful discussions.
Part of the code are built based on huggingface [transformers](https://github.com/huggingface/transformers) and 
facebook [xlm](https://github.com/facebookresearch/XLM) and [faiss](https://github.com/facebookresearch/faiss).

4K3.


================================================
FILE: data/lxmert/.gitignore
================================================
/mscoco_minival.json
/mscoco_nominival.json
/mscoco_train.json
/vgnococo.json


================================================
FILE: data/mscoco/.gitignore
================================================
/images


================================================
FILE: data/vg/.gitignore
================================================
/images


================================================
FILE: data/wiki/get_data_cased.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copy frrom https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# Usage: ./get-data-wiki.sh $lg (en)
#

set -e

lg=$1  # input language

# data path
WIKI_PATH=data/wiki-cased
MAIN_PATH=$WIKI_PATH

# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py

# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME

# install tools
data/wiki/install-tools.sh $TOOLS_PATH

# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt

# download Wikipedia dump
echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"

# extract and tokenize Wiki data
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
#python -m $TOOLS_PATH/wikiextractor/wikiextractor/WikiExtractor $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
  | sed "/^\s*\$/d" \
  | grep -v "^<doc id=" \
  | grep -v "</doc>\$" \
  | $TOKENIZE $lg $TOOLS_PATH \
  | python $REMOVE_ACCENT \
  > $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Tokenized ( + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***"

# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
    NLINES=`wc -l $1  | awk -F " " '{print $1}'`;
    NTRAIN=$((NLINES - 10000));
    NVAL=$((NTRAIN + 5000));
    cat $1 | head -$NTRAIN             > $2;
    cat $1 | head -$NVAL | tail -5000  > $3;
    cat $1 | tail -5000                > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw

# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt


================================================
FILE: data/wiki/get_data_cased_untokenized.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copy frrom https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# Usage: ./get-data-wiki.sh $lg (en)
#

set -e

lg=$1  # input language

# data path
WIKI_PATH=data/wiki-cased-untokenized
MAIN_PATH=$WIKI_PATH

# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py

# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME

# install tools
data/wiki/install-tools.sh $TOOLS_PATH

# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt

# download Wikipedia dump
if [ ! -f $WIKI_PATH/bz2/enwiki-latest-pages-articles.xml.bz2 ]; then
    echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
    wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
    echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"
fi

# extract and tokenize Wiki data
#cd $MAIN_PATH
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
  | sed "/^\s*\$/d" \
  | grep -v "^<doc id=" \
  | grep -v "</doc>\$" \
  | python $REMOVE_ACCENT \
  > $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Not Tokenized ( but + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***"

# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
    NLINES=`wc -l $1  | awk -F " " '{print $1}'`;
    NTRAIN=$((NLINES - 10000));
    NVAL=$((NTRAIN + 5000));
    cat $1 | head -$NTRAIN             > $2;
    cat $1 | head -$NVAL | tail -5000  > $3;
    cat $1 | tail -5000                > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw

# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt


================================================
FILE: data/wiki/install-tools.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

set -e

# data path
TOOLS_PATH=$1

# tools
MOSES_DIR=mosesdecoder
FASTBPE_DIR=fastBPE
FASTBPE=fast
WMT16_SCRIPTS=wmt16-scripts

# tools path
mkdir -p $TOOLS_PATH

# Copy the scripts to TOOLS_PATH
cp -r data/wiki/tools/* $TOOLS_PATH


#
# Download and install tools
#

old=$(pwd)
cd $TOOLS_PATH


# Download Moses
if [ ! -d "$MOSES_DIR" ]; then
  echo "Cloning Moses from GitHub repository..."
  git clone https://github.com/moses-smt/mosesdecoder.git
fi

# Download fastBPE
if [ ! -d "$FASTBPE_DIR" ]; then
  echo "Cloning fastBPE from GitHub repository..."
  git clone https://github.com/glample/fastBPE
fi

# Compile fastBPE
if [ ! -f "$FASTBPE_DIR/$FASTBPE" ]; then
  echo "Compiling fastBPE..."
  cd fastBPE
  g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
  cd ..
fi

# Download Sennrich's tools
if [ ! -d "$WMT16_SCRIPTS" ]; then
  echo "Cloning WMT16 preprocessing scripts..."
  git clone https://github.com/rsennrich/wmt16-scripts.git
fi

# Download WikiExtractor
if [ ! -d wikiextractor ]; then
    echo "Cloning WikiExtractor from GitHub repository..."
    git clone https://github.com/attardi/wikiextractor.git
    cd wikiextractor
    git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
    cd ..
fi

cd $old

# # Chinese segmenter
# if ! ls $TOOLS_PATH/stanford-segmenter-* 1> /dev/null 2>&1; then
#   echo "Stanford segmenter not found at $TOOLS_PATH/stanford-segmenter-*"
#   echo "Please install Stanford segmenter in $TOOLS_PATH"
#   exit 1
# fi
# 
# # Thai tokenizer
# if ! python -c 'import pkgutil; exit(not pkgutil.find_loader("pythainlp"))'; then
#   echo "pythainlp package not found in python"
#   echo "Please install pythainlp (pip install pythainlp)"
#   exit 1
# fi
# 


================================================
FILE: data/wiki/tools/remove_accent.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import sys
import unicodedata
import six


def convert_to_unicode(text):
    """
    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.
    """
    # six_ensure_text is copied from https://github.com/benjaminp/six
    def six_ensure_text(s, encoding='utf-8', errors='strict'):
        if isinstance(s, six.binary_type):
            return s.decode(encoding, errors)
        elif isinstance(s, six.text_type):
            return s
        else:
            raise TypeError("not expecting type '%s'" % type(s))

    return six_ensure_text(text, encoding="utf-8", errors="ignore")


def run_strip_accents(text):
    """
    Strips accents from a piece of text.
    """
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
        if cat == "Mn":
            continue
        output.append(char)
    return "".join(output)


for line in sys.stdin:
    line = convert_to_unicode(line.rstrip())
    line = run_strip_accents(line)
    print(u'%s' % line)


================================================
FILE: data/wiki/tools/segment_th.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import sys
from pythainlp.tokenize import word_tokenize

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    print(' '.join(word_tokenize(line)))


================================================
FILE: data/wiki/tools/tokenize.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

# Tokenize text data in various languages
# Usage: e.g.   cat wiki.ar | tokenize.sh ar

set -e

N_THREADS=8

lg=$1
TOOLS_PATH=$2

# moses
MOSES=$TOOLS_PATH/mosesdecoder
REPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl
TOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl

# Chinese
if [ "$lg" = "zh" ]; then
  $TOOLS_PATH/stanford-segmenter-*/segment.sh pku /dev/stdin UTF-8 0 | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR
# Thai
elif [ "$lg" = "th" ]; then
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $TOOLS_PATH/segment_th.py
# Japanese
elif [ "$lg" = "ja" ]; then
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | kytea -notags
# other languages
else
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
fi


================================================
FILE: data/wiki103/get_data_cased.sh
================================================
OUTPUT=data/wiki103-cased
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-raw-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103-raw/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-raw-v1.zip $OUTPUT/wikitext-103-raw


================================================
FILE: data/wiki103/get_data_uncased.sh
================================================
OUTPUT=data/wiki103

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-v1.zip $OUTPUT/wikitext-103


================================================
FILE: requirements.txt
================================================
torch
#==1.4.0
torchvision
#==0.5.0
transformers==3.3.0
tensorboardX

# For GLUE evaluation
sklearn

# Fiass supports fast indexing.
# The code has a torch-implemented GPU indexing, so do not worry if you could not install faiss.
faiss-gpu>=1.6.3

# Spacy is used in sentence segmentation where the sentences are the input the cross-modality matching model.
spacy

# A higher h5py version to support h5py.VirtualLayout
h5py>=2.10.0


================================================
FILE: scripts/base_vlm_wiki.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
    --max_steps=200000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


================================================
FILE: scripts/base_vlm_wiki_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
    --max_steps=200000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4


================================================
FILE: scripts/base_wiki.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp run_glue_epochs.bash $output/run_glue_epochs.bash
cp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
    --max_steps 220000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


================================================
FILE: scripts/base_wiki_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp run_glue_epochs.bash $output/run_glue_epochs.bash
cp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
    --max_steps 220000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

#--shuffle \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps -1


================================================
FILE: scripts/extract_keys.bash
================================================
CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \
    --image-sets vg_nococo,coco_minival,coco_nominival,coco_train,cc_valid \
    --load-dir snap/xmatching/$2


================================================
FILE: scripts/mpvokenize_wiki.bash
================================================
GPU=$1

LOAD=snap/xmatching/$2
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased

for DATA_NAME in en.valid.raw en.test.raw en.train.raw
do 
    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
        --load $LOAD \
        --corpus=$DATA_DIR/$DATA_NAME \
        --tokenizer-name $TOKENIZER \
        --image-sets vg_nococo \
        --max-img-num 50000 
done


================================================
FILE: scripts/mpvokenize_wiki103.bash
================================================
GPU=$1

LOAD=snap/xmatching/$2
WIKI_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased

for DATA_NAME in wiki.valid.raw wiki.test.raw wiki.train.raw
do 
    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
        --load $LOAD \
        --corpus=$WIKI_DIR/$DATA_NAME \
        --tokenizer-name $TOKENIZER \
        --image-sets vg_nococo \
        --max-img-num 50000
done


================================================
FILE: scripts/run_glue_at_epoch.bash
================================================
export GLUE_DIR=data/glue/
EPOCHS=$2
MODEL=$3
CKPT=$4

for TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI
do
    CUDA_VISIBLE_DEVICES=$1 python vlm/run_glue.py \
        --model_type bert \
        --tokenizer_name=bert-base-uncased \
        --model_name_or_path $MODEL/$CKPT \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --do_lower_case \
        --data_dir $GLUE_DIR/$TASK_NAME \
        --save_steps -1 \
        --max_seq_length 126 \
        --per_gpu_eval_batch_size=32   \
        --per_gpu_train_batch_size=32   \
        --learning_rate 1e-4 \
        --warmup_steps 0.1 \
        --num_train_epochs $EPOCHS.0 \
        --output_dir $MODEL/glueepoch_$CKPT/$TASK_NAME
done

        #--overwrite_output_dir \


================================================
FILE: scripts/run_glue_epochs.bash
================================================
GPUS=$1
MODEL=$2
 
python vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \
    ${@:3}


================================================
FILE: scripts/run_xmatching.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/xmatching/$NAME
mkdir -p $output/src/
cp -r xmatching $output/src/
cp $0 $output/run.bash

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python xmatching/main.py \
    --train-imgs mscoco_train,mscoco_nominival --valid-imgs mscoco_minival \
    --train-langs mscoco --valid-langs mscoco \
    --max-len 20 --dim 64 \
    --lang-layers 4,3,2,1 \
    --lang-pretrained --visn-pretrained \
    --num-workers 8 --batchSize 256 --optim adam --lr 1e-3 --epochs 20 \
    --nodes 1 --nr 0 \
    --output $output ${@:3} | tee $output/log.log

#--visn resnext101_32x8d --lang bert \


================================================
FILE: scripts/small_vlm_wiki103.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
	--num_train_epochs=40 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


================================================
FILE: scripts/small_vlm_wiki103_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
	--num_train_epochs=40 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4


================================================
FILE: scripts/small_wiki103.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
	--num_train_epochs=44 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


================================================
FILE: scripts/small_wiki103_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
	--num_train_epochs=44 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4


================================================
FILE: scripts/xmatching_benchmark.bash
================================================
# Benchmarking the cross-modal matching model with
#     1. Retrieval scores.
#     2. Voken Diversity w.r.t words in specific language corpus.
# Please run this after image_key_retrivel and tokenization. 
#    i.e., step 1 and step2 in readme.md

MODEL=$2
MODELPATH=snap/xmatching/$MODEL
rm -rf $MODELPATH/analysis.log

# Retrieval scores
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_retrieval.py \
    --load $MODELPATH \
    --image-sets coco_minival,cc_valid \
    | tee -a $MODELPATH/analysis.log

# Diversity
# Test diversity of vision-and-language (captioning) datasets
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
    --load $MODELPATH \
    --image-sets vg_nococo \
    --corpus coco_minival,cc_valid \
    | tee -a $MODELPATH/analysis.log

# Test diversity of pure-language corpus
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
    --load $MODELPATH \
    --image-sets vg_nococo \
    --corpus data/wiki103-cased/wiki.valid.raw \
    --maxsents 95000 \
    | tee -a $MODELPATH/analysis.log


================================================
FILE: snap/bert/.gitkeep
================================================


================================================
FILE: snap/vlm/.gitkeep
================================================


================================================
FILE: snap/xmatching/.gitkeep
================================================
/*


================================================
FILE: tokenization/to_hdf5.py
================================================
import h5py
import numpy as np
import tqdm

from transformers import AutoTokenizer


def validate_hdf5(fname, tokenizer_name):
    print("--------------------------------------------")
    print("Start to valid the hdf5 file", fname + '.' + tokenizer_name + '.hdf5')

    with open(fname) as f:
        lines = []
        for line in f:
            if 'wiki' in fname:
                # Wiki103: remove document title
                if line.startswith(' = '):
                    continue
                # Full Wiki: Remove the too short lines.
                if len(line.strip().split(' ')) < 5:
                    continue

            if len(line.strip()) == 0:
                # Always drop empty line
                continue
            lines.append(line)

    # Use the slow tokenizer to validate the results of the fast tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'r')
    tokens = h5_file['tokens']

    print("Start to check the first 10 lines:")
    ids = []
    for line in lines[:10]:
        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
    ids = np.array(ids)
    first_tokens = np.array(tokens[:len(ids)])
    if np.array_equal(ids, first_tokens):
        print("PASS")
    else:
        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
        print()
        print(' '.join(tokenizer.convert_ids_to_tokens(first_tokens)))
        assert False, "FAIL"

    print("Start to check the last 10 lines:")
    ids = []
    for line in lines[-10:]:
        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
    ids = np.array(ids)
    last_tokens = np.array(tokens[-len(ids):])
    if np.array_equal(ids, last_tokens):
        print("PASS")
    else:
        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
        print(' '.join(tokenizer.convert_ids_to_tokens(last_tokens)))
        assert False, "FAIL"
    print("--------------------------------------------")


def to_hdf5(fname, tokenizer_name, validate=True):
    print("Process %s" % fname)

    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'w')
    dset = h5_file.create_dataset("tokens",
                                  (0,),
                                  maxshape=(None,),
                                  dtype='int32')

    dump_interval = 1000000
    dump_iter = 0
    with open('%s.%s' % (fname, tokenizer_name)) as f:
        lines = 0
        tokens = []
        for line in tqdm.tqdm(f):
            for token in map(int, line.split(' ')):
                tokens.append(token)
            if len(tokens) >= dump_interval:
                dset.resize((dump_iter + len(tokens),))
                dset[dump_iter: dump_iter + len(tokens)] = tokens
                dump_iter += len(tokens)
                tokens = []
            lines += 1

        dset.resize((dump_iter + len(tokens),))
        dset[dump_iter: dump_iter + len(tokens)] = tokens
        dump_iter += len(tokens)

    assert len(dset) == dump_iter
    h5_file.close()

    if validate:
        validate_hdf5(fname, tokenizer_name)

    print()


================================================
FILE: tokenization/tokenize_dataset.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.

import argparse
from pathlib import Path

from transformers import AutoTokenizer
import time

from to_hdf5 import to_hdf5

def tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=False):
    data_path = Path(data_dir)

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

    f = open(data_path / fname)
    g = open((data_path / ('%s.%s' % (fname, tokenizer_name))), 'w')

    # Statistics
    dcmt_cnt = 0
    token_cnt = 0
    line_cnt = 0
    line_starts = []

    # Logging and dumping hyper-parameters
    cache = ''
    log_interval = log_iter = 1000000
    dump_interval = dump_iter = 100000
    start_time = time.time()

    for i, line in enumerate(f):
        # Identify the start of documents, ignore it.
        if 'wiki103' in data_dir:
            if line.startswith(' = '):
                dcmt_cnt += 1
                continue
        elif 'wiki' in data_dir:
            if len(line.strip().split(' ')) == 1:
                dcmt_cnt += 1
                continue

        if 'wiki' in data_dir:
            # Remove too short lines. Book corpus does not need this.
            if len(line.strip().split(' ')) < 5:
                continue

        # Drop empty line (1)
        if len(line.strip()) == 0:
            continue

        tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
        # tokenized_line = tokenizer.encode(line, add_special_tokens=False)
        if len(tokenized_line) == 0:    # Drop empty line (2)
            continue

        line_cnt += 1
        line_starts.append(token_cnt)
        if i < 5:
            print()
            print('Line:', line)
            print('Tokens:', ' '.join(tokenizer.convert_ids_to_tokens(tokenized_line)))
        token_cnt += len(tokenized_line)
        cache += ' '.join(map(str, tokenized_line)) + '\n'

        if (token_cnt + 1) > dump_iter:
            g.write(cache)
            cache = ''
            dump_iter += dump_interval

        if (token_cnt + 1) > log_iter:
            used_time = time.time() - start_time
            print("Process %d tokens in %d seconds, %0.4f tokens per second." % (
                token_cnt, used_time, token_cnt / used_time))
            log_iter += log_interval

    # Deal with the last remaining tokens.
    line_starts.append(token_cnt)
    g.write(cache)

    # Dump Line starts
    identifier = 'sent' if lines_are_sents else 'line'
    with open(data_path / ('%s.%s.%s' % (fname, tokenizer_name, identifier)), 'w') as f:
        for line_start in line_starts:
            f.write(str(line_start) + "\n")

    f.close()
    g.close()
    print(f"Documents: {dcmt_cnt}, Lines: {line_cnt}, Words: {token_cnt} in dataset {fname}")

    to_hdf5(str(data_path / fname), tokenizer_name)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "datadir", default=None, type=str, help="The input training data file (a text file)."
    )
    parser.add_argument(
        "fname", default=None, type=str, help="The input training data file (a text file)."
    )
    parser.add_argument(
        "tokenizer_name", default=None, type=str, help="The input training data file (a text file)."
    )
    parser.add_argument(
        "--lines-are-sents", action='store_true',
        help="Add this if the line are already segmented to sentences, instead of paragraphs."
    )

    param = parser.parse_args()

    tokenize_dataset(
        param.datadir,
        param.fname,
        param.tokenizer_name,
        param.lines_are_sents,
    )


================================================
FILE: tokenization/tokenize_wiki103_bert.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki103_roberta.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki_bert.bash
================================================
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki_roberta.bash
================================================
DATA_DIR=data/wiki-cased-untokenized/
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER


================================================
FILE: vlm/__init__.py
================================================
import data


================================================
FILE: vlm/configs/bert-12L-768H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert-4L-768H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert-6L-512H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 512,
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert_base.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/data.py
================================================
import copy
import os
import random

import h5py
import torch
from torch.utils.data import DataLoader, Dataset
import tqdm


class CoLDataset(Dataset):
    IGNORE_ID = -100
    sent_strategy = 'first'

    def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512,
                 split_sent=False, voken_dir=None, suffix=None, verbose=False,
                 voken_ablation=None):

        # Open token's hdf5
        token_path = file_path + '.' + tokenizer_name + '.hdf5'
        assert os.path.isfile(token_path)
        if verbose:
            print("-------- Load Data -------")
            print("Load tokens from", token_path)
        self.token_hdf5 = h5py.File(token_path, 'r')
        self.tokenizer = tokenizer
        self.tokens = self.token_hdf5['tokens']
        self.verbose = verbose
        self.voken_ablation = voken_ablation
        self._iter_cnt = 0

        # Open voken's hdf5 and load voken ids
        if voken_dir is not None:
            assert suffix is not None, 'Please provide suffix of the voken, e.g., vg_nococo.5000.'
            self.sent_level = 'sent' in voken_dir
            dset_fname = os.path.split(file_path)[-1]
            voken_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.hdf5")
            voken_ids_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.ids")
            if verbose:
                print("Load vokens from", voken_path)
            self.voken_hdf5 = h5py.File(voken_path, 'r')
            self.vokens = self.voken_hdf5['vokens']
            assert len(self.vokens) == len(self.tokens)
            self._voken_ids = list(
                map(lambda x: x.strip(),
                    open(voken_ids_path).readlines())
            )
            if verbose:
                print("\t with voken size", self.voken_size)
                print("\t top 5 voken ids are:", self._voken_ids[:5])
        else:
            self.vokens = None

        # Split for every block_size tokens
        # The last block without full length will be dropped.
        num_tokens = len(self.tokens)
        self.starts = list(range(0, num_tokens, block_size))
        self.batches = list(zip(self.starts[:-1], self.starts[1:]))

        manual_filtered =False
        if "en.train.raw" in file_path and tokenizer_name == "bert-base-uncased":
            self.batches = manual_filter(self.batches)
            if verbose:
                print("Data: Mannually filter the range for counties.")
            manual_filtered = True

        # batch_info
        if verbose:
            print("Split sent with block size", block_size)
            print(f"Total batches: {len(self.batches)}")
            print(f"Total tokens: {len(self.tokens)}")
            if voken_dir is not None:
                print(f"Total vokens: {len(self.vokens)}")
            if voken_ablation is not None:
                print("The model will process voken ablation strategy:", voken_ablation)
            print()

        block_check(self.batches, block_size, fixed_size=True, manual_filtered=manual_filtered)
        if self.voken_ablation == 'token':
            self._voken_ids = list(range(30522))

    @property
    def voken_size(self):
        return len(self._voken_ids)

    @property
    def voken_ids(self):
        return copy.copy(self._voken_ids)

    def assert_equal_vokens(self, dataset):
        assert self.voken_size == dataset.voken_size
        for vid, vid1 in zip(self.voken_ids, dataset.voken_ids):
            assert vid == vid1

    def __len__(self):
        return len(self.batches) - 1

    def __getitem__(self, item):
        token_start, token_end = self.batches[item]
        if self._iter_cnt < 5 and self.verbose:
            print(f"Data Loader: data iteration {self._iter_cnt}, with range {token_start} to {token_end}.")
            self._iter_cnt += 1
        tokens = list(self.tokens[token_start: token_end])
        token_tensor = torch.tensor(
            self.tokenizer.build_inputs_with_special_tokens(tokens),
            dtype=torch.long)
        if self.vokens is not None:
            vokens = list(self.vokens[token_start: token_end])

            vokens = self.maybe_do_sent_level(vokens)
            vokens = self.maybe_do_ablation_study(vokens, tokens)

            voken_tensor = torch.tensor(
                [self.IGNORE_ID] + vokens + [self.IGNORE_ID],
                dtype=torch.long
            )

            return token_tensor, voken_tensor
        else:
            return token_tensor

    def maybe_do_sent_level(self, vokens):
        if not self.sent_level:
            return vokens
        else:
            if self.sent_strategy == 'all':
                vokens = [
                    (-voken-1 if voken < 0 else voken)
                    for voken in vokens
                ]
            elif self.sent_strategy == 'first':
                vokens = [
                    (self.IGNORE_ID if voken < 0 else voken)
                    for voken in vokens
                ]
            return vokens

    def maybe_do_ablation_study(self, vokens, tokens):
        if self.voken_ablation is None:
            return vokens
        else:
            if self._iter_cnt < 5 and self.verbose:
                print("Before voken ablation: ", vokens)
            if self.voken_ablation == 'random':
                vokens = [random.randint(0, self.voken_size - 1)
                          for _ in range(len(vokens))]
            elif self.voken_ablation == 'shuffle':
                random.shuffle(vokens)
            elif self.voken_ablation == 'reverse':
                vokens = vokens[::-1]
            elif self.voken_ablation == 'token':
                vokens = tokens
            if self._iter_cnt < 5 and self.verbose:
                print("After voken ablation: ", vokens)
            return vokens

    def get_item_info(self, item):
        token_start = self.batches[item]
        token_end = self.batches[item + 1]
        return token_start, token_end

    def __del__(self):
        self.token_hdf5.close()
        if self.vokens is not None:
            self.voken_hdf5.close()


FORBIDDEN_RANGE = (
    119314944,      # Start of iter 3700
    187053048       # End of iter 5800
)


def intersect(x, y):
    x1, x2 = x
    y1, y2 = y
    if x2 <= y1 or x2 >= y2:
        # Case 1: [   x    )[   y    )
        # Case 2: [   y    )[   x    )
        return False
    return True


def manual_filter(batches):
    batches = list(filter(
        lambda x: not intersect(x, FORBIDDEN_RANGE),
        batches
    ))
    return batches


def block_check(batches, block_size, fixed_size=False, manual_filtered=False):
    """
    Check whether the batches satisfy following requirements.
        1. Monotonic
        2. Mutually exclusive
        3. Range < block_size
    """
    last_end = 0
    for start_token, end_token in batches:
        assert last_end <= start_token
        if fixed_size:
            assert (end_token - start_token) == block_size, 'len([%d, %d)) != %d' % (start_token, end_token, block_size)
        else:
            assert (end_token - start_token) <= block_size, 'len([%d, %d)) > %d' % (start_token, end_token, block_size)
        if manual_filtered:
            assert not intersect((start_token, end_token), FORBIDDEN_RANGE)
        last_end = end_token


def get_voken_feats(dataset: CoLDataset, feat_dir: str):
    """
    Load pre-extracted visual features regarding img_ids of vokens.
    """
    set2id2feat = {}
    voken_feats = []
    for voken_id in dataset.voken_ids:
        voken_img_set, voken_img_id = voken_id.split('/')
        if voken_img_set not in set2id2feat:
            img_ids = list(map(
                lambda x: x.rstrip(),
                open(os.path.join(feat_dir, f"{voken_img_set}.ids"))
            ))
            img_feats = h5py.File(
                os.path.join(feat_dir, f"{voken_img_set}.hdf5"), 'r'
            )['keys'][:]
            id2feat = {}
            assert len(img_ids) == len(img_feats)
            for img_id, img_feat in zip(img_ids, img_feats):
                id2feat[img_id] = img_feat
            set2id2feat[voken_img_set] = id2feat
        voken_feats.append(set2id2feat[voken_img_set][voken_img_id])
    return voken_feats


================================================
FILE: vlm/model.py
================================================
import math

import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss
from torch import nn
from transformers import (
    BertConfig,
    BertForMaskedLM,
)

from transformers.modeling_bert import BertOnlyMLMHead


BertLayerNorm = torch.nn.LayerNorm


# The GLUE function is copied from huggingface transformers:
# https://github.com/huggingface/transformers/blob/c6acd246ec90857b70f449dcbcb1543f150821fc/src/transformers/activations.py
def _gelu_python(x):
    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


if torch.__version__ < "1.4.0":
    gelu = _gelu_python
else:
    gelu = F.gelu


class CoLBertConfig(BertConfig):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.voken_size = None
        self.voken_dim = None
        self.do_voken_cls = False
        self.do_voken_reg = False
        self.do_voken_ctr = False
        self.shared_head = False
        self.verbose = False


class BertSharedHead(BertOnlyMLMHead):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_ctr = config.do_voken_ctr

        assert int(self.do_voken_cls) + int(self.do_voken_ctr) == 1
        if self.do_voken_cls:
            self.visn_decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)

        if self.do_voken_ctr:
            self.visn_decoder = nn.Linear(config.voken_dim, config.hidden_size, bias=True)

    def forward(self, features, **kwargs):
        """
        :param features: [batch, length, dim]
        :return: lang_scores [batch, length, vocab_size],
                 visn_scores [batch, length, voken_size]
        """
        x = self.predictions.transform(features)    # batch_size, length, dim

        lang_scores = self.predictions.decoder(x) + self.predictions.bias

        if self.do_voken_cls:
            visn_scores = self.visn_decoder(x)
        elif self.do_voken_ctr:
            voken_feats = kwargs['voken_feats']
            y = self.visn_decoder(voken_feats)  # voken_size, dim
            visn_scores = torch.einsum('bik,jk->bij', x, y)
        else:
            assert False

        return lang_scores, visn_scores


class BertVLMClassificationHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
        # self.decoder = nn.Sequential(
        #     nn.Linear(config.hidden_size, 256, bias=True),
        #     nn.Linear(256, config.voken_size, bias=True),
        # )
        if config.verbose:
            print(f"VLM Classification Head: Build model with voken_size {config.voken_size}")

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder(x)

        return x


class BertVLMContrastiveHeadNew(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.joint_dim = 512
        print(f"Contrastive Head: Using joint dim {self.joint_dim}")
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, self.joint_dim)
        self.layer_norm_x = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)

        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)
        self.layer_norm_y = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm_x(x)

        # Process the pre-trained voken feats.
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 64]
        y = self.layer_norm_y(y)

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size

        return score


class BertVLMContrastiveHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.joint_dim = 64
        self.decoder_bert_output = nn.Linear(config.hidden_size, self.joint_dim, bias=False)
        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder_bert_output(x)                   # [b, l, f] --> [b, l, 64]

        # Process the pre-trained voken feats.
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 64]

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size

        return score


class BertVLMRegressionHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.decoder = nn.Linear(config.hidden_size, config.voken_dim, bias=True)

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)

        # project back to size of vocabulary with bias
        x = self.decoder(x)

        return x


class CoLwithBert(BertForMaskedLM):
    config_class = CoLBertConfig

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_reg = config.do_voken_reg
        self.do_voken_ctr = config.do_voken_ctr
        self.shared_head = config.shared_head
        self.verbose = config.verbose

        if self.verbose:
            print(f"Model: do voken cls -- {self.do_voken_cls}, do_voken_reg -- {self.do_voken_reg},"
                  f" do voken ctr -- {self.do_voken_ctr}")

        self.token_cls_loss_fct = CrossEntropyLoss()

        if self.shared_head:
            if self.verbose:
                print("Model: Using shared head for Voken and Token predictions.")
            self.cls = BertSharedHead(config)
            # Reinit the weight of the new head.
            self.init_weights()
        else:
            # Voken Classification
            if config.do_voken_cls:
                self.visual_cls_head = BertVLMClassificationHead(config)

            # Voken Regression
            if config.do_voken_reg:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_reg_head = BertVLMRegressionHead(config)

            # Voken Constrastive
            if config.do_voken_ctr:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_ctr_head = BertVLMContrastiveHeadNew(config)

        # Build voken features embeddings if needed.
        if self.do_voken_ctr or self.do_voken_reg:
            # The voken emb will be preloaded by func "init_voken_feat_emb"
            self.voken_feat_emb = nn.Embedding(
                config.voken_size,
                config.voken_dim
            )
            # Freeze this embedding
            for p in self.voken_feat_emb.parameters():
                p.requires_grad = False

        # Build Loss functions
        if config.do_voken_cls:
            # Voken Classification
            self.voken_cls_loss_fct = CrossEntropyLoss()
        if config.do_voken_reg:
            # Voken Regression
            self.voken_reg_loss_fct = SmoothL1Loss(reduction='none')
            # self.voken_reg_loss_fct = torch.nn.L1Loss(reduction='none')
        if config.do_voken_ctr:
            # Voken Constrastive
            self.voken_ctr_loss_fct = CrossEntropyLoss()

    def init_voken_feat_emb(self, feats):
        if self.verbose:
            print(f"Model: load the voken features with shape {feats.shape}")
            print("\tBefore Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
        assert feats.shape == (self.config.voken_size, self.config.voken_dim)
        self.voken_feat_emb.weight.data[:] = torch.Tensor(feats)
        self.original_voken_feats = torch.Tensor(feats).clone()
        self.original_voken_feats = self.original_voken_feats.half()
        if self.verbose:
            print("\tAfter Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
            print("\tThe 1st, 2nd, and last voken feats are: ")
            print("\t", self.voken_feat_emb.weight[0])
            print("\t", self.voken_feat_emb.weight[1])
            print("\t", self.voken_feat_emb.weight[-1])
        assert not self.voken_feat_emb.weight.requires_grad
        # print(self.voken_feat_emb.weight.dtype)
        # assert torch.all(torch.eq(self.voken_feat_emb.weight.cuda(),
        #                           self.original_voken_feats)), "The voken feats have been updated during training."

    def to(self, *args):
        if self.do_voken_ctr or self.do_voken_reg:
            self.original_voken_feats = self.original_voken_feats.to(*args)
        return super().to(*args)

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            masked_lm_labels=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            lm_labels=None,
            voken_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        sequence_output = outputs[0]

        if not self.shared_head:
            voken_loss = 0.
            if self.do_voken_cls:
                assert voken_labels is not None
                voken_scores = self.visual_cls_head(sequence_output)
                voken_cls_loss = self.voken_cls_loss_fct(voken_scores.view(-1, self.config.voken_size), voken_labels.view(-1))
                voken_loss += voken_cls_loss

            if self.do_voken_reg:
                assert voken_labels is not None
                voken_prediction = self.visual_reg_head(sequence_output)

                # Get the mask and pre-trained features
                voken_label_mask = (voken_labels != -100)               # Get a mask of [0, 1, 1, ...., 1, 0], [b, len]
                safe_voken_labels = voken_labels.clone()
                safe_voken_labels[~voken_label_mask] = 0
                voken_feats = self.voken_feat_emb(safe_voken_labels)         # [b, len] --> [b, len, f]

                # Loss
                voken_reg_loss = self.voken_reg_loss_fct(voken_prediction, voken_feats)   # [b, len, f]

                # [b, l, f] * ([b,l] --> [b, l, 1]) = [b, l, f]
                voken_reg_loss = (voken_reg_loss * voken_label_mask.float().unsqueeze(-1))

                # [b, l, f] --sum-> [b, l] --mean-> [1,]
                voken_reg_loss = voken_reg_loss.sum(-1).mean()

                voken_loss += voken_reg_loss

            if self.do_voken_ctr:
                assert torch.all(torch.eq(self.voken_feat_emb.weight,
                                          self.original_voken_feats)), "The voken feats have been updated during training."

                voken_scores = self.visual_ctr_head(
                    sequence_output, self.voken_feat_emb.weight
                )
                voken_ctr_loss = self.voken_ctr_loss_fct(
                    voken_scores.view(-1, self.config.voken_size),
                    voken_labels.view(-1)
                )
                voken_loss += voken_ctr_loss

            if masked_lm_labels is not None:
                prediction_scores = self.cls(sequence_output)
                token_loss = self.token_cls_loss_fct(
                    prediction_scores.view(-1, self.config.vocab_size),
                    masked_lm_labels.view(-1))
            else:
                token_loss = torch.tensor(0.)
        else:
            voken_loss, token_loss = self.calculate_shared_loss(
                sequence_output,
                masked_lm_labels,
                voken_labels,
            )

        return voken_loss, token_loss

    def calculate_shared_loss(self, sequence_output, masked_lm_labels, voken_labels):
        if self.do_voken_cls:
            lang_scores, visn_scores = self.cls(sequence_output)
        else:
            lang_scores, visn_scores = self.cls(
                sequence_output,
                voken_feats=self.voken_feat_emb.weight
            )

        assert voken_labels is not None

        voken_loss_func = self.voken_cls_loss_fct if self.do_voken_cls else self.voken_ctr_loss_fct
        voken_loss = voken_loss_func(
            visn_scores.view(-1, self.config.voken_size),
            voken_labels.view(-1)
        )

        if masked_lm_labels is not None:
            token_loss = self.token_cls_loss_fct(
                lang_scores.view(-1, self.config.vocab_size),
                masked_lm_labels.view(-1)
            )
        else:
            token_loss = torch.tensor(0.)

        return voken_loss, token_loss


class SimpleBertForMaskedLM(BertForMaskedLM):

    def __init__(self, config):
        super().__init__(config)

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            masked_lm_labels=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            lm_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        sequence_output = outputs[0]

        prediction_scores = self.cls(sequence_output)
        loss_fct = CrossEntropyLoss()
        token_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))

        return token_loss,


================================================
FILE: vlm/param.py
================================================
import argparse


def process_args():
    parser = argparse.ArgumentParser()

    # Datasets
    parser.add_argument(
        "--train_data_file", default=None, type=str,
        help="The input training data file (a text file).")
    parser.add_argument(
        "--eval_data_file", default=None, type=str,
        help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")

    # Data loader
    parser.add_argument("--col_data", action="store_true", help="Using the specific dataset object in data.py")
    parser.add_argument("--split_sent", action="store_true", help="Overwrite the cached training and evaluation sets")
    parser.add_argument("--shuffle", action="store_true", help="Shuffle the training dataset")
    parser.add_argument(
        "--block_size", default=-1, type=int,
        help="Optional input sequence length after tokenization."
             "The training dataset will be truncated in block of this size for training."
             "Default to the model max input length for single sentence inputs (take into account special tokens).",
    )

    # Logging and Saving
    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument(
        "--output_dir", type=str,
        help="The output directory where the model predictions and checkpoints will be written.",)
    parser.add_argument(
        "--overwrite_output_dir", action="store_true",
        help="Overwrite the content of the output directory")

    # Model types
    parser.add_argument(
        "--model_type", type=str, help="The model architecture to be trained or fine-tuned.",)
    parser.add_argument(
        "--should_continue", action="store_true", help="Whether to continue from latest checkpoint in output_dir")
    parser.add_argument(
        "--model_name_or_path", default=None, type=str,
        help="The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.",)
    parser.add_argument(
        "--config_name", default=None, type=str,
        help="Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.",)
    parser.add_argument(
        "--tokenizer_name", default=None, type=str,
        help="Optional pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.",)
    parser.add_argument(
        "--cache_dir", default=None, type=str,
        help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)",)
    parser.add_argument(
        "--overwrite_cache", action="store_true",
        help="Overwrite the cached training and evaluation sets")

    # MLM tasks
    parser.add_argument(
        "--mlm", action="store_true", help="Train with masked-language modeling loss instead of language modeling.")
    parser.add_argument(
        "--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss")
    parser.add_argument(
        "--mlm_ratio", type=float, default=1., help="The ratio of mlm loss in the total loss.")

    # VLM related params
    parser.add_argument("--voken_dir", type=str, default='snap1/coco_hinge05_dim64_resxt101_robertal4/vokens',
                        help='Where the vokens are saved')
    parser.add_argument("--voken_suffix", type=str, default='vg_nococo.10000',
                        help='The suffix after the voken file, e.g., en.train.raw.{suffix} where suffix==vgcoco.1000')
    parser.add_argument("--voken_labels", type=str, default='all',
                        help='all: Calculate voken loss for all tokens;'
                             'mask: Calculate voken loss for masked tokens.'
                             'nonmask: Calculate voken loss for non-masked tokens.')
    parser.add_argument("--voken_feat_dir", type=str, default=None,
                        help='Where the vokens are saved')
    parser.add_argument("--do_voken_cls", action='store_true', help='Will do voken classification task')
    parser.add_argument("--do_voken_reg", action='store_true', help='Will do voken regression task (not used in this paper)')
    parser.add_argument("--do_voken_ctr", action='store_true', help='Will do voken contrastive task (not used in this paper)')
    parser.add_argument("--shared_head", action='store_true', help='Share the head if more than one tasks (e.g., cls, reg, ctr) are used (not used in this paper)')

    # Batch Size and Training Steps
    parser.add_argument("--seed", type=int, default=95, help="random seed for initialization")
    parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, help="Batch size per GPU/CPU for evaluation.")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",)
    parser.add_argument("--num_train_epochs", default=1.0, type=float, help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1, type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",)

    # Optimizer
    parser.add_argument("--lamb", action="store_true", help='Use the LAMB optimizer in apex')
    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
    parser.add_argument("--warmup_ratio", default=0., type=float, help="Linear warmup over warmup_steps.")
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")

    # Distributed Training
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--nodes", type=int, default=1)
    parser.add_argument("--nr", type=int, default=0)

    # Half Precision
    parser.add_argument(
        "--fp16", action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",)
    parser.add_argument(
        "--fp16_opt_level", type=str, default="O1",
        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
             "See details at https://nvidia.github.io/apex/amp.html",)

    # Ablation Study
    parser.add_argument("--voken_ablation", default=None,
                        help="random, shuffle, reverse, token")


    args = parser.parse_args()
    return args


================================================
FILE: vlm/run_glue.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa)."""


import argparse
import glob
import json
import logging
import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from transformers import (
    MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    glue_compute_metrics as compute_metrics,
    glue_convert_examples_to_features as convert_examples_to_features,
    glue_output_modes as output_modes,
    glue_processors as processors,
)
# from transformers import glue_compute_metrics as compute_metrics
# from transformers import glue_convert_examples_to_features as convert_examples_to_features
# from transformers import glue_output_modes as output_modes
# from transformers import glue_processors as processors


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)

#MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys())
#MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

#ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def train(args, train_dataset, model, tokenizer):
    """ Train the model """
    # if args.local_rank in [-1, 0]:
    #    tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    num_warmup_steps = int(t_total * args.warmup_steps)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    #if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
        #os.path.join(args.model_name_or_path, "scheduler.pt")
    #):
        ## Load in optimizer and scheduler states
        #optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        #scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    #if os.path.exists(args.model_name_or_path):
        # set global_step to global_step of last saved checkpoint from model path
        #try:
            #global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
        #except ValueError:
            #global_step = 0
        #epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
        #steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

        #logger.info("  Continuing training from checkpoint, will skip to saved global_step")
        #logger.info("  Continuing training from epoch %d", epochs_trained)
        #logger.info("  Continuing training from global step %d", global_step)
        #logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)

    tr_loss, logging_loss = 0.0, 0.0
    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
    )
    set_seed(args)  # Added here for reproductibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            model.train()
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
            if args.model_type != "distilbert":
                inputs["token_type_ids"] = (
                    batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
                )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0 or (
                # last step in epoch but step is always smaller than gradient_accumulation_steps
                len(epoch_iterator) <= args.gradient_accumulation_steps
                and (step + 1) == len(epoch_iterator)
            ):
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    logs = {}
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            eval_key = "eval_{}".format(key)
                            logs[eval_key] = value

                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
                    learning_rate_scalar = scheduler.get_lr()[0]
                    logs["learning_rate"] = learning_rate_scalar
                    logs["loss"] = loss_scalar
                    logging_loss = tr_loss

                    #for key, value in logs.items():
                        #tb_writer.add_scalar(key, value, global_step)
                    print(json.dumps({**logs, **{"step": global_step}}))

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    #if args.local_rank in [-1, 0]:
        #tb_writer.close()

    return global_step, tr_loss / global_step


def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
                if args.model_type != "distilbert":
                    inputs["token_type_ids"] = (
                        batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
                    )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        print(eval_output_dir, prefix)
        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        "cached_{}_{}_{}_{}".format(
            "dev" if evaluate else "train",
            #list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.tokenizer_name,
            str(args.max_seq_length),
            str(task),
        ),
    )
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = (
            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        )
        features = convert_examples_to_features(
            examples,
            tokenizer,
            label_list=label_list,
            max_length=args.max_seq_length,
            output_mode=output_mode,
            # pad_on_left=bool(args.model_type in ["xlnet"]),  # pad on the left for xlnet
            # pad_token=tokenizer.pad_token_id,
            # pad_token_segment_id=tokenizer.pad_token_type_id,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)
    for i in range(3):
        print('ids:', features[i].input_ids)
        print('tokens:', tokenizer.convert_ids_to_tokens(features[i].input_ids))
        print('att:', features[i].attention_mask)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset


def main():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
    )
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        #help="Model type selected in the list: " + ", ".join(MODEL_TYPES),
    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        #help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
    )
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    # Other parameters
    parser.add_argument(
        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
    )
    parser.add_argument(
        "--tokenizer_name",
        default="",
        type=str,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--cache_dir",
        default="",
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.",
    )
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument(
        "--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
    )
    parser.add_argument(
        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
    )

    parser.add_argument(
        "--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
    )
    parser.add_argument(
        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps", default=0, type=float, help="Linear warmup over warmup_steps.")

    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--eval_all_checkpoints",
        action="store_true",
        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
    )
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument("--from_scratch", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument(
        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
    )
    parser.add_argument(
        "--nopooler", action="store_true", help="Do not load the pooler",
    )
    parser.add_argument("--seed", type=int, default=9595, help="random seed for initialization")

    parser.add_argument(
        "--fp16",
        action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
    )
    parser.add_argument(
        "--fp16_opt_level",
        type=str,
        default="O1",
        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
        "See details at https://nvidia.github.io/apex/amp.html",
    )
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
    parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
    args = parser.parse_args()

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup distant debugging if needed
    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd

        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    # Setup CUDA, GPU & distributed training
    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        torch.distributed.init_process_group(backend="nccl")
        args.n_gpu = 1
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    # Prepare GLUE task
    args.task_name = args.task_name.lower()
    if args.task_name not in processors:
        raise ValueError("Task not found: %s" % (args.task_name))
    processor = processors[args.task_name]()
    args.output_mode = output_modes[args.task_name]
    label_list = processor.get_labels()
    num_labels = len(label_list)

    # Load pretrained model and tokenizer
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    args.model_type = args.model_type.lower()
    config = AutoConfig.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )

    if args.nopooler:
        model.bert.pooler.apply(model._init_weights)

    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        # Create output directory if needed
        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(args.output_dir)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
            prefix = prefix if 'checkpoint' in prefix else ''

            model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results


if __name__ == "__main__":
    main()


================================================
FILE: vlm/run_glue_epochs.py
================================================
import argparse
import math
import os
from pathlib import Path
from pprint import pprint
import subprocess
import threading
import time

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--load", default=None, type=str,
    help="The model loaded, e.g., snap/vlm/wiki103_small"
)
parser.add_argument(
    "--gpus", default=None, type=str,
    help="The list of GPU ids, separated by comma, e.g., '2,3'"
)
parser.add_argument(
    "--snaps", default=1, type=int,
    help="The number of snaps evaluated with GLUE benchmark. "
         "-1 means all."
)
parser.add_argument(
    "--start-from", default=0, type=int
)
args = parser.parse_args()

if args.gpus is None:
    # Get all gpus available in this server.
    num_gpus = torch.cuda.device_count()
    # The device id are labeled from 0 to num_gpus-1.
    available_gpus = list(range(num_gpus))
else:
    available_gpus = [int(gpu_id) for gpu_id in args.gpus.split(",")]
    num_gpus = len(available_gpus)

resource = threading.Semaphore(num_gpus)


def get_snap_paths(load):
    load_path = Path(load)
    paths = []
    for dir_path in load_path.iterdir():
        if dir_path.name.startswith("checkpoint-"):
            paths.append(dir_path)
    return paths


def sorted_paths(paths):
    pathXkey = []
    for path in paths:
        name = path.name
        identifier = name[len("checkpoint-"):]
        if identifier == 'last':
            continue
        if 'epoch' in identifier:
            key = identifier
        else:
            key = int(identifier)
        pathXkey.append((path, key))
    pathXkey = sorted(pathXkey, key=lambda x: x[1])
    paths = list(map(lambda x: x[0], pathXkey))
    return paths


def get_test_paths(paths, snaps):
    """
    Return $snaps paths to be tested on GLUE
    """
    if snaps == -1:
        return paths
    interval = len(paths) * 1. / snaps
    test_paths = []
    for i in range(1, snaps+1):
        idx = int(math.ceil(interval * i)) - 1
        test_paths.append(paths[idx])
    return test_paths


# Get all paths needs to be processed
paths = get_snap_paths(args.load)
paths = sorted_paths(paths)
paths = paths[args.start_from:]
paths = get_test_paths(paths, args.snaps)
paths = paths[::-1]         # Run the last epochs first.
path_lock = threading.Lock()


def run_glue():
    while True:
        # Only have one atomic operation (list.pop) here, do not need lock.
        # A Semaphore is enough to control the resources.
        resource.acquire()
        gpu_id = available_gpus.pop(0)

        # Involve multiple atomic operations (list.__len__, list.pop),
        # thus introduce a lock here.
        path_lock.acquire()
        if len(paths) > 0:
            path = paths.pop(0)
        else:
            path_lock.release()
            break
        path_lock.release()

        model = path.parent
        ckpt = path.name
        print(gpu_id, model, ckpt)
        process = subprocess.Popen(
            ['bash',
             'scripts/run_glue_at_epoch.bash',
             str(gpu_id),    # Use GPU
             '3',            # Number of epochs
             model,
             ckpt
             ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()

        available_gpus.append(gpu_id)
        resource.release()

        # Sleep here allows the script (run_glue_at_epoch.bash) to finish
        # thus all memory in GPU will be cleared.
        time.sleep(5)
    return


# Allocate #threads which equals to #GPUs
threads = []
for _ in range(num_gpus):
    threads.append(
        threading.Thread(target=run_glue)
    )
for thread in threads:
    thread.start()

# Join to the main thread, thus the main thread will wait for all the threads.
for thread in threads:
    thread.join()


================================================
FILE: vlm/run_lm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""


import argparse
import glob
import json
import logging
import os
import pickle
import random
import re
import shutil
import sys
from typing import Dict, List, Tuple
from datetime import datetime

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
    WEIGHTS_NAME,
    AdamW,
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    CamembertConfig,
    CamembertForMaskedLM,
    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForMaskedLM,
    DistilBertTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

sys.path.append(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
from vlm.data import CoLDataset
from vlm.param import process_args
from vlm.model import SimpleBertForMaskedLM


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)


MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (BertConfig, SimpleBertForMaskedLM, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}


class TextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
        assert os.path.isfile(file_path)

        block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)

        directory, filename = os.path.split(file_path)
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            with open(file_path, encoding="utf-8") as f:
                text = f.read()

            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

            for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
                self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))
            # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
            # If your dataset is small, first you should loook for a bigger one :-) and second you
            # can change this behavior by adding (model specific) padding.

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)


class LineByLineTextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)


def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.col_data:
        return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
                          split_sent=args.split_sent,
                          verbose=(args.gpu == 0))
    elif args.line_by_line:
        return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)


def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:
    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """

    if tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = inputs.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
    probability_matrix = torch.full(labels.shape, args.mlm_probability)
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    if tokenizer._pad_token is not None:
        padding_mask = labels.eq(tokenizer.pad_token_id)
        probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels


def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    set_seed(args)  # Added here for reproducibility

    """ Train the model """
    if args.gpu == 0:
        current_time = datetime.now().strftime('%b%d_%H-%M-%S')
        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)

    args.train_batch_size = args.per_gpu_train_batch_size

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    if args.shuffle:
        logger.info(f"Shuffle the dataset in training,"
                       f"GPU: {args.gpu},"
                       f"Rank: {args.rank},"
                       f"Total: {args.world_size}")
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=args.rank,
        shuffle=args.shuffle,
    )
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
        batch_size=args.train_batch_size, collate_fn=collate, pin_memory=True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters,
                      # betas=(0.9, 0.98),
                      lr=args.learning_rate,
                      eps=args.adam_epsilon)
    if args.warmup_ratio > 0.:
        assert args.warmup_steps == 0
        args.warmup_steps = int(t_total * args.warmup_ratio)
    if args.gpu == 0:
        print("Optimized with lr %f, steps %d, warmup steps %d, and use beta, epsilon %0.8f." % (
            args.learning_rate, t_total, args.warmup_steps, optimizer.defaults['eps']
        ), optimizer.defaults['betas'])
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level,
                                          verbosity=0)
        from apex.parallel import DistributedDataParallel as DDP
        model = DDP(model)
    else:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.gpu], find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * args.world_size
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    # Check if continuing training from a checkpoint
    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):
    #     try:
    #         # set global_step to gobal_step of last saved checkpoint from model path
    #         checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
    #         logger.info("  Continuing training from checkpoint, will skip to saved global_step")
    #         logger.info("  Continuing training from epoch %d", epochs_trained)
    #     except ValueError:
    #         logger.info("  Do not load model from %s, restart training" % args.model_name_or_path)

    # model_to_resize = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    # model_to_resize.resize_token_embeddings(len(tokenizer))

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
    )
    for epoch in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
        tr_loss, logging_loss = 0.0, 0.0
        model.zero_grad()       # Support of accumulating gradients
        for step, batch in enumerate(epoch_iterator):
            inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            # If some of the input is padded, then the attention mask is needed
            attention_mask = (inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0
            if attention_mask.all():
                attention_mask = None

            if epoch == 0 and step < 3 and args.gpu == 0:
                print(inputs.shape)
                print(inputs[0])
                print(tokenizer.convert_ids_to_tokens(inputs[0].cpu().numpy()))
                print(labels[0])
                print(attention_mask)

            model.train()
            outputs = model(inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.max_grad_norm > 0.:
                    if args.fp16:
                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                    else:
                        total_norm =torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
                    # Log metrics
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    if args.fp16:
                        try:
                            from apex.amp import _amp_state
                            tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
                            tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
                        except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state, "
                                           "would not state the loss_scale in the log")
                    if args.max_grad_norm > 0.:  # Only clip the grad when it is valid
                        tb_writer.add_scalar("grad_norm", total_norm, global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

            if args.max_steps > 0 and global_step >= args.max_steps:
                break

        # Save it each epoch
        if args.gpu == 0:
            # Save checkpoints
            checkpoint_name = "checkpoint-epoch%04d" % epoch
            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)
            last_path = os.path.join(args.output_dir, 'checkpoint-last')
            # if os.path.exists(last_path):
            #     print(last_path)
            #     os.remove(last_path)
            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)

            # Evaluate the model
            logger.info(" Training loss of Epoch %d: %0.4f" % (epoch, tr_loss / step))
            logger.info(" Evaluation Results of Epoch %d: " % epoch)
            results = evaluate(args, model, tokenizer)
            for key, value in results.items():
                tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                logger.info("\t %s: %0.4f" % (key, value))
            output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
            json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)

        if args.max_steps > 0 and global_step >= args.max_steps:
            epoch_iterator.close()
            train_iterator.close()
            break

    if args.gpu == 0:
        tb_writer.close()


def save_model(args, name, model, tokenizer, optimizer, scheduler):
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, name)
    os.makedirs(output_dir, exist_ok=True)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)

    # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
    # logger.info("Saving optimizer and scheduler states to %s", output_dir)


def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)

    args.eval_batch_size = args.per_gpu_eval_batch_size
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate
    )

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)
        # If some of the input is padded, then the attention mask is needed
        attention_mask = (inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0
        if attention_mask.all():
            attention_mask = None

        with torch.no_grad():
            outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss)).item()

    result = {"perplexity": perplexity}

    return result


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    args = process_args()
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Using all available gpus for multi-processing distributed
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
    mp.spawn(setup, nprocs=args.gpus, args=(args,))


def setup(gpu, args):
    if args.should_continue:
        args.model_name_or_path = 'checkpoint-last'

    # Setup CUDA, GPU & distributed training
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
    args.gpu = gpu                                  # Local device id.
    args.device = device                            # Local device object.
    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.
    torch.distributed.init_process_group(
        backend="nccl",
        init_method='env://',
        world_size=args.world_size,
        rank=args.rank
    )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.gpu == 0 else logging.WARN,
    )
    logger.warning(
        "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
        args.gpu, args.gpus, args.fp16,
    )

    # Set seed
    set_seed(args)

    # Load pretrained model and token
    # Barrier to make sure only the first process in distributed training
    # download model & vocabizer
    if gpu != 0:
        torch.distributed.barrier()

    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    # Get Config
    if args.config_name:
        config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    elif args.model_name_or_path:
        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "Why do you want the default config?? Please use --config_name or --model_name_or_path"
        )

    # Get Tokenizer
    if args.tokenizer_name:
        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
        # BERT always needs lower cased tokens.
        assert tokenizer.init_kwargs.get("do_lower_case", False)
    elif args.model_name_or_path:
        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it,"
            "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
        )

    assert args.block_size <= tokenizer.max_len

    if args.model_name_or_path:
        model = model_class.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
            cache_dir=args.cache_dir,
        )
    else:
        logger.info("Training new model from scratch")
        model = model_class(config=config)

    model.to(args.device)

    # End of barrier to make sure only the first process waiting other processes
    if gpu == 0:
        torch.distributed.barrier()

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        # Barrier to make sure only the first process in distributed training process the dataset,
        # and the others will use the cache
        if gpu != 0:
            torch.distributed.barrier()
        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
        if gpu == 0:
            torch.distributed.barrier()

        train(args, train_dataset, model, tokenizer)

    # Evaluation
    if args.do_eval and gpu == 0:
        result = evaluate(args, model, tokenizer)


if __name__ == "__main__":
    main()


================================================
FILE: vlm/run_vlm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

from datetime import datetime
import json
import logging
import os
import random
import sys
import time
from typing import Dict, List, Tuple

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    CamembertConfig,
    CamembertForMaskedLM,
    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForMaskedLM,
    DistilBertTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from vlm.data import CoLDataset, get_voken_feats
from vlm.param import process_args
from vlm.model import CoLBertConfig, CoLwithBert


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)


MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (CoLBertConfig, CoLwithBert, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}


def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
                      split_sent=args.split_sent, voken_dir=args.voken_dir,
                      suffix=args.voken_suffix,
                      verbose=(args.gpu == 0),
                      voken_ablation=args.voken_ablation)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)


def mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: PreTrainedTokenizer, args) \
        -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """ Notice that this function would have a side affect of manipulating the Tensor tokens.
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """

    if tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = tokens.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
    probability_matrix = torch.full(labels.shape, args.mlm_probability)
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    if tokenizer._pad_token is not None:
        padding_mask = labels.eq(tokenizer.pad_token_id)
        probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    if args.voken_labels == 'mask':
        vokens[~masked_indices] = -100
    elif args.voken_labels == 'nonmask':
        vokens[masked_indices] = -100
    elif args.voken_labels == 'all':
        pass
    else:
        assert "Do not support the voken loss of type %s" % args.voken_labels

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    tokens[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    tokens[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return tokens, labels, vokens


def train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset,
          model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    set_seed(args)  # Added here for reproducibility

    """ Train the model """
    if args.gpu == 0:
        current_time = datetime.now().strftime('%b%d_%H-%M-%S')
        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)

    args.train_batch_size = args.per_gpu_train_batch_size

    def col_collate(examples):
        tokens, vokens = zip(*examples)
        if tokenizer._pad_token is None:
            tokens = pad_sequence(tokens, batch_first=True)
        else:
            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
        return tokens, vokens

    if args.shuffle:
        logger.info(f"Shuffle the dataset in training,"
                       f"GPU: {args.gpu},"
                       f"Rank: {args.rank},"
                       f"Total: {args.world_size}")
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=args.rank,
        shuffle=args.shuffle,
    )
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
        batch_size=args.train_batch_size, collate_fn=col_collate, pin_memory=True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
        # args.num_train_epochs = 9595
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    if args.lamb:
        no_decay = ['bias', 'gamma', 'beta', 'LayerNorm']
    else:
        no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    if args.lamb:
        logger.info(f"Using LAMB Optimizer with max grad norm {args.max_grad_norm}")
        import apex
        optimizer = apex.optimizers.FusedLAMB(
            optimizer_grouped_parameters,
            lr=args.learning_rate,
            eps=args.adam_epsilon,
            max_grad_norm=args.max_grad_norm
        )
    else:
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=args.learning_rate,
                          #betas=(0.9, 0.98),
                          eps=args.adam_epsilon)
    if args.gpu == 0:
        print(f"Optimized with lr: {optimizer.defaults['lr']}, total steps: {t_total},"
              f" warmup steps: {args.warmup_steps}, epsilon {optimizer.defaults['eps']},"
              f" beta: {optimizer.defaults['betas']}, weight decay {args.weight_decay}.")
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
        from apex.parallel import DistributedDataParallel as DDP
        model = DDP(model)
    else:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.gpu], find_unused_parameters=True
        )

    # Allow not calculating the lm heads.
    if args.mlm_ratio == 0.:
        model.lm_head = None


    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * args.world_size
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    # Check if continuing training from a checkpoint
    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):
    #     try:
    #         # set global_step to gobal_step of last saved checkpoint from model path
    #         checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
    #         logger.info("  Continuing training from checkpoint, will skip to saved global_step")
    #         logger.info("  Continuing training from epoch %d", epochs_trained)
    #     except ValueError:
    #         logger.info("  Do not load model from %s, restart training" % args.model_name_or_path)

    model_to_resize = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    assert model_to_resize.config.vocab_size == len(tokenizer)
    # model_to_resize.resize_token_embeddings(len(tokenizer))

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
    )
    set_seed(args)  # Added here for reproducibility
    LOSS_NAMES = ['token_loss', 'voken_loss', 'total_loss']
    for epoch in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
        tr_loss, logging_loss = np.zeros(len(LOSS_NAMES)), 0.0
        model.zero_grad()
        for step, (tokens, vokens) in enumerate(epoch_iterator):
            token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
            token_inputs = token_inputs.to(args.device)
            token_labels = token_labels.to(args.device) if args.mlm_ratio != 0. else None
            voken_labels = voken_labels.to(args.device)
            # If some of the input is padded, then the attention mask is needed
            attention_mask = (token_inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0
            if attention_mask.all():
                attention_mask = None

            if epoch == 0 and step < 3 and args.gpu == 0:
                print()
                print("Token inputs:", token_inputs.shape, token_inputs[0])
                print("Token inputs (in str): ", tokenizer.convert_ids_to_tokens(token_inputs[0].cpu().numpy()))
                print("Attention Mask:", attention_mask)
                print("Token Labels: ", token_labels[0] if token_labels is not None else token_labels)
                print("Token Labels (in str): ", tokenizer.convert_ids_to_tokens(token_labels[0].cpu().numpy()) if token_labels is not None else token_labels)
                print("Voken Labels: ", voken_labels[0])
                print()

            model.train()
            outputs = model(token_inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=token_labels,
                            voken_labels=voken_labels)
            voken_loss = outputs[0]
            token_loss = outputs[1]

            if args.mlm_ratio == 0.:
                loss = voken_loss
            else:
                loss = voken_loss + args.mlm_ratio * token_loss

            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            # print(f"GPU: {args.gpu}, Global Step: {global_step + 1}, "
            #       f"Step: {step}, "
            #       f"Range: {train_dataset.get_item_info(step * args.world_size + args.gpu)}, "
            #       f"Loss: {loss.item()}, "
            #       f"Scaled Loss: {scaled_loss.item()}")

            tr_loss += np.array((token_loss.item() / args.gradient_accumulation_steps,
                                 voken_loss.item() / args.gradient_accumulation_steps,
                                 loss.item()))

            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.max_grad_norm > 0. and not args.lamb:
                    # Only clip the grad when it is valid and not using LAMB optimizer,
                    # because the LAMB optimizer already apply grad clipping
                    if args.fp16:
                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                    else:
                        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                elif args.max_grad_norm <= 0. and step <= args.gradient_accumulation_steps:
                    logger.warning("Have not clipped the gradient because "
                                   "the max_grad_norm is set to %0.2f" % args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
                    # Log metrics
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    if args.fp16:
                        try:
                            from apex.amp import _amp_state
                            tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
                            tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
                        except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state, "
                                           "would not state the loss_scale in the log")
                    if args.max_grad_norm > 0. and not args.lamb:  # Only clip the grad when it is valid
                        tb_writer.add_scalar("grad_norm", total_norm, global_step)
                    interval_loss = (tr_loss - logging_loss) / args.logging_steps
                    for loss_idx, loss_name in enumerate(LOSS_NAMES):
                        tb_writer.add_scalar(loss_name, interval_loss[loss_idx], global_step)
                    logging_loss = tr_loss.copy()

            if args.max_steps > 0 and global_step >= args.max_steps:
                break

            # if step == 200:
            #     break
            #
        # Save it each epoch
        if args.gpu == 0:
            # Save checkpoints
            checkpoint_name = "checkpoint-epoch%04d" % epoch
            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)

            # last_path = os.path.join(args.output_dir, 'checkpoint-last')
            # if os.path.exists(last_path):
            #     os.remove(last_path)
            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)

            # Evaluate the model
            for loss_idx, loss_name in enumerate(LOSS_NAMES):
                logger.info(" Training %s of Epoch %d: %0.4f" % (
                    loss_name, epoch, tr_loss[loss_idx] / len(train_dataloader)))

            if args.do_eval:
                logger.info(" Evaluation Results of Epoch %d: " % epoch)
                old_eval_batch_size = args.per_gpu_eval_batch_size
                while args.per_gpu_eval_batch_size > 0:
                    try:
                        results = evaluate(args, valid_dataset, model, tokenizer)
                        break
                    except RuntimeError as e:
                        args.per_gpu_eval_batch_size = int(args.per_gpu_eval_batch_size / 2)
                        print("HALVE THE BATCH SIZE in EVAL.")
                        if args.per_gpu_eval_batch_size == 0:
                            raise e
                        time.sleep(5)
                args.per_gpu_eval_batch_size = old_eval_batch_size

                for key, value in results.items():
                    tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    logger.info("\t %s: %0.4f" % (key, value))
                tb_writer.add_scalar("epoch", epoch, global_step)
                output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
                json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)
            # Currently, only GPU 0 is responsible for the evaluation.
            # torch.cuda.empty_cache()
            # torch.distributed.barrier()
        else:
            pass
            # torch.cuda.empty_cache()
            # torch.distributed.barrier()

        if args.max_steps > 0 and global_step >= args.max_steps:
            epoch_iterator.close()
            train_iterator.close()
            break

    if args.gpu == 0:
        tb_writer.close()


def save_model(args, name, model, tokenizer, optimizer, scheduler):
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, name)
    os.makedirs(output_dir, exist_ok=True)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)

    # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
    # logger.info("Saving optimizer and scheduler states to %s", output_dir)


def evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
    torch.cuda.empty_cache() 
    # # Loop to handle MNLI double evaluation (matched, mis-matched)
    # eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)

    args.eval_batch_size = args.per_gpu_eval_batch_size
    # Note that DistributedSampler samples randomly

    def col_collate(examples):
        tokens, vokens = zip(*examples)
        if tokenizer._pad_token is None:
            tokens = pad_sequence(tokens, batch_first=True)
        else:
            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
        return tokens, vokens

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=col_collate
    )

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    total_token_loss = 0.0
    total_voken_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for tokens, vokens in tqdm(eval_dataloader, desc="Evaluating"):
        token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
        token_inputs = token_inputs.to(args.device)
        token_labels = token_labels.to(args.device) if args.mlm_ratio != 0 else None
        voken_labels = voken_labels.to(args.device)
        # If some of the input is padded, then the attention mask is needed
        attention_mask = (token_inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0
        if attention_mask.all():
            attention_mask = None

        with torch.no_grad():
            outputs = model(token_inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=token_labels,
                            voken_labels=voken_labels)
            voken_loss = outputs[0]
            token_loss = outputs[1]

            total_voken_loss += voken_loss.item()
            total_token_loss += token_loss.item()

        nb_eval_steps += 1

    total_token_loss = total_token_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(total_token_loss)).item()

    result = {"perplexity": perplexity,
              "voken_loss": total_voken_loss / nb_eval_steps}
    torch.cuda.empty_cache() 

    return result


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    args = process_args()
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Using all available gpus for multi-processing distributed
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
    mp.spawn(setup, nprocs=args.gpus, args=(args,))


def setup(gpu, args):
    if args.should_continue:
        args.model_name_or_path = 'checkpoint-last'

    # Setup CUDA, GPU & distributed training
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
    args.gpu = gpu                                  # Local device id.
    args.device = device                            # Local device object.
    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.
    torch.distributed.init_process_group(
        backend="nccl",
        init_method='env://',
        world_size=args.world_size,
        rank=args.rank
    )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.gpu == 0 else logging.WARN,
    )
    logger.warning(
        "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
        args.gpu, args.gpus, args.fp16,
    )

    # Set seed
    set_seed(args)

    # Load pretrained model and token
    # Barrier to make sure only the first process in distributed training
    # download model & vocabizer
    if gpu != 0:
        torch.distributed.barrier()

    # Use self-defined models, thus avoiding Auto***.
    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    # Next, we will initialize the training process in the following order:
    #   1. tokenizer --> 2. dataset --> 3. config --> 4. model.
    # because A) dataset relies on the tokenizer.special_tokens.
    #         B) config relies on the dataset.voken_size.

    # Get Tokenizer
    if args.tokenizer_name:
        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    elif args.model_name_or_path:
        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it,"
            "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
        )

    assert args.block_size <= tokenizer.max_len

    # Barrier to make sure only the first process in distributed training process the dataset,
    # and the others will use the cache
    if gpu != 0:
        torch.distributed.barrier()
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
    valid_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
    if gpu == 0:
        torch.distributed.barrier()

    # Assert the vokens are equal in valid and eval.
    valid_dataset.assert_equal_vokens(train_dataset)

    config_kwargs = {}
    if args.do_voken_reg or args.do_voken_ctr:
        assert args.voken_feat_dir is not None
        voken_feats = get_voken_feats(train_dataset, args.voken_feat_dir)
        config_kwargs['voken_dim'] = len(voken_feats[0])
        if gpu == 0:
            logger.info(f"Load voken feats from {args.voken_feat_dir}"
                        f"with {len(voken_feats)} features and dimension {len(voken_feats[0])}")

    # Get Config
    if args.config_name:
        config = config_class.from_pretrained(
            args.config_name,
            cache_dir=args.cache_dir,
            voken_size=train_dataset.voken_size,
            do_voken_cls=args.do_voken_cls,
            do_voken_reg=args.do_voken_reg,
            do_voken_ctr=args.do_voken_ctr,
            shared_head=args.shared_head,
            verbose=(args.gpu == 0),
            **config_kwargs
        )
    elif args.model_name_or_path:
        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "Why do you want the default config?? Please use --config_name or --model_name_or_path"
        )

    if args.model_name_or_path:
        logger.info(f"Training model from the weight {args.model_name_or_path}.")
        model = model_class.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
            cache_dir=args.cache_dir,
        )
    else:
        logger.info("Training new model from scratch")
        model = model_class(config=config)

    if args.do_voken_reg or args.do_voken_ctr:
        voken_feats = torch.tensor(voken_feats)
        model.init_voken_feat_emb(voken_feats)

    model.to(args.device)

    # End of barrier to make sure only the first process waiting other processes
    if gpu == 0:
        torch.distributed.barrier()

    if args.model_name_or_path:
        if gpu == 0:
            logger.info("Evaluate the performance of the loaded model.")
            results = evaluate(args, valid_dataset, model, tokenizer)
            for key, value in results.items():
                logger.info("\t %s: %0.4f" % (key, value))
            torch.distributed.barrier()
        else:
            torch.distributed.barrier()

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train(args, train_dataset, valid_dataset, model, tokenizer)

    # Evaluation
    if args.do_eval and gpu == 0:
        results = evaluate(args, valid_dataset, model, tokenizer)
        for key, value in results.items():
            logger.info("\t %s: %0.4f" % (key, value))


if __name__ == "__main__":
    main()


================================================
FILE: vlm/show_glue_results_epochs.py
================================================
import os
from pathlib import Path

root = Path(
    'snap'
)

task2major = {
    'QQP': 'acc_and_f1',
    'STS-B': 'corr',
    'MRPC': 'acc_and_f1',
}

# The tasks sorted by the amount of data
all_tasks = [
    # 'WNLI',
    'RTE',
    'MRPC',
    'STS-B',
    'CoLA',
    'SST-2',
    'QNLI',
    'QQP',
    'MNLI',
    'MNLI-MM',
]


def print_result(glue_dir):
    print(glue_dir)
    results = {}
    for task in glue_dir.iterdir():
        if task.is_dir():
            eval_fpath = task / 'eval_results.txt'
            task_name = task.name
            if eval_fpath.exists():
                with eval_fpath.open() as f:
                    for line in f:
                        metric, value = line.split('=')
                        metric = metric.strip()
                        value = float(value.strip())
                        if task_name in task2major:
                            if metric == task2major[task_name]:
                                results[task_name] = value
                        else:
                            results[task_name] = value
    if len(results) > 0:
        # sorted_keys = sorted(list(results.keys()))
        # for key in sorted_keys:
        #     print("%8s" % key, end='')
        # print("%8s" % 'GLUE', end='')
        # print()
        # for key in sorted_keys:
        #     print("%8.2f" % (results[key] * 100.), end='')
        # print("%8.2f" % (sum(results.values()) * 100. / len(results)), end='')
        # print()
        for task in all_tasks:
            print("%8s" % task, end='')
        print("%8s" % 'GLUE', end='')
        print()
        for task in all_tasks:
            if task in results:
                result = results[task]
                print("%8.2f" % (result * 100), end='')
            else:
                print(" " * 8, end='')
        mean = lambda x: sum(x) / max(len(x), 1)
        avg_result = mean([value for key, value in results.items() if key in all_tasks])
        print("%8.2f" % (avg_result * 100.), end='')
        print()


def search(path):
    def sorted_key(path):
        try:
            return path.stat().st_mtime
        except Exception:
            return 0.
    path_list = sorted(
        path.iterdir(),
        key=sorted_key
        # x.name
    )
    for subdir in path_list:
        if subdir.is_dir():
            if 'glueepoch_' in subdir.name:
                print_result(subdir)
            else:
                search(subdir)

search(root)


================================================
FILE: vokenization/__init__.py
================================================


================================================
FILE: vokenization/common.py
================================================
import os

# Name of image sets
IMAGE_SETS = [
    'coco_train',
    'coco_nominival',
    'coco_minival',
    'vg_nococo',
    'cc_train',
    'cc_valid',
]

# Root of each dataset
# CC_ROOT, COCO_ROOT, VG_ROOT should contain the `images` folder
# CC_ROOT -- images
#              |-- training
#                      |-- training_00009486    # Jpeg files but does not have the extension.
#                      |-- ....
#              |-- validation
#                      |-- validation_00009486
#                      |-- ...
# CC_ROOT = os.getenv('CC_ROOT', 'data/cc')
# COCO_ROOT = os.getenv('COCO_ROOT', 'data/mscoco')
# VG_ROOT = os.getenv('VG_ROOT', 'data/vg')
# LXRT_ROOT = os.getenv('LXRT_ROOT', 'data/lxrt')
CC_ROOT = 'data/cc'
COCO_ROOT = 'data/mscoco'
VG_ROOT = 'data/vg'
LXRT_ROOT = 'data/lxmert'

# THe local directory to save essential image infos
#       (e.g., image ids for the vokenizer, image paths in this server)
# LOCAL_DIR
#   |- images
#         |- coco_train_ids.txt
#         |- coco_train_paths.txt
#         |- cc_train_ids.txt
#         |- cc_train_paths.txt
#         |- ..............
# Running create_image_ids.py will build *_ids.txt
# Running extract_vision_keys.py will build *_paths.txt
LOCAL_DIR = 'data/vokenization'


================================================
FILE: vokenization/create_image_ids.py
================================================
import json
import os
from pathlib import Path
import sys

# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import common

imgset2lxrtfname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
}

imgset2ccfname = {
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv'
}


def write_ids(img_set, img_ids):
    """
    Write the indexed image ids 'img_ids' for image set 'img_set' to
    the local file.
    """
    info_dir = os.path.join(common.LOCAL_DIR, 'images')
    os.makedirs(info_dir, exist_ok=True)
    print("Write %d image ids for image set %s to %s." % (
        len(img_ids), img_set, os.path.join(info_dir, img_set + '.ids')))
    ids_path = os.path.join(info_dir, img_set + '.ids')
    if os.path.exists(ids_path):
        # If there is an existing ids_path, make sure that they are the same.
        print(f"Already exist the image ids for image set {img_set} at path {ids_path}.")
        print("Now, we want to make sure that they are equal:")
        with open(ids_path, 'r') as f:
            exist_img_ids = list(map(lambda x: x.strip(), f.readlines()))
        success = True
        for i, (exist_img_id, img_id) in zip(exist_img_ids):
            if exist_img_id != img_id:
                print(f"The image id at line {i} is different:")
                print(f"\tIn the file: {exist_img_id}, In this script: {img_id}")
                success = False
        if success:
            print("PASS!")
    else:
        with open(ids_path, 'w') as f:
            for img_id in img_ids:
                f.write(img_id + '\n')


for img_set in common.IMAGE_SETS:
    if img_set in imgset2lxrtfname:
        lxrt_path = Path(common.LXRT_ROOT)
        img_ids = []
        fname = imgset2lxrtfname[img_set]
        for datum in json.load((lxrt_path / fname).open()):
            img_id = datum['img_id']
            img_ids.append(img_id)

        write_ids(img_set, img_ids)

    if img_set in imgset2ccfname:
        cc_path = Path(common.CC_ROOT)
        img_ids = []
        fname = imgset2ccfname[img_set]
        if not (cc_path / fname).exists():
            print("No such file", cc_path / fname)
            continue
        for i, line in enumerate((cc_path / fname).open()):
            sent, img_id = line.split('\t')
            img_ids.append(img_id.strip())

        write_ids(img_set, img_ids)


================================================
FILE: vokenization/evaluate_diversity.py
================================================
import argparse
from collections import defaultdict
import json
import os
import sys

import numpy as np
import tqdm

from vokenization import Vokenizer, load_model_and_tokenizer
import common

imgset2fname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}

tokenizer_name = 'bert-base-uncased'


def load_lang_data(corpus_name, topk=10000):
    """
    Load {topk} sentences from the corpus named by {corpus_name}.
    """
    fpath = corpus_name + '.' + tokenizer_name
    tokens = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            tokens.append(list(map(int, line.split(' '))))
            if (i + 1) == topk:
                break
    print("Read %d sentences from the corpus %s located at %s." % (
        len(tokens), corpus_name, fpath
    ))
    return tokens


def load_cc_data(img_set):
    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
    sents = []
    with open(fname) as f:
        for line in f:
            sent, _ = line.split('\t')
            sents.append(sent)
    print("Load the %d sentences for image set %s from %s" % (
        len(sents), img_set, fname))
    return sents


def load_lxrt_data(img_set):
    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
    sents = []
    with open(fname) as f:
        data = json.load(f)
        for datum in data:
            sents.extend(datum['sentf']['mscoco'])
    print("Load the %d sentences for image set %s from %s" % (
        len(sents), img_set, fname))
    return sents


def analyze(token2info):
    """
    :param token2info: token2info: token --> (img_id --> cnt)
    :return:
    """
    names = ['Num Images', 'Max Cnt', 'Avg Cnt', 'Std Cnt']
    results = np.zeros(4)
    num_tokens = 0
    for token in token2info:
        img2cnt = token2info[token]
        cnts = np.array(list(img2cnt.values()))
        num_imgs = len(cnts)
        max_cnt = cnts.max()
        avg_cnt = cnts.mean()
        std_cnt = cnts.std()
        results += (num_imgs, max_cnt, avg_cnt, std_cnt)
        num_tokens += 1
    print("With %d tokens, " % num_tokens)
    results /= num_tokens
    for name, result in zip(names, results):
        print("Average of %s is %0.2f" % (name, result))

    corpus_info = defaultdict(lambda: 0)
    for info in token2info.values():
        for img, cnt in info.items():
            corpus_info[img] += cnt
    print("Cover %d images" % len(corpus_info))

# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory saved the model (containing'
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
                    help='The splits of images to be extracted')
parser.add_argument('--corpus', type=str, default='wiki103',
                    help='Evaluated corpus')
parser.add_argument('--maxsents', type=int, default=10000,
                    help='The maximum sentences to be evaluated in the corpus')
args = parser.parse_args()

keys_path = os.path.join(args.load, 'keys')

print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')
vokenizer = Vokenizer(model, tokenizer, keys_path, img_sets)

corpus_list = args.corpus.split(',')
for corpus in corpus_list:
    corpus = corpus.strip()
    print("\nProcessing corpus %s for diversity test:" % corpus)
    # token2info: token --> (img_id --> cnt)
    token2info = defaultdict(lambda: defaultdict(lambda: 0))

    if corpus in imgset2fname:
        if 'cc' in corpus:
            sents = load_cc_data(corpus)
        else:
            sents = load_lxrt_data(corpus)
        batch_size = 32
        for start_id in tqdm.tqdm(range(0, len(sents), batch_size)):
            batch_sents = sents[start_id: start_id + batch_size]
            scores, ids, tokens, paths = vokenizer.vokenize_sents(batch_sents, topk=None)
            for i in range(len(paths)):
                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
                    token2info[token][path] += 1
    else:
        tokens_list = load_lang_data(corpus, args.maxsents)
        batch_size = 16
        for start_id in tqdm.tqdm(range(0, len(tokens_list), batch_size)):
            batch_tokens = tokens_list[start_id: start_id + batch_size]
            scores, ids, tokens, paths = vokenizer.vokenize_ids(batch_tokens, topk=None)
            for i in range(len(paths)):
                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
                    token2info[token][path] += 1

    analyze(token2info)


================================================
FILE: vokenization/evaluate_retrieval.py
================================================
import argparse
from collections import defaultdict
import json
import os

import tqdm

from vokenization import Vokenizer, load_model_and_tokenizer
import common

imgset2fname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vg_nococo.json',
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}


def load_cc_data(img_set):
    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
    sentXimgname = []
    with open(fname) as f:
        for line in f:
            sent, gt_img_name = line.split('\t')
            gt_img_name = gt_img_name.strip()
            sentXimgname.append((sent, gt_img_name))
    print("Load the %d (img, sent) pairs for image set %s from %s" % (
        len(sentXimgname), img_set, fname))
    return sentXimgname


def load_lxrt_data(img_set):
    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
    sentXimgname = []
    with open(fname) as f:
        data = json.load(f)
        for datum in data:
            gt_img_name = datum['img_id'] + '.jpg'
            sents = datum['sentf']['mscoco']
            for sent in sents:
                sentXimgname.append((sent, gt_img_name))
    print("Load the %d (img, sent) pairs for image set %s from %s" % (
        len(sentXimgname), img_set, fname))
    return sentXimgname


# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory saved the model (containing'
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
                    help='The splits of images to be extracted')
args = parser.parse_args()

keys_path = os.path.join(args.load, 'keys')

print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')

sent_level = 'sent' in args.load

for img_set in img_sets:
    vokenizer = Vokenizer(model, tokenizer, keys_path, [img_set],
                          sent_level=sent_level)
    if 'cc' in img_set:
        sentXimgname = load_cc_data(img_set)
    else:
        sentXimgname = load_lxrt_data(img_set)

    topks = [1, 5, 10]
    print("\nEvaluate image set", img_set, "for topk retrieval:", topks)
    total = 0
    arg_topk = None if max(topks) == 1 else max(topks)
    results = defaultdict(lambda: 0)
    batch_size = 32
    for start_id in tqdm.tqdm(range(0, len(sentXimgname), batch_size)):
        batch_sentXimg = sentXimgname[start_id: start_id + batch_size]
        sents, gt_img_names = zip(*batch_sentXimg)
        sents = list(sents)

        scores, ids, tokens, paths_list = vokenizer.vokenize_sents(sents, topk=arg_topk)
        if sent_level:
            paths_list = [x[:3] for x in paths_list]     # Only eval the first vokens.
        if arg_topk is None:
            paths_list = [[[img_id] for img_id in sent] for sent in paths_list]
        for paths, gt_img_name in zip(paths_list, gt_img_names):                # for each sent in batch
            for topk_paths in paths[1:-1]:      # for each token in sent
                for k, kth_path in enumerate(topk_paths):     # for each img_path in topk image paths of a token
                    img_name = os.path.split(kth_path)[-1]
                    if img_name == gt_img_name:
                        results[k + 1] += 1
        total += sum(map(lambda x: len(x) - 2, paths_list))

    accumulate = 0
    for i in range(1, max(topks)+1):
        accumulate += results[i]
        if i in topks:
            print("R%d: %0.2f%%, (Random: %0.4f%%)" % (
                i,
                accumulate / total * 100.,
                i / vokenizer.img_num * 100.
            ))

    del vokenizer


================================================
FILE: vokenization/extract_vision_keys.py
================================================
# In this file, we extract the vision features as the keys in retrieval.
import argparse
import os
import pickle
import shutil
import sys

import h5py
import torch
from torchvision import transforms
from torchvision.datasets.folder import default_loader
import tqdm
from transformers import BertTokenizer
from PIL import Image

import common

# Load all images
Image.MAX_IMAGE_PIXELS = None


def get_img_path(img_set, img_id):
    """
    Get the paths regarding the img_set and img_id.
    THIS FUNCTION MIGHT NEED TO BE MODIFIED.
    """
    source, tag = img_set.split('_')
    if source == 'cc':
        split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (common.CC_ROOT, split_tag, img_id)
    elif 'COCO' in img_id:
        _, split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (common.COCO_ROOT, split_tag, img_id + '.jpg')
    else:   # VG images
        return "%s/images/%s.jpg" % (common.VG_ROOT, img_id)


def get_img_paths_and_ids(img_set):
    """
    Return a list of images paths and image ids in this 'img_set'.
    """

    # Load the image ids from the common local dir,
    # thus make sure that the order of the images are the same.
    info_dir = os.path.join(common.LOCAL_DIR, 'images')
    img_paths = []
    with open(os.path.join(info_dir, img_set + '.ids')) as f:
        img_ids = list(map(lambda x: x.strip(), f.readlines()))
    for img_id in img_ids:
        img_paths.append(get_img_path(img_set, img_id))
    return img_paths, img_ids


def save_img_paths_and_ids(img_set, img_paths, img_ids, output):
    info_dir = os.path.join(common.LOCAL_DIR, 'images')

    # Save Image Paths
    curr_paths_fname = os.path.join(output, img_set + '.path')
    print("\tSave img paths to ", curr_paths_fname)
    with open(curr_paths_fname, 'w') as f:
        for path in img_paths:
            f.write(path + "\n")

    # Save Image Ids
    curr_ids_fname = os.path.join(output, img_set + '.ids')
    print("\tSave img ids to ", curr_ids_fname)
    with open(curr_ids_fname, 'w') as f:
        for idx in img_ids:
            f.write(idx + "\n")

    common_paths_fname = os.path.join(info_dir, img_set + '.path')
    if os.path.exists(common_paths_fname):
        with open(common_paths_fname) as f:
            common_img_paths = f.readlines()
            common_img_paths = [img_path.strip() for img_path in common_img_paths]
            # All feature extractor should extract for the same image set.
            assert common_img_paths == img_paths
    else:
        shutil.copy(curr_paths_fname, common_paths_fname)


def extract_vision_feature_keys(model, img_transform, img_sets, output, batch_size):
    """

    :param model: The visn_model which takes an image [b, channel, H, W] as input,
                  and output with [b, f]
    :param img_transform: The transformation of images, compatible with training.
    :param img_sets: The sets of images to be extracted.
    :param output: The directory to save the extracted keys.
    :return:
    """
    last_dim = -1
    for img_set in img_sets:
        print("Extracting feature keys for image set %s" % img_set)
        img_paths, img_ids = get_img_paths_and_ids(img_set)
        saved_img_paths = []
        saved_img_ids = []
        img_keys = []
        tensor_imgs = []
        for i, img_path in enumerate(tqdm.tqdm(img_paths)):
            try:
                pil_img = default_loader(img_path)
            except Exception as e:
                print(e)
                print("Skip image %s" % img_path)
                continue
            saved_img_paths.append(img_path)
            saved_img_ids.append(img_ids[i])

            tensor_imgs.append(img_transform(pil_img))

            if len(tensor_imgs) == batch_size:
                visn_input = torch.stack(tensor_imgs).cuda()
                with torch.no_grad():
                    visn_output = model(visn_input)

                # Check sizes of features are equal.
                if last_dim == -1:
                    last_dim = visn_output.shape[-1]
                assert last_dim == visn_output.shape[-1]
                last_dim = visn_output.shape[-1]

                # Saved the features in hdf5
                img_keys.extend(visn_output.detach().cpu().numpy())

                tensor_imgs = []

        if len(tensor_imgs) > 0:
            visn_input = torch.stack(tensor_imgs).cuda()
            with torch.no_grad():
                visn_output = model(visn_input)
            # Saved the features in hdf5
            img_keys.extend(visn_output.detach().cpu().numpy())

        assert len(img_keys) == len(saved_img_paths)
        h5_path = os.path.join(output, img_set + '.hdf5')
        print(f"\tSave features (keys) to {h5_path} with hdf5 dataset 'Keys'.")
        h5_file = h5py.File(h5_path, 'w')
        dset = h5_file.create_dataset("keys", (len(saved_img_paths), last_dim))
        for i, img_key in enumerate(img_keys):
            dset[i] = img_key
        save_img_paths_and_ids(img_set, saved_img_paths, saved_img_ids, output)
        h5_file.close()


# This default transformation is used by PyTorch ResNet on ImageNet.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
default_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])


import torch
from torch import nn
import torchvision.models as models
def get_visn_arch(arch):
    try:
        return getattr(models, arch)
    except AttributeError as e:
        print(e)
        print("There is no arch %s in torchvision." % arch)

# __all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
#            'resnet152', 'resnext50_32x4d', 'resnext101_32x8d',
#            'wide_resnet50_2', 'wide_resnet101_2']


class VisnModel(nn.Module):
    def __init__(self, arch='resnet50', pretrained=True):
        """
        :param dim: dimension of the output
        :param arch: backbone architecture,
        :param pretrained: load feature with pre-trained vector
        :param finetuning: finetune the model
        """
        super().__init__()
        # Setup Backbone
        resnet = get_visn_arch(arch)(pretrained=pretrained)
        for param in resnet.parameters():
            param.requires_grad = False
        resnet.fc = nn.Identity()
        self.backbone = resnet

    def forward(self, img):
        """
        :param img: a tensor of shape [batch_size, H, W, C]
        :return: a tensor of [batch_size, d]
        """
        x = self.backbone(img)
        x = x.detach()
        # x = x / x.norm(2, dim=-1, keepdim=True)
        return x


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--load-dir', type=str, default=None,
                        help='The directory saved the model (containing'
                             'BEST.pth.model).')
    parser.add_argument('--torchvision-model', type=str, default=None,
                        help='The directory saved the model (containing'
                             'BEST.pth.model).')
    parser.add_argument('--image-sets', type=str, default='coco_minival',
                        help='The splits of images to be extracted')
    parser.add_argument('--output-dir', type=str, default=None,
                        help='The directory to save the extracted feature keys')
    parser.add_argument('--batch-size', type=int, default=32)
    args = parser.parse_args()

    img_sets = [img_set.strip() for img_set in args.image_sets.split(',')]

    if args.torchvision_model is not None:
        assert args.load_dir is None, ("either load from torch model using option 'torchvision_model'"
                                       "or from pre-trained CoX model with option 'load_dir'")
        visn_model = VisnModel(arch=args.torchvision_model).eval().cuda()
        if args.batch_size > 1:
            # for multi-batch extraction, must use the same image size
            img_transform = transforms.Compose([
                transforms.Resize(256),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize
            ])
        else:
            # For single-batch extraction, we want to extract high-quality features, with two processes:
            #    1. Use large image sizes (400 - 600)
            #    2. Keep the aspect ratio
            MIN_SIZE = 400.
            MAX_SIZE = 600.
            def img_transform_func(img):
                img_w, img_h = img.size     # PiL Image's size order is w, h
                assert img_w > 0 and img_h > 0
                scale = min(
                    MIN_SIZE / min(img_w, img_h),
                    MAX_SIZE / max(img_w, img_h),
                )
                # Keep the aspect ratio
                want_w, want_h = int(img_w * scale), int(img_h * scale)

                _img_transform = transforms.Compose([
                    transforms.Resize((want_h, want_w)),    # PyTorch use size order h, w
                    transforms.ToTensor(),
                    normalize
                ])
                return _img_transform(img)
            img_transform = img_transform_func
    else:
        # Load the model
        if os.path.exists(args.load_dir + '/BEST.pth.model'):
            print("Load model from %s." % (args.load_dir + '/BEST.pth.model'))
            sys.path.append(args.load_dir + '/src')
            for dirc in os.listdir(args.load_dir + '/src'):
                sys.path.append(args.load_dir + '/src/' + dirc)
            # import model        # The pickle has some issues... thus must load the library
            joint_model = torch.load(args.load_dir + '/BEST.pth.model')
            joint_model.eval()            # DO NOT FORGET THIS!!!
            visn_model = joint_model.visn_model
        else:
            print(f"No snapshot {args.load_dir + '/BEST.pth.model'}. Exit.")
            exit()

        # Load the img-preprocessing transformation, which used in training CoX model.
        if os.path.exists(args.load_dir + '/img_transform.pkl'):
            print("Load img transformation from %s." % (args.load_dir + '/img_transform.pkl'))
            with open(args.load_dir + '/img_transform.pkl', 'rb') as f:
                img_transform = pickle.load(f)
        else:
            print("Using default image transformatioin")
            img_transform = default_transform

    # Feature output directory
    output_dir = args.output_dir
    if args.output_dir is None:
        output_dir = args.load_dir + '/keys'      # Save the keys with the model dict
    os.makedirs(output_dir, exist_ok=True)

    extract_vision_feature_keys(
        visn_model,
        img_transform,
        img_sets,
        output_dir,
        args.batch_size
    )


================================================
FILE: vokenization/indexing.py
================================================
import numpy as np
import torch
import tqdm


class GPUIndexer(object):
    def __init__(self, keys, gpus=(0,), fp16=False):
        self.gpus = gpus
        self.gpu = gpus[0]
        self.keys = keys
        self.fp16 = fp16
        self.dim = len(self.keys[0])

    def topk(self, query, topk: int = 1):
        raise NotImplementedError

    def batch_topk(self, query, topk: int = 1):
        raise NotImplementedError

    def batch_top1(self, query):
        raise NotImplementedError


class TorchGPUIndexer(GPUIndexer):
    def __init__(self, keys, gpus=(0,), fp16=False):
        super().__init__(keys, gpus, fp16)
        self.gpu_keys = torch.tensor(keys).cuda(self.gpu)
        print(f"Build torch indexer on GPU {self.gpu}")

        if self.fp16:
            self.gpu_keys = self.gpu_keys.half()

    def topk(self, query, topk: int = 1):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys * query).sum(-1)
        topk_score, topk_idx = score.topk(topk)
        return topk_score, topk_idx

    def batch_topk(self, query, topk: int = 1):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)
        topk_score, topk_idx = score.topk(topk, dim=1)
        return topk_score, topk_idx

    def batch_top1(self, query):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)
        topk_score, topk_idx = score.max(dim=1)
        return topk_score, topk_idx

    def batch_top1_l2(self, query):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        # print(query.norm(dim=-1) - 1.)
        # print(self.gpu_keys.norm(dim=-1) - 1.)
        score = ((self.gpu_keys.unsqueeze(0) - query.unsqueeze(1)) ** 2).sum(-1)
        topk_score, topk_idx = score.min(dim=1)
        return topk_score, topk_idx


class FaissGPUIndexer(GPUIndexer):
    def __init__(self, keys, gpus=(0,), fp16=False):
        try:
            import faiss
        except Exception as e:
            print("Faiss is not installed! Please see https://github.com/facebookresearch/faiss/blob/master/INSTALL.md.")
            raise e
        super().__init__(keys, gpus, fp16)
        res = faiss.StandardGpuResources()
        index_flat = faiss.IndexFlatL2(self.dim)
        # index_flat = faiss.IndexFlatIP(self.dim)
        print(f"Build faiss indexer on GPU {self.gpu}")
        print(keys.shape)
        self.gpu_index_flat = faiss.index_cpu_to_gpu(res, self.gpu, index_flat)
        self.gpu_index_flat.add(keys)

    def batch_topk(self, query, topk: int = 1):
        if type(query) is torch.Tensor:
            query = query.cpu().numpy()
        D, I = self.gpu_index_flat.search(query, topk)
        D = D
        I = I
        D = torch.from_numpy(D)
        I = torch.from_numpy(I)
        return D, I

    def batch_top1(self, query):
        """
        :param query: shape of [b, f]
        """
        if type(query) is torch.Tensor:
            query = query.cpu().numpy()
        D, I = self.gpu_index_flat.search(query, 1)
        D = D[:, 0]
        I = I[:, 0]
        D = torch.from_numpy(D)
        I = torch.from_numpy(I)
        return D, I


if __name__ == '__main__':
    # 100k keys
    keys = np.random.uniform(size=(1000000, 64)) * 0.01
    querys = np.random.uniform(size=(1000000, 64)) * 0.01
    indexer = GPUIndexer(keys, [0], fp16=True)
    batch_size = 64
    for start in tqdm.tqdm(range(0, len(querys), batch_size)):
        query = querys[start: start + batch_size]
        # indexer.batch_topk(query, 1)
        top_score, top_idx = indexer.batch_top1(query)


================================================
FILE: vokenization/revokenization.py
================================================
# Copyleft 2020 project COL.

from transformers import AutoTokenizer


class ReVokenizer:
    """
    Convert a
    """
    def __init__(self, forward_tokenizer_name, backward_tokenizer_name, vokenizer):
        """
        :args forward_tokenizer:
        :args backward_tokenizer:
        :args vokenizer:
        """
        self.forward_tokenizer = AutoTokenizer.from_pretrained(forward_tokenizer_name, use_fast=True)
        self.backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name, use_fast=True)
        self.slow_backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name)
        self.vokenizer = vokenizer

        self.prepare_for_unicode()

    def vokenize_sent(self, sents, topk=None):
        pass

    def vokenize_ids(self, input_ids, topk=None, verbose=False):
        """
           backward_input
        <-- Backward Tokenizer
        <--    Sentence   -->
        Forward Tokenizer -->
            forward_input --> Vokenizer --> forward_results
        """
        sents, forward_input, backward_input = self.process(input_ids)
        alignments = self.batch_calculate_alignment(
            forward_input['offset_mapping'],
            backward_input['offset_mapping'],
        )
        forward_results = self.vokenizer.vokenize_ids(
            forward_input['input_ids'], topk
        )
        backward_results = self.batch_map_back(forward_results, alignments)
        if verbose:
        # if True:
            self.show_alignments(
                sents, forward_input, backward_input, alignments,
                input_ids, backward_results)
        return backward_results

    def show_alignments(self, sents, forward_inputs, backward_inputs, alignments, input_ids,
                        backward_results):
        forward_ids = forward_inputs['input_ids']
        forward_offsets = forward_inputs['offset_mapping']
        backward_ids = backward_inputs['input_ids']
        backward_offsets = backward_inputs['offset_mapping']
        _, _, backward_result_tokens, _ = backward_results
        for sent, forward_id, backward_id, forward_offset, backward_offset, alignment, input_id, backward_result_token in zip(
            sents, forward_ids, backward_ids, forward_offsets, backward_offsets, alignments, input_ids, backward_result_tokens
        ):
            print(sent)
            for backward_idx, forward_idx in enumerate(alignment):
                def get_str(l, r):
                    return sent[l: r]
                print("%2d %2d %7s %7s %7s  |  %7s %7s %7s" % (
                    backward_idx, forward_idx,
                    self.backward_tokenizer._convert_id_to_token(input_id[backward_idx]),
                    self.backward_tokenizer._convert_id_to_token(backward_id[backward_idx]),
                    get_str(*backward_offset[backward_idx]),
                    self.forward_tokenizer._convert_id_to_token(forward_id[forward_idx]),
                    backward_result_token[backward_idx + 1],
                    get_str(*forward_offset[forward_idx]),
                ))
            print()

    def show_input(self, sents, forward_inputs, backward_inputs, input_ids):
        forward_ids = forward_inputs['input_ids']
        forward_offsets = forward_inputs['offset_mapping']
        backward_ids = backward_inputs['input_ids']
        backward_offsets = backward_inputs['offset_mapping']

        for sent, forward_id, backward_id, forward_offset, backward_offset, input_id in zip(
                sents, forward_ids, backward_ids, forward_offsets, backward_offsets, input_ids
        ):
            print(sent)
            for i, (backward_i, bo, input_i) in enumerate(zip(backward_id, backward_offset, input_id)):
                print("%7s %7s" % (
                    self.backward_tokenizer._convert_id_to_token(backward_i),
                    self.backward_tokenizer._convert_id_to_token(input_i),
                    # self.forward_tokenizer._convert_id_to_token(forward_i),
                ), bo, sent[bo[0]: bo[1]] if bo is not None else '')
            print()


    def backward_decode(self, input_id):
        # return u''.join(self.backward_tokenizer.convert_ids_to_tokens(input_id)).replace('Ġ', ' ')
        # return self.backward_tokenizer.decode(input_id)
        tokens = self.slow_backward_tokenizer.convert_ids_to_tokens(input_id, skip_special_tokens=True)
        # print(tokens)
        return self.slow_backward_tokenizer.convert_tokens_to_string(
            tokens
        )

    def process(self, input_ids):
        """
        :return: two dicts (forward_input, backward_input)
            with keys "input_ids" "offset_mapping"
        """
        sents = [self.backward_decode(input_id) for input_id in input_ids]
        tokenizer_kwargs = {
            'return_token_type_ids': False,
            'return_attention_mask': False,
            'return_offsets_mapping': True,
        }
        # 'add_special_tokens': False,
        forward_input = self.forward_tokenizer.batch_encode_plus(
            sents,
            **tokenizer_kwargs
        )
        backward_input = self.backward_tokenizer.batch_encode_plus(
            sents,
            **tokenizer_kwargs
        )

        # Avoid batch-1
        self._safe_guard(forward_input)
        self._safe_guard(backward_input)

        # Remove <cls> and <sep>
        self._remove_special_tokens(forward_input)
        self._remove_special_tokens(backward_input)

        # postprocessing of the backwards
        self._calibrate_backward_offset(backward_input)
        # self._fix_nouns(backward_input)
        self._fix_length(backward_input, input_ids)

        assert list(map(len, backward_input['input_ids'])) == \
               list(map(len, input_ids)), (list(map(len, backward_input['input_ids'])),
               list(map(len, input_ids)))
        return sents, forward_input, backward_input

    @staticmethod
    def _safe_guard(inputs):
        ids = inputs['input_ids']
        if type(ids[0]) is int:
            for key, value in inputs.items():
                inputs[key] = [value]

    @staticmethod
    def _remove_special_tokens(inputs):
        if type(inputs) is dict:
            for key in inputs:
                inputs[key] = ReVokenizer._remove_special_tokens(inputs[key])
            return inputs
        return [input[1:-1] for input in inputs]

    @staticmethod
    def _fix_nouns(backward_input):
        backward_offsets = backward_input['offset_mapping']
        for backward_offset in backward_offsets:
            last_not_noun_idx = -1
            while backward_offset[last_not_noun_idx] is None:
                last_not_noun_idx -= 1
            for noun_idx in range(last_not_noun_idx + 1, 0):
                backward_offset[noun_idx] = backward_offset[last_not_noun_idx]

    @staticmethod
    def _fix_length(backward_input, input_ids):
        backward_ids = backward_input['input_ids']
        backward_offsets = backward_input['offset_mapping']
        for i in range(len(backward_ids)):
            desired_length = len(input_ids[i])
            if len(backward_ids[i]) > desired_length:
                backward_ids[i] = backward_ids[i][:desired_length]
                backward_offsets[i] = backward_offsets[i][:desired_length]

            while len(backward_ids[i]) < desired_length:
                backward_ids[i].append(backward_ids[i][-1])
                backward_offsets[i].append(backward_offsets[i][-1])

            # print(desired_length)
            # print(len(backward_ids[i]))
            assert desired_length == len(backward_ids[i]) == len(backward_offsets[i])

    def _calibrate_backward_offset(self, backward_input):
        batch_input_ids = backward_input['input_ids']
        batch_new_offset = []
        for input_ids in batch_input_ids:
            now = 0
            byte_list = []
            new_offset = []
            for input_id in input_ids:
                token = self.backward_tokenizer._convert_id_to_token(input_id)
                start = now
                unicode_complete_flag = True
                for char in token:
                    byte = self.c2b[char]
                    byte_list.append(byte)
                    try:
                        unicode_char = bytes(byte_list).decode('utf-8')
                        byte_list = []
                        now += 1
                        unicode_complete_flag = True
                    except UnicodeDecodeError as e:
                        unicode_complete_flag = False
                if unicode_complete_flag:
                    left, right = start, now
                else:
                    left, right = start, now + 1
                new_offset.append((left, right))
            # print(token, sent[left: right].replace(' ', 'Ġ'))
            batch_new_offset.append(new_offset)
        backward_input['offset_mapping'] = batch_new_offset

    def prepare_for_unicode(self):
        def bytes_to_unicode():
            """
            Returns list of utf-8 byte and a mapping to unicode strings.
            We specifically avoids mapping to whitespace/control characters the bpe code barfs on.
            The reversible bpe codes work on unicode strings.
            This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
            When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
            This is a signficant percentage of your normal, say, 32K bpe vocab.
            To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
            """
            bs = (
                    list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(
                range(ord("®"), ord("ÿ") + 1))
            )
            cs = bs[:]
            n = 0
            for b in range(2 ** 8):
                if b not in bs:
                    bs.append(b)
                    cs.append(2 ** 8 + n)
                    n += 1
            cs = [chr(n) for n in cs]
            return dict(zip(bs, cs))
        self.b2c = bytes_to_unicode()
        self.c2b = {c: b for b, c in self.b2c.items()}

    def show(self, ids_list):
        print(
            [self.backward_tokenizer.convert_ids_to_tokens(ids) for ids in ids_list]
        )

    @staticmethod
    def batch_map_back(results, alignments):
        if type(results) is tuple:
            # Handle multiple output by the vokenizer
            #   i.e., input_ids, input_scores, ...
            return [ReVokenizer.batch_map_back(one_results, alignments) for one_results in results]
        new_results = []
        for result, alignment in zip(results, alignments):
            # print(result)
            # print(max(alignment), len(result))
            new_results.append(
                [result[0]] + [result[idx + 1] for idx in alignment] + [result[-1]])
            assert max(alignment) < (len(result) - 2)
        return new_results

    @staticmethod
    def batch_calculate_alignment(batch_forward_offsets, batch_backward_offsets):
        """
        for each backward_token indicated by backward offset, align a forward token to it.
        """
        alignments = []
        for forward_offsets, backward_offsets in zip(batch_forward_offsets, batch_backward_offsets):
            alignment = []
            # Backward: I  ha ve a lov ely  c at.
            # Sent:     I  have  a lovely   cat
            # Forward:  I  hav e a lo ve ly cat.
            now_idx = 0
            for backward_offset in backward_offsets:
                best_idx = now_idx
                best_iou = IoU(forward_offsets[best_idx], backward_offset)
                while (now_idx + 1 < len(forward_offsets)) and \
                      (forward_offsets[now_idx][1] < backward_offset[1]):
                    now_idx += 1
                    now_iou = IoU(forward_offsets[now_idx], backward_offset)
                    if now_iou > best_iou:
                        best_idx = now_idx
                        best_iou = now_iou
                alignment.append(best_idx)
            alignments.append(alignment)
        return alignments


def IoU(a, b):
    x1, y1 = a
    x2, y2 = b
    len1 = y1 - x1
    len2 = y2 - x2
    I = max(min(y1, y2) - max(x1, x2), 0)
    U = len1 + len2 - I
    return I / max(U, 1)


if __name__ == "__main__":
    revokenizer = ReVokenizer('bert-base-uncased', 'roberta-base', None)
    tokenizer = AutoTokenizer.from_pretrained('roberta-base')
    sents = ['Do not panic. ',
             ' iso have a dream .',
             ' This is a test???',
             'Congratulations to the LiLT Founder and CEO, @stanfordnlp grad, Spence Green!',
             'Ay congrats Ethan! An awesome crew, well deserved',
             ' By the fourth season, fewer than three million viewers tuned in each week despite what some fans and critics considered an increase in episode quality.',
             'Filming of the final episode began on Friday, February 25, after the first half of the day was spent completing "Terra Prime". Principal photography took eight days to complete, one day longer than usual. ',
             'sda asdo weij sdjf oweif bqosdj weorasd.?SdfasXX...',
             ]

    ids = [tokenizer.encode(sent, add_special_tokens=False) for sent in sents]
    print(sents)
    sents = [tokenizer.decode(idx) for idx in ids]
    print(sents)
    revokenizer.vokenize_ids(ids)


================================================
FILE: vokenization/revokenize_corpus_mp.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.

import argparse
import copy
from multiprocessing import Queue, Process
import os
import queue
import sys
import time

import h5py
import torch
import tqdm
from spacy.lang.en import English

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from vokenization.vokenization import load_model_and_tokenizer, Vokenizer
from vokenization.revokenization import ReVokenizer


# Handle the GPU issue in multi-processing.
from multiprocessing import set_start_method
try:
    set_start_method('spawn')
except RuntimeError:
    pass


def processer(args, input_queue, output_queue):
    print(f"Setup workers on gpu {args.gpus}")
    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])

    print("Build models and tokenizer")
    # We will assign the GPU to model latter, thus load to cpu first!
    model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
    keys_dir = args.load + '/keys'  # Save the keys with the model dict

    print("Build Retriever from %s with image sets" % keys_dir, img_sets)
    vokenizer = Vokenizer(model, tokenizer, keys_dir,
                          img_sets=img_sets, max_img_num=args.max_img_num,
                          gpus=args.gpus, sent_level=('sent' in args.load))
    print(f"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.")

    # Before vokenization, save the image ids
    dset_name = os.path.split(args.corpus)[-1]
    modifier = f".{vokenizer.img_num}" if vokenizer.img_num != 50000 else ""
    vokens_img_ids_path = os.path.join(
        args.output,
        f"{dset_name}.{'_'.join(img_sets)}{modifier}.ids"
    )
    if args.gpus[0] == 0:
        if os.path.exists(vokens_img_ids_path):
            # If the img_ids file exists, assert that they are the same.
            saved_img_ids = open(vokens_img_ids_path).readlines()
            img_ids = vokenizer.img_ids
            assert len(saved_img_ids) == len(img_ids)
            for saved_img_id, img_id in zip(saved_img_ids, img_ids):
                assert saved_img_id.strip() == img_id
        else:
            vokenizer.dump_img_ids(vokens_img_ids_path)

    while True:
        page_id, sents = input_queue.get()
        # Print the first few sents for debugging
        if args.gpus[0] == 0:
            if page_id < 12 and sents is not None:
                print('page_id:', page_id)
                print('batch_size:', len(sents))
                print('ids of sent[0]:', sents[0])
                print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0]))
                print()
        # print(f"Processer {args.gpus}: Get Page Id {page_id}")
        if sents is not None:
            output_str = ''
            results = vokenizer.vokenize_ids(sents)
            idxs = results[1]
            for j, idx in enumerate(idxs):
                assert len(idx[1:-1]) == len(sents[j])
                dump_idx = map(lambda x: str(x.item()), idx[1:-1])
                output_str += ' '.join(dump_idx) + '\n'

            output_queue.put((page_id, output_str))
        else:
            break


def reducer(output_fname, output_queue, total_tokens):
    next_page_id = 0
    heap = queue.PriorityQueue()
    output = open(output_fname, 'a')
    cache = ""
    start_time = None
    processed_tokens = 0

    while True:
        page_id, result = output_queue.get()
        if start_time is None:      # The clock starts to tick when receiving the first package.
            start_time = time.time()
        # print("Reducer: Get Page Id %d" % page_id)
        if result is not None:
            # Put it into the heap
            heap.put((page_id, result))

            # Check the could-be-dumped data in the queue
            while heap.qsize() > 0:
                smallest_page_id, result = heap.get()
                if smallest_page_id == next_page_id:
                    # which means that this page is the next page, thus dump it
                    # print("Reducer: Commit Page Id %d" % next_page_id)
                    processed_tokens += len(result.split(' '))
                    cache += result
                    next_page_id += 1
                else:
                    heap.put((smallest_page_id, result))
                    break
            # print("Reducer: Length of Cache Now", len(cache))
            if len(cache) > 1000000:
                # Dump for every 1000000 characters to reduce IO calls
                output.write(cache)
                output.flush()
                cache = ''
                used_time = int(time.time() - start_time)
                print("Process %d tokens, %d to go, with speed %0.2f tokens/second,"
                      "finished in %0.2f hours" % (
                    processed_tokens,
                    total_tokens - processed_tokens,
                    processed_tokens / used_time,
                    (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600
                ))
        else:
            if len(cache) > 0:
                output.write(cache)
                output.flush()
                cache = ''
            break

    output.close()


def setup_mp(args, tokens, sent_ranges, vokens_path):
    QUEUE_SIZE = 10000
    input_queue = Queue(maxsize=QUEUE_SIZE)
    output_queue = Queue(maxsize=QUEUE_SIZE)

    workers = []
    num_gpu = torch.cuda.device_count()
    for worker_id in range(args.num_workers):
        gpu_id = worker_id % num_gpu
        curr_args = copy.copy(args)
        curr_args.gpus = (gpu_id,)
        worker = Process(target=processer,
                         args=(curr_args, input_queue, output_queue))
        worker.daemon = True
        worker.start()
        workers.append(worker)

    total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0
    reduce = Process(target=reducer,
                     args=(vokens_path, output_queue, total_tokens))
    reduce.start()

    for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)):
        sents = []
        for left, right in sent_ranges[start_id: start_id + args.batch_size]:
            sents.append(tokens[left: right])
        input_queue.put((i, sents))

    # Notifying workers the end of input
    for _ in workers:
        input_queue.put((-1, None))

    # wait for workers to terminate
    for w in workers:
        w.join()

    # Notify the reducer the end of output
    output_queue.put((-1, None))

    # wait for reducer to terminate
    reduce.join()


def segment_sent(
        tokens,
        tokenizer,
        tokens_line_info_path,
        tokens_sent_info_path
    ):
    """
    Single-processed segmentation of sentences. We might need to parallel this as well.
    """
    with open(tokens_line_info_path) as f:
        line_starts = list(map(int, f.readlines()))

    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)

    sent_starts = [0]
    now = 0
    for i in tqdm.tqdm(range(len(line_starts) - 1)):
        start_token_idx = line_starts[i]
        end_token_idx = line_starts[i + 1]
        line_tokens = tokens[start_token_idx: end_token_idx]
        line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens))
        line = line.replace("[UNK]", "UNK")

        doc = nlp(line)
        sents_len = 0
        sents = []
        for sent in doc.sents:
            if i < 2:
                print(sent)
            sent = str(sent)
            sents.append(sent)
            words = sent.split(' ')
            sent_len = len(words)
            now += sent_len
            sent_starts.append(now)
            sents_len += sent_len

        if sents_len != len(line_tokens):
            print(sents_len)
            print(sents)
            print(len(line_tokens))
            print(line)
            assert False
        assert sent_starts[-1] == end_token_idx

    with open(tokens_sent_info_path, 'w') as f:
        for sent_start in sent_starts:
            f.write(str(sent_start) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Text
    parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw')
    # Models
    parser.add_argument('--load', type=str,
                        default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                        help='The directory saved the model (containing'
                             'BEST.pth.model).')
    parser.add_argument('--output', type=str, default=None,
                        help='The directory to save the extracted feature keys.'
                             '"None" would save in the "load" dir')
    parser.add_argument('--backward-tokenizer-name', type=str, default='roberta-base')
    parser.add_argument('--forward-tokenizer-name', type=str, default='roberta-base')
    # Vision: Define the vokens set
    parser.add_argument('--image-sets', type=str, default='vg_nococo',
                        help='The splits of images to be extracted')
    parser.add_argument('--max-img-num', type=int, default=50000,
                        help='number of images used. -1 means all images.')
    # Speed Up Options:
    parser.add_argument('--num-workers', type=int, default=-1,
                        help='-1 will use all GPUs.')
    parser.add_argument('--batch-size', type=int, default=16,
                        help='The # of sentences in a batch.')
    args = parser.parse_args()

    if args.num_workers == -1:
        args.num_workers = torch.cuda.device_count()

    if args.output is None:
        args.output = os.path.join(args.load, 'vokens')
    os.makedirs(args.output, exist_ok=True)

    dset_name = os.path.split(args.corpus)[-1]
    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])
    print()
    print("Main Th"
          "read: Build a virtual vokenizer to check the number of images.")
    keys_dir = args.load + '/keys'  # Save the keys with the model dict
    virtual_vokenizer = Vokenizer(
        None, None, keys_dir,
        img_sets=img_sets, max_img_num=args.max_img_num,
        gpus=(-1,), sent_level=('sent' in args.load))
    modifier = f".{virtual_vokenizer.img_num}" if virtual_vokenizer.img_num != 50000 else ""
    vokens_path = os.path.join(
        args.output,
        f"{dset_name}.{'_'.join(img_sets)}{modifier}"
    )
    tokens_hdf5_path = f'{args.corpus}.{args.backward_tokenizer_name}.hdf5'
    tokens_sent_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.sent'

    # "Load" tokens from hdf5
    tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r')
    tokens = tokens_hdf5['tokens']

    # Calibrate the start line if the vokens have been proceeded.
    if not os.path.exists(tokens_sent_info_path):
        tokens_line_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.line'
        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
        segment_sent(
            tokens,
            tokenizer,
            tokens_line_info_path,
            tokens_sent_info_path
        )

    # Load sent info and find the start sentence
    with open(tokens_sent_info_path) as f:
        sent_starts = list(map(int, f.readlines()))

    # Skip the sentences which have been extracted.
    extracted_tokens = 0
    if os.path.isfile(vokens_path):
        with open(vokens_path, 'r') as g:
            for g_line in tqdm.tqdm(g):
                extracted_tokens += len(g_line.strip().split(' '))
    try:
        start_sent_idx = sent_starts.index(extracted_tokens)
    except ValueError as e:
        print("The extracted tokens does not match a starting sentence.")
        print(e)

    # Start to vokenize
    print("Main Thread: Dump visual tokens to %s" % vokens_path)
    print("Main Thread: Start vokenization from the %d'th token" % sent_starts[start_sent_idx])

    sent_ranges = []
    for i in range(start_sent_idx, len(sent_starts) - 1):
        left_token_idx = sent_starts[i]
        right_token_idx = sent_starts[i + 1]
        sent_ranges.append((left_token_idx, right_token_idx))

    setup_mp(args, tokens, sent_ranges, vokens_path)

    # save into hdf5 file
    if os.path.exists(vokens_path + '.hdf5'):
        print("The hdf5 file %s already exists. So the hdf5 file is not converted."
              % (vokens_path + '.hdf5'))
    else:
        with open(args.corpus + '.' + args.backward_tokenizer_name + ".sent") as f:
            for i, line in enumerate(f):
                pass
            num_tokens = int(line)
            num_sents = i

        h5_file = h5py.File(vokens_path + '.hdf5', 'w')
        dset = h5_file.create_dataset("vokens", (num_tokens,), dtype='int32')

        dump_interval = 100000
        dump_iter = 0
        lines = 0

        with open(vokens_path) as f:
            tokens = []
            for line in tqdm.tqdm(f, total=num_sents):
                for token in map(int, line.split(' ')):
                    tokens.append(token)
                if len(tokens) >= dump_interval:
                    dset[dump_iter: dump_iter + len(tokens)] = tokens
                    dump_iter += len(tokens)
                    tokens = []
                lines += 1
            dset[dump_iter: dump_iter + len(tokens)] = tokens
            dump_iter += len(tokens)
            assert num_tokens == dump_iter
            print(lines, num_sents)
            assert lines == num_sents
        h5_file.close()


================================================
FILE: vokenization/vokenization.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.

from collections import defaultdict
import math
import pickle
import os
import sys

import h5py
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer

import common
from indexing import TorchGPUIndexer, FaissGPUIndexer

VERY_LARGE = 9595959595


class Vokenizer:
    def __init__(self, model, tokenizer, keys_dir, img_sets=('coco_minival',),
                 max_img_num=VERY_LARGE, gpus=(0,), backend='faiss', upper_bound=128,
                 sent_level=False):
        """

        :param model: Hugginface language model
        :param tokenizer: Hugginface Tokenizer
        :param keys_dir: the directory which saves the keys.
        :param img_sets: the img_sets to be loaded, see common.IMAGE_SETS for all options.
        :param max_img_num: load up to #max_img_num images into the dictionary
        :param gpus: The GPUs used in calculating the BERT outputs and indexing.
                     Note: Currently only one GPU is supported!!!
        """
        self.model = model.cuda(gpus[0]) if model is not None else model
        self.tokenizer = tokenizer
        self.img_sets = img_sets
        self.gpus = gpus        # The GPUs used in the indexer
        self.gpu = self.gpus[0]
        self.backend = backend
        self.upper_bound = upper_bound
        self.sent_level = sent_level    # Otherwise use word level

        max_img_num = VERY_LARGE if max_img_num == -1 else max_img_num
        # These two are important, which indicates the mapping from
        # vokens to their actual images.
        self.img_paths = []
        self.img_ids = []
        for img_set in self.img_sets:
            assert img_set in common.IMAGE_SETS, "%s not in image sets %s" % (
                img_set, common.IMAGE_SETS)

            # Load image paths corresponding to the keys.
            # img_paths_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + "_paths.txt")
            # img_ids_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + "_ids.txt")
            img_paths_fname = os.path.join(keys_dir, f"{img_set}.path")
            img_ids_fname = os.path.join(keys_dir, f"{img_set}.ids")
            if not os.path.exists(img_paths_fname):
                # If the actual images are not saved on the server, we would use the img_ids.
                img_paths_fname = img_ids_fname
            with open(img_paths_fname) as f:
                all_img_paths = list(map(lambda x: x.strip(), f.readlines()))
            with open(img_ids_fname) as g:
                all_img_ids = list(map(lambda x: x.strip(), g.readlines()))
            assert len(all_img_paths) == len(all_img_ids)
            for img_path, img_id in zip(all_img_paths, all_img_ids):
                if len(self.img_paths) < max_img_num:
                    self.img_paths.append(img_path)
                    self.img_ids.append(f"{img_set}/{img_id}")
                else:
                    break
        assert len(self.img_paths) == len(self.img_ids)

        # Lazy loading and indexing
        self.keys = None
        self.keys_dir = keys_dir
        self.indexed = False
        self.indexer = None

    @property
    def img_num(self):
        return len(self.img_paths)

    def dump_img_ids(self, fname):
        """
        Dump the mapping from the voken_id to img_ids, to fname.
        Saved in the format of array.
        """
        with open(fname, 'w') as f:
            for img_id in self.img_ids:
                f.write(img_id + "\n")

    def __len__(self):
        return self.img_num

    def indexing(self):
        self.model.eval()

        # Load pre-extracted image keys.
        self.keys = []
        remain_img_num = self.img_num
        for img_set in self.img_sets:
            assert img_set in common.IMAGE_SETS, "%s not in image sets %s" % (
                img_set, common.IMAGE_SETS)
            keys_fname = os.path.join(self.keys_dir, img_set + '.hdf5')
            if not os.path.exists(keys_fname):
                assert False, "keys of image set %s is not extracted, please save it at %s" % (
                    img_set, keys_fname
                )

            # Load Keys
            h5_file = h5py.File(keys_fname, 'r')
            dset = h5_file["keys"]
            load_img_num = min(remain_img_num, len(dset))
            load_keys = dset[:load_img_num]
            self.keys.append(load_keys)
            remain_img_num -= load_img_num
            h5_file.close()
            if load_img_num == 0:
                break

        # Lazy indexing
        self.keys = np.concatenate(self.keys, 0)
        if self.backend == 'torch':
            self.indexer = TorchGPUIndexer(self.keys, gpus=self.gpus, fp16=True)
        elif self.backend == 'faiss':
            self.indexer = FaissGPUIndexer(self.keys, gpus=self.gpus, fp16=True)
        else:
            raise NotImplementedError(f"Backend {self.backend} is not supported")

        self.indexed = True

    def vokenize_sents(self, sents, topk=None):

        input_ids = []
        for sent in sents:
            input_ids.append(self.tokenizer.encode(
                sent,
                add_special_tokens=False,
                # return_tensors='pt'     # Return PyTorch (pt) tensors
            ))
        return self.vokenize_ids(input_ids, attention_mask=None, topk=topk)

    def vokenize_ids(self, input_ids, attention_mask=None, topk=None):
        """
        :param input_ids:  A list of token_ids i.e.,
                [[token_1_1, token_1_2, ...], [token_2_1, token_2_2, ...], ...]
        :param attention_mask: I did not use it for now.
        :param topk: Retrieve the topk vokens for each token.
        :return: top_scores, top_idxs, input_tokens, top_paths
            Note: 1. The results would consider the additional special tokens while the input_tokens do **not**.
                  2. If topk=None, it will be a 2-d results with:
                         [ [s11_top1, s12_top1, ...],
                           [s21_top1, s22_top1, ...],
                           ..... ]
                     If topk!=None (e.g., 1, 5, 10), it will be a 3-d results with:
                         [ [ [s11_top1, s11_top2, ...],
                             [s12_top1, s12_top2, ...],
                             ...... ],
                           [ [s21_top1, s21_top2, ...],
                             [s22_top1, s22_top2, ...],
                             ...... ],
                           ..... ],
                    where s11_top1 means s1(the 1st sentence)1(the 1st token of the 1st sentence)_top1(the top-1 index)
        """
        if not self.indexed:        # Index the keys at the first retrieval call.
            self.indexing()

        # The original tokens
        input_tokens = [
            ([self.tokenizer.cls_token] + [self.tokenizer._convert_id_to_token(idx) for idx in input_id] + [self.tokenizer.sep_token])
            for input_id in input_ids]

        # Deal with over-length tokens (because the BERT-style encoder has length limit due to the positional embedding)
        # Here is a process to avoid very short sequence when cutting the long sentence:
        # Suppose the sentence length is 18 and UPPER_BOUND is 8,
        # we draw it as                         <----------------->, where "<" is bos, and ">" is the last token
        # instead of cut it as                  <------->------->->, which has very short sequence <-> in the end.
        # we cut it with almost equal length:   <----->----->----->
        input_ids = input_ids.copy()
        sent2segs = defaultdict(list)
        for i in range(len(input_ids)):
            if len(input_ids[i]) > self.upper_bound:
                num_segments = math.ceil(len(input_ids[i]) / self.upper_bound)
                tokens_per_seg = int(len(input_ids[i]) / num_segments)
                remaining = input_ids[i][tokens_per_seg:]
                input_ids[i] = input_ids[i][:tokens_per_seg]
                while len(remaining) > 0:
                    # print(len(remaining))
                    sent2segs[i].append(len(input_ids))
                    input_ids.append(remaining[:tokens_per_seg])
                    remaining = remaining[tokens_per_seg:]

        # Convert to torch tensors.
        if not type(input_ids) is torch.Tensor:
            input_ids = [
                torch.tensor(self.tokenizer.build_inputs_with_special_tokens(list(input_id)))
                for input_id in input_ids
            ]
            input_ids = pad_sequence(input_ids,
                                     batch_first=True,
                                     padding_value=self.tokenizer.pad_token_id)
            attention_mask = (input_ids != self.tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0
            if attention_mask.all():
                attention_mask = None

        # Get lengths
        if attention_mask is not None:
            lengths = list(attention_mask.sum(1).numpy())
        else:
            lengths = [len(input_ids[0])] * len(input_ids)

        if attention_mask is not None and type(input_ids) is not torch.Tensor:
            attention_mask = torch.tensor(attention_mask)

        # Lang model inference
        input_ids = input_ids.cuda(self.gpu)
        if attention_mask is not None:
            attention_mask = attention_mask.cuda(self.gpu)

        def apply_model(input_ids, attention_mask, lengths):
            with torch.no_grad():
                lang_output = self.model(input_ids, attention_mask)     # b, l, f
                if type(lang_output) is list:
                    lang_output = lang_output[0]

            # Gather language output
            if self.sent_level:
                # lang_output of shape [batch_size, dim]
                gathered_output = lang_output
            else:
                # lang_output of shape [batch_size, max_len, dim]
                # --> gathered_output [ \sum_i len(i), dim]
                gathered_output = torch.cat([output[:length] for output, length in zip(lang_output, lengths)])

            # Visn retrieval
            if topk is None:
                # It will call the function `max()` and return a 2-d tensor
                top_score, top_idx = self.indexer.batch_top1(gathered_output)
            else:
                # It will call the function `topk(k)` and return a 3-d tensor
                top_score, top_idx = self.indexer.batch_topk(gathered_output, topk=topk)

            return top_score, top_idx

        top_score, top_idx = memory_safe_apply(apply_model, input_ids, attention_mask, lengths)

        # Split
        top_score, top_idx = top_score.detach().cpu(), top_idx.detach().cpu()
        if not self.sent_level:
            # If word level, split it
            top_scores = list(top_score.split(lengths))       # [ float_tensor(len1), float_tensor(len2), ...]
            top_idxs = list(top_idx.split(lengths))           # [ int_tensor(len1), int_tensor(len2), ...]
        else:
            # If sent level, repeat the voken.
            #   Use clone() here
            top_scores = [ts.expand(length, *ts.shape).clone() for ts, length in zip(top_score, lengths)]
            top_idxs = [tid.expand(length, *tid.shape).clone() for tid, length in zip(top_idx, lengths)]

        if top_idxs[0].dim() == 1:
            # Return the top1 paths
            top_paths = [[self.img_paths[idx.item()] for idx in top_idx]
                         for top_idx in top_idxs]
        else:
            # Return the topk paths related to the sentences
            top_paths = [[[self.img_paths[k_idx.item()] for k_idx in topk_idx]
                          for topk_idx in top_idx]
                         for top_idx in top_idxs]

        if self.sent_level:
            for i, tid in enumerate(top_idxs):
                # Keep the first positive and others negative, to mark the header of the sentence.
                # [3] --> [3, 3, 3, 3] --> [-4, -4, -4, -4] --> [3, -4, -4, -4]
                # "-x-1" is used to handle zero, [0] --> [1, 1, 1, 1] --> [-1, -1, -1, -1] --> [0, -1, -1, -1]
                # print('Before conversion', tid)
                tid[:] = tid * (-1) - 1
                tid[1] = tid[1] * (-1) - 1  # The tid[0] is corresponding to <cls>
                # print('After conversion', top_idxs[i])

        # Put back the segments of over-length sentences
        if len(sent2segs) > 0:
            for sent_id, segment_ids in sent2segs.items():
                for segment_id in segment_ids:
                    # Append the results with the segments:
                    #    ---------Now----------------   + ----Appended Segment-----
                    #    [<cls1> I have a <sep1>][:-1]  + [<cls2> cat . <sep2>][1:]
                    #  = [<cls1> I have a cat . <sep2>]
                    top_scores[sent_id] = torch.cat([top_scores[sent_id][:-1], top_scores[segment_id][1:]])
                    top_idxs[sent_id] = torch.cat([top_idxs[sent_id][:-1], top_idxs[segment_id][1:]])
                    top_paths[sent_id] = top_paths[sent_id][:-1] + top_paths[segment_id][1:]
            num_sents = len(input_tokens)
            top_scores = top_scores[:num_sents]
            top_idxs = top_idxs[:num_sents]
            top_paths = top_paths[:num_sents]

        return top_scores, top_idxs, input_tokens, top_paths


def memory_safe_apply(func, *args):
    """
    If batch-wise applying exceeds the GPU memory, it would process each sample separately and sequentially
    :param func: function with some constraints, see code for details.
    :param args: args of this function
    :return:
    """
    try:
        return func(*args)
    except RuntimeError as e:
        print(e)
        batch_size = len(args[0])
        outputs = []
        for i in range(batch_size):
            one_batch_args = tuple(a[i: i+1] for a in args)
            output = func(*one_batch_args)
            # **output of the func should be of the format**:
            # (o1, o2, ...) where each o_i is a tensor of shape [1, ...]
            assert type(output) is tuple or type(output) is list
            outputs.append(output)
    # outputs = ( (o1_1, o1_2, ...), (o2_1, o2_2, ...), ...)
    # zip(*outputs) = ( (o1_1, o2_1, ...), (o1_2, o2_2, ...), ...)
    outputs = tuple(torch.cat(output) for output in zip(*outputs))
    return outputs


default_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


def load_model_and_tokenizer(load, cpu=False):
    if os.path.exists(load + '/BEST.pth.model'):
        sys.path.append(load + '/src')
        for dirc in os.listdir(load + '/src'):
            sys.path.append(load + '/src/' + dirc)
        # import model  # The pickle has some issues... thus must load the library
        if cpu:
            device = torch.device('cpu')
            joint_model = torch.load(load + '/BEST.pth.model',
                                     map_location=device)
        else:
            joint_model = torch.load(load + '/BEST.pth.model')
        joint_model.eval()  # DO NOT FORGET THIS!!!
    else:
        print("No snapshots there, exit.")
        exit()

    if os.path.exists(load + '/tokenizer.pkl'):
        with open(load + '/tokenizer.pkl', 'rb') as f:
            tokenizer = pickle.load(f)
    else:
        tokenizer = default_tokenizer

    return joint_model.lang_model, tokenizer


================================================
FILE: vokenization/vokenize_corpus_mp.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.

import argparse
import copy
from multiprocessing import Queue, Process
import os
import queue
import sys
import time

import h5py
import torch
import tqdm
from spacy.lang.en import English

# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from vokenization import load_model_and_tokenizer, Vokenizer

# Handle the GPU issue in multi-processing.
from multiprocessing import set_start_method
try:
    set_start_method('spawn')
except RuntimeError:
    pass


def processer(args, input_queue, output_queue):
    print(f"Setup workers on gpu {args.gpus}")
    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])

    print("Build models and tokenizer")
    # We will assign the GPU to model latter, thus load to cpu first!
    model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
    keys_dir = args.load + '/keys'  # Save the keys with the model dict

    print("Build Retriever from %s with image sets" % keys_dir, img_sets)
    vokenizer = Vokenizer(model, tokenizer, keys_dir,
                          img_sets=img_sets, max_img_num=args.max_img_num,
                          gpus=args.gpus, sent_level=('sent' in args.load))
    print(f"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.")

    # Before vokenization, save the image ids
    dset_name = os.path.split(args.corpus)[-1]
    modifier = f".{vokenizer.img_num}" if vokenizer.img_num != 50000 else ""
    vokens_img_ids_path = os.path.join(
        args.output,
        f"{dset_name}.{'_'.join(img_sets)}{modifier}.ids"
    )
    if args.gpus[0] == 0:
        if os.path.exists(vokens_img_ids_path):
            # If the img_ids file exists, assert that they are the same.
            saved_img_ids = open(vokens_img_ids_path).readlines()
            img_ids = vokenizer.img_ids
            assert len(saved_img_ids) == len(img_ids)
            for saved_img_id, img_id in zip(saved_img_ids, img_ids):
                assert saved_img_id.strip() == img_id
        else:
            vokenizer.dump_img_ids(vokens_img_ids_path)

    while True:
        page_id, sents = input_queue.get()
        # Print the first few sents for debugging
        if args.gpus[0] == 0:
            if page_id < 12 and sents is not None:
                print('page_id:', page_id)
                print('batch_size:', len(sents))
                print('ids of sent[0]:', sents[0])
                print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0]))
                print()
        # print(f"Processer {args.gpus}: Get Page Id {page_id}")
        if sents is not None:
            output_str = ''
            results = vokenizer.vokenize_ids(sents)
            idxs = results[1]
            for j, idx in enumerate(idxs):
                assert len(idx[1:-1]) == len(sents[j])
                dump_idx = map(lambda x: str(x.item()), idx[1:-1])
                output_str += ' '.join(dump_idx) + '\n'

            output_queue.put((page_id, output_str))
        else:
            break


def reducer(output_fname, output_queue, total_tokens):
    next_page_id = 0
    heap = queue.PriorityQueue()
    output = open(output_fname, 'a')
    cache = ""
    start_time = None
    processed_tokens = 0

    while True:
        page_id, result = output_queue.get()
        if start_time is None:      # The clock starts to tick when receiving the first package.
            start_time = time.time()
        # print("Reducer: Get Page Id %d" % page_id)
        if result is not None:
            # Put it into the heap
            heap.put((page_id, result))

            # Check the could-be-dumped data in the queue
            while heap.qsize() > 0:
                smallest_page_id, result = heap.get()
                if smallest_page_id == next_page_id:
                    # which means that this page is the next page, thus dump it
                    # print("Reducer: Commit Page Id %d" % next_page_id)
                    processed_tokens += len(result.split(' '))
                    cache += result
                    next_page_id += 1
                else:
                    heap.put((smallest_page_id, result))
                    break
            # print("Reducer: Length of Cache Now", len(cache))
            if len(cache) > 1000000:
                # Dump for every 1000000 characters to reduce IO calls
                output.write(cache)
                output.flush()
                cache = ''
                used_time = int(time.time() - start_time)
                print("Process %d tokens, %d to go, with speed %0.2f tokens/second,"
                      "finished in %0.2f hours" % (
                    processed_tokens,
                    total_tokens - processed_tokens,
                    processed_tokens / used_time,
                    (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600
                ))
        else:
            if len(cache) > 0:
                output.write(cache)
                output.flush()
                cache = ''
            break

    output.close()


def setup_mp(args, tokens, sent_ranges, vokens_path):
    QUEUE_SIZE = 10000
    input_queue = Queue(maxsize=QUEUE_SIZE)
    output_queue = Queue(maxsize=QUEUE_SIZE)

    workers = []
    num_gpu = torch.cuda.device_count()
    for worker_id in range(args.num_workers):
        gpu_id = worker_id % num_gpu
        curr_args = copy.copy(args)
        curr_args.gpus = (gpu_id,)
        worker = Process(target=processer,
                         args=(curr_args, input_queue, output_queue))
        worker.daemon = True
        worker.start()
        workers.append(worker)

    total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0
    reduce = Process(target=reducer,
                     args=(vokens_path, output_queue, total_tokens))
    reduce.start()

    for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)):
        sents = []
        for left, right in sent_ranges[start_id: start_id + args.batch_size]:
            sents.append(tokens[left: right])
        input_queue.put((i, sents))

    # Notifying workers the end of input
    for _ in workers:
        input_queue.put((-1, None))

    # wait for workers to terminate
    for w in workers:
        w.join()

    # Notify the reducer the end of output
    output_queue.put((-1, None))

    # wait for reducer to terminate
    reduce.join()


def segment_sent(
        tokens,
        tokenizer,
        tokens_line_info_path,
        tokens_sent_info_path
    ):
    """
    Single-processed segmentation of sentences. We might need to parallel this as well.
    """
    with open(tokens_line_info_path) as f:
        line_starts = list(map(int, f.readlines()))

    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)

    sent_starts = [0]
    now = 0
    print("Now, split lines into sentences with Spacy:")
    for i in tqdm.tqdm(range(len(line_starts) - 1)):
        start_token_idx = line_starts[i]
        end_token_idx = line_starts[i + 1]
        line_tokens = tokens[start_token_idx: end_token_idx]
        line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens))
        line = line.replace("[UNK]", "UNK")

        doc = nlp(line)
        sents_len = 0
        sents = []
        for sent in doc.sents:
            if i < 2:
                print(sent)
            sent = str(sent)
            sents.append(sent)
            words = sent.split(' ')
            sent_len = len(words)
            now += sent_len
            sent_starts.append(now)
            sents_len += sent_len

        if sents_len != len(line_tokens):
            print(sents_len)
            print(sents)
            print(len(line_tokens))
            print(line)
            assert False
        assert sent_starts[-1] == end_token_idx

    with open(tokens_sent_info_path, 'w') as f:
        for sent_start in sent_starts:
            f.write(str(sent_start) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Text
    parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw')
    # Models
    parser.add_argument('--load', type=str,
                        default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                        help='The directory saved the model (containing'
                             'BEST.pth.model).')
    parser.add_argument('--output', type=str, default=None,
                        help='The directory to save the extracted feature keys.'
                             '"None" would save in the "load" dir')
    parser.add_argument('--tokenizer-name', type=str, default='roberta-base')
    # Vision: Define the vokens set
    parser.add_argument('--image-sets', type=str, default='vg_nococo',
                        help='The splits of images to be extracted')
    parser.add_argument('--max-img-num', type=int, default=50000,
                        help='number of images used. -1 means all images.')
    # Speed Up Options:
    parser.add_argument('--num-workers', type=int, default=-1,
                        help='-1 will use all GPUs.')
    parser.add_argument('--batch-size', type=int, default=16,
                        help='The # of sentences in a batch.')
    args = parser.parse_args()

    if args.num_workers == -1:
        args.num_workers = torch.cuda.device_count()

    if args.output is None:
        args.output = os.path.join(args.load, 'vokens')
    os.makedirs(args.output, exist_ok=True)

    dset_name = os.path.split(args.corpus)[-1]
    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])
    print()
    print("Main Th"
          "read: Build a virtual vokenizer to check the number of images.")
    keys_dir = args.load + '/keys'  # Save the keys with the model dict
    virtual_vokenizer = Vokenizer(
        None, None, keys_dir,
        img_sets=img_sets, max_img_num=args.max_img_num,
        gpus=(-1,), sent_level=('sent' in args.load))
    modifier = f".{virtual_vokenizer.img_num}" if virtual_vokenizer.img_num != 50000 else ""
    vokens_path = os.path.join(
        args.output,
        f"{dset_name}.{'_'.join(img_sets)}{modifier}"
    )
    tokens_hdf5_path = f'{args.corpus}.{args.tokenizer_name}.hdf5'
    tokens_sent_info_path = f'{args.corpus}.{args.tokenizer_name}.sent'

    # "Load" tokens from hdf5
    tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r')
    tokens = tokens_hdf5['tokens']

    # Calibrate the start line if the vokens have been proceeded.
    if not os.path.exists(tokens_sent_info_path):
        tokens_line_info_path = f'{args.corpus}.{args.tokenizer_name}.line'
        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
        segment_sent(
            tokens,
            tokenizer,
            tokens_line_info_path,
            tokens_sent_info_path
        )

    # Load sent info and find the start sentence
    with open(tokens_sent_info_path) as f:
        sent_starts = list(map(int, f.readlines()))

    # Skip the sentences which have been extracted.
    extracted_tokens = 0
    if os.path.isfile(vokens_path):
        with open(vokens_path, 'r') as g:
            for g_line in tqdm.tqdm(g):
                extracted_tokens += len(g_line.strip().split(' '))
    try:
        start_sent_idx = sent_starts.index(extracted_tokens)
    except ValueError as e:
        print("The extracted tokens does not match a starting sentence.")
        print(e)

    # Start to vokenize
    print("Main Thread: Dump visual tokens to %s" % vokens_path)
    print("Main Thread: Start vokenization from the %d'th token" % sent_starts[start_sent_idx])

    sent_ranges = []
    for i in range(start_sent_idx, len(sent_starts) - 1):
        left_token_idx = sent_starts[i]
        right_token_idx = sent_starts[i + 1]
        sent_ranges.append((left_token_idx, right_token_idx))

    setup_mp(args, tokens, sent_ranges, vokens_path)

    # save into hdf5 file
    if os.path.exists(vokens_path + '.hdf5'):
        print("The hdf5 file %s already exists. So the hdf5 file is not converted."
              % (vokens_path + '.hdf5'))
    else:
        with open(args.corpus + '.' + args.tokenizer_name + ".sent") as f:
            for i, line in enumerate(f):
                pass
            num_tokens = int(line)
            num_sents = i

        h5_file = h5py.File(vokens_path + '.hdf5', 'w')
        dset = h5_file.create_dataset("vokens", (num_tokens,), dtype='int32')

        dump_interval = 100000
        dump_iter = 0
        lines = 0

        with open(vokens_path) as f:
            tokens = []
            for line in tqdm.tqdm(f, total=num_sents):
                for token in map(int, line.split(' ')):
                    tokens.append(token)
                if len(tokens) >= dump_interval:
                    dset[dump_iter: dump_iter + len(tokens)] = tokens
                    dump_iter += len(tokens)
                    tokens = []
                lines += 1
            dset[dump_iter: dump_iter + len(tokens)] = tokens
            dump_iter += len(tokens)
            assert num_tokens == dump_iter
            print(lines, num_sents)
            assert lines == num_sents
        h5_file.close()


================================================
FILE: xmatching/__init__.py
================================================


================================================
FILE: xmatching/data.py
================================================
# coding=utf-8
import json
from pathlib import Path
import random

from torch.utils.data import Dataset
from torchvision.datasets.folder import default_loader

from PIL import Image
Image.MAX_IMAGE_PIXELS = None

TINY_IMG_NUM = 1000
FAST_IMG_NUM = 10000

lxrt_imgsplits = {
    'mscoco_train',
    'mscoco_nominival',
    'vgnococo',
    'mscoco_minival',
}
lxrt_langsplits = {
    'mscoco', 'vg', 'vqa', 'gqa', 'visual7w'
}
cc_imgsplits = {
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}
cc_langsplits = {
    'cc',
}

CC_ROOT = 'data/cc'
COCO_ROOT = 'data/mscoco'
VG_ROOT = '/ssd-playpen/data/vg'
LXRT_ROOT = 'data/lxmert'


def make_uid(img_id, source, sent_id):
    """
    see the descriptions in function 'make_datum'
    """
    return "%s:%s:%s" % (img_id, source, sent_id)


def get_img_path(source, img_id):
    if source == 'cc':
        split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (CC_ROOT, split_tag, img_id)
    elif 'COCO' in img_id:
        _, split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (COCO_ROOT, split_tag, img_id + '.jpg')
    else:   # VG images
        return "%s/images/%s.jpg" % (VG_ROOT, img_id)


def make_datum(source: str, img_id: str, sent_id: int, sent: str):
    """
    Create a datum from the provided infos.
    :param source: the dataset of the particular sentence.
    :param img_id: id of the image
    :param sent_id: id of the sentence (of the image)
    :param sent: the sentence
    :return: a dict of datum
    """
    uid = make_uid(img_id, source, sent_id)
    img_path = get_img_path(source, img_id)
    return {
        'uid': uid,
        'img_id': img_id,
        'img_path': img_path,
        'sent': sent,
    }


class ImgSentDataset:
    def __init__(self, img_splits: str, lang_splits: str, tiny=False, fast=False):
        """
        :param split: train, valid, test
        :param sources: The data sources to be loaded, separated by comma.
                       from: mscoco, cc, vg, vqa, gqa, visual7w
                             'vg' stands for visual genome captions
                             'cc' stands for conceptual captions.
                       example: 'mscoco, vg'
        """
        self.img_splits = [img_split.lower().strip() for img_split in img_splits.split(',')]
        self.lang_splits = [lang_split.lower().strip() for lang_split in lang_splits.split(',')]
        self.data = []

        debug_imgs = -1
        if tiny:
            debug_imgs = TINY_IMG_NUM
        elif fast:
            debug_imgs = FAST_IMG_NUM

        # Loading LXRT data (i.e., COCO Cap, VQA, GQA, VG Cap, VG QA (visual7w))
        lxrt_data = []
        lxrt_path = Path(LXRT_ROOT)
        for img_split in self.img_splits:
            if img_split in lxrt_imgsplits:
                fname = img_split + ".json"
                if debug_imgs > 0 and fname != 'mscoco_nominival.json' \
                        and fname != 'mscoco_minival.json':  # Only load nominival when debugging
                    continue
                lxrt_data.extend(json.load((lxrt_path / fname).open()))

        for i, lxrt_datum in enumerate(lxrt_data):
            img_id = lxrt_datum['img_id']
            for lang_split in self.lang_splits:
                if lang_split in lxrt_datum['sentf']:
                    sents = lxrt_datum['sentf'][lang_split]
                    for j, sent in enumerate(sents):
                        self.data.append(make_datum(lang_split, img_id, j, sent))
                        if debug_imgs > 0:  # Only load one sentence if debugging
                            break
            if i+1 == debug_imgs:             # Load top #debug_imgs images
                break

        # Loading Conceptual Caption (CC) data
        for img_split in self.img_splits:
            if img_split in cc_imgsplits:
                cc_path = Path(CC_ROOT)
                for fname in cc_imgsplits[img_split]:
                    for i, line in enumerate((cc_path / fname).open()):
                        sent, img_id = line.split('\t')
                        self.data.append(make_datum('cc', img_id.strip(), 0, sent))
                        if i+1 == debug_imgs:
                            break

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]

    def shuffle(self):
        random.seed(9595)
        random.shuffle(self.data)


class ImgSentTorchDataset(Dataset):
    def __init__(self,
                 dataset: ImgSentDataset,
                 img_transform,
                 tokenizer,
                 sent_len: int):
        super().__init__()
        self.raw_dataset = dataset
        self.img_transform = img_transform
        self.tokenizer = tokenizer
        self.sent_len = sent_len

    def __len__(self):
        return len(self.raw_dataset)

    def __getitem__(self, item: int):
        datum = self.raw_dataset[item]

        uid = datum['uid']
        img_id = datum['img_id']
        img_path = datum['img_path']
        sent = datum['sent']

        # Step 1: Load and pre-process the image
        try:
            pil_img = default_loader(img_path)
        except Exception as e:
            print(e)
            print(img_path)
            return self.__getitem__((item + 95) % self.__len__())
        tensor_img = self.img_transform(pil_img)

        # Step 2: Tokenization (to integers) and Padding
        encoded_sent = self.tokenizer.encode_plus(
            sent,
            add_special_tokens=True,
            max_length=self.sent_len,
            truncation=True,
            # pad_to_max_length=True,
            padding='max_length',
            return_tensors='pt'     # Return PyTorch (pt) tensors
        )
        input_ids = encoded_sent['input_ids'].squeeze()
        attention_mask = encoded_sent['attention_mask'].squeeze()
        # print('sent', sent)
        # print('input_ids', input_ids)
        # print('attention_mask', attention_mask)

        return uid, (input_ids, attention_mask, ), (tensor_img, )


================================================
FILE: xmatching/frozen_batch_norm.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved

# Note: This file is copied from https://github.com/facebookresearch/detectron2/blob/master/detectron2/layers/batch_norm.py
# to avoid any future change from that project.

import torch
from torch import nn
from torch.nn import functional as F


class FrozenBatchNorm2d(nn.Module):
    """
    BatchNorm2d where the batch statistics and the affine parameters are fixed.
    It contains non-trainable buffers called
    "weight" and "bias", "running_mean", "running_var",
    initialized to perform identity transformation.
    The pre-trained backbone models from Caffe2 only contain "weight" and "bias",
    which are computed from the original four parameters of BN.
    The affine transform `x * weight + bias` will perform the equivalent
    computation of `(x - running_mean) / sqrt(running_var) * weight + bias`.
    When loading a backbone model from Caffe2, "running_mean" and "running_var"
    will be left unchanged as identity transformation.
    Other pre-trained backbone models may contain all 4 parameters.
    The forward is implemented by `F.batch_norm(..., training=False)`.
    """

    _version = 3

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features) - eps)

    def forward(self, x):
        if x.requires_grad:
            # When gradients are needed, F.batch_norm will use extra memory
            # because its backward op computes gradients for weight/bias as well.
            scale = self.weight * (self.running_var + self.eps).rsqrt()
            bias = self.bias - self.running_mean * scale
            scale = scale.reshape(1, -1, 1, 1)
            bias = bias.reshape(1, -1, 1, 1)
            return x * scale + bias
        else:
            # When gradients are not needed, F.batch_norm is a single fused op
            # and provide more optimization opportunities.
            return F.batch_norm(
                x,
                self.running_mean,
                self.running_var,
                self.weight,
                self.bias,
                training=False,
                eps=self.eps,
            )

    def __repr__(self):
        return "FrozenBatchNorm2d(num_features={}, eps={})".format(self.num_features, self.eps)

    @classmethod
    def convert_frozen_batchnorm(cls, module):
        """
        Convert BatchNorm/SyncBatchNorm in module into FrozenBatchNorm.
        Args:
            module (torch.nn.Module):
        Returns:
            If module is BatchNorm/SyncBatchNorm, returns a new module.
            Otherwise, in-place convert module and return it.
        Similar to convert_sync_batchnorm in
        https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py
        """
        bn_module = nn.modules.batchnorm
        bn_module = (bn_module.BatchNorm2d, bn_module.SyncBatchNorm)
        res = module
        if isinstance(module, bn_module):
            res = cls(module.num_features)
            if module.affine:
                res.weight.data = module.weight.data.clone().detach()
                res.bias.data = module.bias.data.clone().detach()
            res.running_mean.data = module.running_mean.data
            res.running_var.data = module.running_var.data
            res.eps = module.eps
        else:
            for name, child in module.named_children():
                new_child = cls.convert_frozen_batchnorm(child)
                if new_child is not child:
                    res.add_module(name, new_child)
        return res

================================================
FILE: xmatching/loss.py
================================================
import torch


def hinge(x):
    return torch.clamp(x, min=0.)


def paired_hinge_rank_loss(
        lang_output: torch.Tensor,
        visn_output: torch.Tensor,
        lang_mask: torch.Tensor,
        margin: float,
):
    """
    Consider the first half as positive and the second half as negative.
    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param margin: margin in the ranking loss
    :return: a scalar loss
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]
    assert margin > 0.

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)      # [b, 1, hid_dim]

    # Split to positive and negative sets.
    half_batch_size = batch_size // 2
    pos_lang, neg_lang = torch.split(lang_output, half_batch_size, dim=0)
    pos_visn, neg_visn = torch.split(visn_output, half_batch_size, dim=0)

    # Calculate positive and negative scores.
    true_pos_score = (pos_lang * pos_visn).sum(-1)           # [batch_size / 2, max_len]
    true_neg_score = (neg_lang * neg_visn).sum(-1)           # [batch_size / 2, max_len]
    false_pos_score = (pos_lang * neg_visn).sum(-1)          # [batch_size / 2, max_len]
    false_neg_score = (neg_lang * pos_visn).sum(-1)          # [batch_size / 2, max_len]

    # Hinge Loss
    float_lang_mask = lang_mask.type(lang_output.dtype)      # Either fp16 or fp32
    pos_lang_mask, neg_lang_mask = torch.split(float_lang_mask, half_batch_size, dim=0)
    pos_loss = hinge(margin - true_pos_score + false_pos_score) * pos_lang_mask
    neg_loss = hinge(margin - true_neg_score + false_neg_score) * neg_lang_mask

    # Averaging
    cnt = float_lang_mask.sum()    # Number of words.
    loss = (pos_loss.sum() + neg_loss.sum()) / cnt

    return loss


def batchwise_hinge_rank_loss(
        lang_output: torch.Tensor,
        visn_output: torch.Tensor,
        lang_mask: torch.Tensor,
        margin: float,
):
    """
    Consider all un-matched pairs in the batch as negative samples.
    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param margin: margin in the ranking loss
    :return: a scalar loss
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]
    assert margin > 0.

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]

    # The score of positive pairs
    positive_score = (lang_output * visn_output.unsqueeze(1)).sum(-1)    # [b, max_len]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zero-graded in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    # Calculate of the hinge rank loss, let me explain why it works:
    # For the diagonal, the scores are for positive, we thus create a positive_mask to neglect these scores.
    #   max(0., margin - x^T x + (x^T x - 2 margin) )
    # = max(0., -margin)
    # = 0.      , since we have made sure that margin > 0
    # During backwards, the operator max(0., -margin) would raise a grad of 0 to the operand "-margin",
    #   thus it is just what we want.
    float_lang_mask = lang_mask.type(lang_output.dtype)       # Either fp16 or fp32
    positive_mask = torch.eye(batch_size)
    negative_scores = negative_scores - positive_mask.unsqueeze(-1) * margin * 2
    lang_loss = hinge(margin - positive_score.unsqueeze(1) + negative_scores) * float_lang_mask.unsqueeze(1)
    visn_loss = hinge(margin - positive_score.unsqueeze(0) + negative_scores) * float_lang_mask.unsqueeze(1)

    # Averaging
    # Each sentence is duplicated by batch_size thus the total length is also multiplied by this term.
    cnt = max(float_lang_mask.sum() * batch_size, 1.)    # Number of words.
    lang_loss = lang_loss.sum() / cnt
    visn_loss = visn_loss.sum() / cnt

    return lang_loss + visn_loss


================================================
FILE: xmatching/main.py
================================================
import collections
import os
import pickle
import sys

import torch
import torch.multiprocessing as mp
import torchvision.transforms as transforms
import torch.nn as nn
import torch.distributed as dist
import tqdm
from transformers import BertTokenizer

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from xmatching.data import ImgSentDataset, ImgSentTorchDataset
from xmatching.loss import paired_hinge_rank_loss
from xmatching.metric import batchwise_accuracy, batchwise_recall
from xmatching.model import LangModel, VisnModel, JointModel, LANG_MODELS
from xmatching.param import parse_args


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Using all available gpus for multi-processing distributed
    args = parse_args()
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
    # mp.spawn(setup, nprocs=args.gpus, args=(args,))
    # args.world_size = args.gpus * args.nodes
    mp.spawn(train, nprocs=args.gpus, args=(args,))


def train(gpu, args):
    device = torch.device('cuda', gpu)
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )

    # Models
    lang_layers = list(map(lambda x: -int(x), args.lang_layers.split(',')))     # The layers concated as the output.
    lang_model = LangModel(args.dim, arch=args.lang, layers=lang_layers,
                           pretrained=args.lang_pretrained, finetuning=args.lang_finetune)
    visn_model = VisnModel(args.dim, arch=args.visn,
                           pretrained=args.visn_pretrained, finetuning=args.visn_finetune)
    # The use of joint model would help synchronization in distributed learning.
    model = JointModel(lang_model, visn_model)

    # Since we will disallow the broadcast of buffers in DDP
    # we want make sure that there are no buffers besides batch normalization and position id.
    for name, buffer in model.named_buffers():
        assert 'bn' in name or 'downsample' in name or "position_ids" in name

    if args.load is not None:
        state_dict = torch.load(args.load, map_location=device)
        new_state_dict = {}
        for key, value in state_dict.items():        # If the ddp state_dict is saved
            if 'num_batches_tracked' not in key:
                if key.startswith("module."):
                    new_state_dict[key[len("module."):]] = state_dict[key]
                else:
                    new_state_dict[key] = state_dict[key]
        model_keys = set(model.state_dict().keys())
        load_keys = set(new_state_dict.keys())
        print("Keys in model but not in load:")
        for key in sorted(model_keys - load_keys):
            print(key)
        print("Keys in load but not in model:")
        for key in sorted(load_keys - model_keys):
            print(key)
        model.load_state_dict(new_state_dict)

    # Pre-processing Hyper-Params
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize
    ])
    valid_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize
    ])
    Model, Tokenizer, weight = LANG_MODELS[args.lang]
    tokenizer = Tokenizer.from_pretrained(weight)
    # tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    max_len = args.max_len

    # Dump the pre-processing objs for future feature extractions.
    if gpu == 0:
        pickle.dump(tokenizer, open(
            os.path.join(args.output, 'tokenizer.pkl'), 'wb'))
        pickle.dump(valid_transform, open(
            os.path.join(args.output, 'img_transform.pkl'), 'wb'))

    # Data Sets
    train_set = ImgSentDataset(args.train_imgs, args.train_langs, tiny=args.tiny, fast=args.fast)
    train_tset = ImgSentTorchDataset(
        train_set, train_transform, tokenizer, max_len
    )
    print("GPU %d: load %d data in training." % (gpu, len(train_set)))
    valid_set = ImgSentDataset(args.valid_imgs, args.valid_langs, tiny=args.tiny, fast=args.fast)
    valid_set.shuffle()         # Valid set only gets shuffled once!!!
    print("GPU %d: load %d data in validation." % (gpu, len(valid_set)))
    valid_tset = ImgSentTorchDataset(
        valid_set, valid_transform, tokenizer, max_len
    )
    print()

    # Data Loader
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_tset,
        num_replicas=args.world_size,
        rank=rank,
        shuffle=True,
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=train_tset,
        batch_size=(args.batch_size // args.world_size),
        shuffle=False,          # Will be shuffled in the sampler.
        num_workers=max(args.num_workers // args.world_size, 1),
        pin_memory=True,
        sampler=train_sampler,
        drop_last=True
    )

    valid_loader = torch.utils.data.DataLoader(
        dataset=valid_tset,
        batch_size=256,             # Fix batch_size to have stable batchwise evaluations.
        shuffle=False,
        num_workers=args.num_workers,
        pin_memory=True,
        drop_last=True
    )

    if args.optim == 'bert':
        from transformers import AdamW, get_linear_schedule_with_warmup
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": 0.01,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr, eps=1e-8)
        t_total = len(train_loader) * args.epochs
        warmup_steps = int(t_total * args.warmup_ratio)
        print("Train for %d steps and warm up for %d steps" % (t_total, warmup_steps))
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
        )
    else:
        if args.optim == 'sgd':
            optimizer = args.optimizer(
                [param for param in model.parameters() if param.requires_grad],
                args.lr,
                momentum=0.9
            )
        else:
            optimizer = args.optimizer(
                [param for param in model.parameters() if param.requires_grad],
                args.lr,
                # momentum=0.9
            )

    # Loss and optimizer
    criterion = paired_hinge_rank_loss
    torch.cuda.set_device(gpu)
    model.cuda(gpu)

    if args.fp16:
        try:
            from apex import amp
            from apex.parallel import DistributedDataParallel as DDP
            model, optimizer = amp.initialize(model, optimizer,
                                              opt_level='O2')
            # Defautly, current apex DDP would not broadcast the buffers.
            model = DDP(model)
        except Exception as e:
            print(e)
            print("Please install apex library")
            return
    else:
        # Note that we disallow broad cast buffers here to reduce communication cost.
        model = nn.parallel.DistributedDataParallel(
            model,
            device_ids=[gpu],
            find_unused_parameters=True,
            broadcast_buffers=False,
        )

    if args.test_only or args.load:     # Test the loading performance
        if gpu == 0:
            print("Test: GPU %d will test %d data in %d iterations." %
                  (gpu, len(valid_loader) * 256, len(valid_loader)))
            results = valid(args, model, criterion, valid_loader)
            print("Initial test results:")
            for key, value in results.items():
                print('\t%s: %0.4f' % (key, value))
        if args.test_only:
            exit()

    best_valid_loss = 9595.
    for epoch in range(args.epochs):
        if gpu == 0:
            print("Training of Epoch %d: GPU %d will process %d data in %d iterations." %
                  (epoch, gpu, len(train_loader) * args.batch_size // args.world_size, len(train_loader)))
        prev_loss = total_loss = 0.
        for i, (uid, lang_input, visn_input) in enumerate(tqdm.tqdm(train_loader, disable=(gpu!=0))):
            # Currently, lang_input is the (input_ids, attention_mask)
            # visn_input is (tensor_img)
            lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)
            visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)

            # Forward pass
            model.zero_grad()
            lang_output, visn_output = model(lang_input, visn_input)
            loss = criterion(lang_output, visn_output, lang_input[1], args.margin)
            total_loss += loss.item()

            # Backward
            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            # Step
            if args.fp16:
                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 5.)
            else:
                torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
            optimizer.step()
            if args.optim == 'bert':
                scheduler.step()

            # # Logging
            # interval = 100
            # if (i+1) % interval == 0:
            #     print("GPU %d Epoch %d Iter %d: Training Loss %0.4f" %
            #           (gpu, epoch, i+1, (total_loss - prev_loss) / interval))
            #     prev_loss = total_loss
        if gpu == 0:
            print("GPU %d Epoch %d: Total Training Loss %0.4f" % (gpu, epoch, total_loss / len(train_loader)))
            print()
            print("Validation: GPU %d will process %d data in %d iterations." %
                  (gpu, len(valid_loader) * 256, len(valid_loader)))
            results = valid(args, model, criterion, valid_loader, use_tqdm=True)
            for key, value in results.items():
                print('\t%s: %0.4f' % (key, value))
            if results['loss'] < best_valid_loss:
                best_valid_loss = results['loss']
                snap_path = os.path.join(args.output, 'BEST.pth')
                print("GPU 0: Save snapshot to ", snap_path)
                torch.save(model.module.state_dict(), snap_path)
                torch.save(model.module, snap_path + '.model')
            print("BEST valid loss %0.4f" % best_valid_loss)
            print()


def valid(args, model, criterion, valid_loader, use_tqdm=True):
    model.eval()
    results = collections.defaultdict(lambda: 0)
    iterator = tqdm.tqdm(valid_loader) if use_tqdm else valid_loader
    for i, (uid, lang_input, visn_input) in enumerate(iterator):
        # Currently, lang_input is the (input_ids, attention_mask)
        # visn_input is (tensor_img)
        lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)
        visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)

        with torch.no_grad():
            # Forward pass
            lang_output, visn_output = model(lang_input, visn_input)

            # Evaluation
            results['loss'] += criterion(lang_output, visn_output, lang_input[1], args.margin).item()
            recall_results = batchwise_recall(lang_output, visn_output, lang_input[1], recalls=(1, 5, 10))
            for key, value in recall_results.items():
                results['R%d' % key] += value

    for key in results:
        results[key] = results[key] / len(valid_loader)
    model.train()

    return results


if __name__ == "__main__":
    main()


================================================
FILE: xmatching/metric.py
================================================
import torch


def batchwise_accuracy(lang_output, visn_output, lang_mask):
    """
    Calculate the accuracy of contextual word retrieval, average by batch.
    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :return:
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zero-graded in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    max_neg_score, max_neg_idx = negative_scores.max(1)        # [batch, max_len], the batch_idx of max-aligned img
    pos_idx = torch.arange(0, batch_size, dtype=torch.int64).to(lang_output.device)

    correct = (pos_idx.unsqueeze(1) == max_neg_idx)
    bool_lang_mask = lang_mask.type(correct.dtype)
    correct = correct * bool_lang_mask
    correct_num = correct.sum()

    accuracy = correct_num * 1. / bool_lang_mask.sum()

    return accuracy


def batchwise_recall(lang_output, visn_output, lang_mask, recalls=(1,)):
    """
    Calculate the accuracy of contextual word retrieval, average by batch.
    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param recall: a list, which are the number of recalls to be evaluated.
    :return:
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]

    # The score of positive pairs
    positive_score = (lang_output * visn_output).sum(-1)    # [b, max_len]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zero-graded in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    result = {}
    for recall in recalls:
        kthscore, kthidx = torch.kthvalue(negative_scores, batch_size - recall, dim=1)     # [b, max_len]
        # print(kthscore.shape) print(positive_score.shape)
        correct = (positive_score >= kthscore)                                # [b, max_len]
        bool_lang_mask = lang_mask.type(correct.dtype)
        correct = correct * bool_lang_mask
        correct_num = correct.sum()
        # print(correct_num)
        # print(bool_lang_mask.sum())
        result[recall] = (correct_num * 1. / bool_lang_mask.sum()).item()

    return result


================================================
FILE: xmatching/model.py
================================================
import torch
from torch import nn
import torchvision.models as models
from transformers import *

from .frozen_batch_norm import FrozenBatchNorm2d


LANG_MODELS = {
          'bert':    (BertModel,       BertTokenizer,       'bert-base-uncased'),
          'bert-large':  (BertModel,       BertTokenizer,       'bert-large-uncased'),
          'gpt':     (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          'gpt2':    (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          'ctrl':    (CTRLModel,       CTRLTokenizer,       'ctrl'),
          'xl':      (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          'xlnet':   (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          'xlm':     (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          'distil':  (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
          'roberta': (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          'xlm-roberta': (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
}


def get_visn_arch(arch):
    try:
        return getattr(models, arch)
    except AttributeError as e:
        print(e)
        print("There is no arch %s in torchvision." % arch)


class VisnModel(nn.Module):
    def __init__(self, dim, arch='resnet50', pretrained=True, finetuning=False):
        """
        :param dim: dimension of the output
        :param arch: backbone architecture,
        :param pretrained: load feature with pre-trained vector
        :param finetuning: finetune the model
        """
        super().__init__()
        self.finetuning = finetuning

        # Setup Backbone
        resnet = get_visn_arch(arch)(pretrained=pretrained)
        backbone_dim = resnet.fc.in_features
        if not self.finetuning:
            for param in resnet.parameters():
                param.requires_grad = False
        resnet.fc = nn.Identity()
        self.backbone = resnet

        # Surgery on the Networks
        # 1. Frozen Batch Norm
        #    Note that BatchNorm modules have been in-place replaced!
        #    This piece of code is copied from Detectron2, and it was copied from mask-rcnn?
        self.backbone = FrozenBatchNorm2d.convert_frozen_batchnorm(
            self.backbone)
        # print(self.backbone)
        # 2. Frozen the first two (blocks of) layers
        for module in [self.backbone.conv1,
                       self.backbone.layer1]:
            for param in module.parameters():
                param.requires_grad = False

        print(f"Visn Model: {arch}, Finetune: {finetuning}, Pre-trained: {pretrained}")
        print(f"Visn Model: backbone dim {backbone_dim} --> output dim {dim}")

        # Setup follow-up layers
        self.mlp = nn.Sequential(
            nn.Linear(backbone_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, dim),
        )

    def forward(self, img):
        """
        :param img: a tensor of shape [batch_size, H, W, C]
        :return: a tensor of [batch_size, d]
        """
        if not self.finetuning:
            with torch.no_grad():
                x = self.backbone(img)
                x = x.detach()
        else:
            x = self.backbone(img)
        x = self.mlp(x)         # [b, dim]
        x = x / x.norm(2, dim=-1, keepdim=True)
        return x


class LangModel(nn.Module):
    def __init__(self, dim, arch='BERT', layers=(-1,), pretrained=True, finetuning=False):
        """
        :param dim: dimension of the output
        :param arch: backbone architecture,
        :param aggregate: one of 'last4',
        :param pretrained: load feature with pre-trained vector
        :param finetuning: finetune the model
        """
        super().__init__()
        self.finetuning = finetuning

        # Setup Backbone
        Model, Tokenizer, weight = LANG_MODELS[arch]
        bert = Model.from_pretrained(
            weight,
            output_hidden_states=True
        )
        if not pretrained:
            bert.init_weights()

        if not self.finetuning:
            for param in bert.parameters():
                param.requires_grad = False
        backbone_dim = bert.config.hidden_size
        self.backbone = bert
        self.layers = sorted(layers)

        print(f"Language Model: {arch} with weight {weight}; Fine-tuning: {finetuning}, Pre-trained: {pretrained}.")
        print(f"Language Model: using layers {self.layers}, result in backbone dim {backbone_dim * len(self.layers)} "
              f"--> output dim {dim}.")

        # Setup follow-up layers
        self.mlp = nn.Sequential(
            nn.Linear(backbone_dim * len(self.layers), 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, dim),
        )

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        """
        :param input_ids: [batch_size, max_len]
        :param attention_mask: [batch_size, max_len]
        :param token_type_ids: [batch_size, max_len]
        :return: [batch_size, max_len, dim]
        """
        if not self.finetuning:
            with torch.no_grad():
                x = self.backbone(
                    input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                )
        else:
            x = self.backbone(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )

        # sequence_output, pooled_output, (hidden_states), (attentions) --> seq_output
        if type(self.backbone) is XLNetModel:
            output, hidden_states = x[:2]
        else:
            output, pooled_output, hidden_states = x[:3]

        # gather the layers
        if type(self.backbone) is XLNetModel:
            x = torch.cat(list(hidden_states[layer].permute(1, 0, 2) for layer in self.layers), -1)
        else:
            x = torch.cat(list(hidden_states[layer] for layer in self.layers), -1)

        if not self.finetuning:
            x = x.detach()

        # [batch_size, max_len, backbone_dim] -->
        # [batch_size, max_len, output_dim]
        x = self.mlp(x)
        x = x / x.norm(2, dim=-1, keepdim=True)
        return x


class JointModel(nn.Module):
    def __init__(self, lang_model, visn_model):
        super().__init__()
        self.lang_model = lang_model
        self.visn_model = visn_model

    def forward(self, lang_input, visn_input):
        lang_output = self.lang_model(*lang_input)
        visn_output = self.visn_model(*visn_input)
        return lang_output, visn_output


================================================
FILE: xmatching/param.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.
# Copyleft 2019 project LXRT.

import argparse
import random

import numpy as np
import torch


def get_optimizer(optim):
    # Bind the optimizer
    if optim == 'rms':
        # print("Optimizer: Using RMSProp")
        optimizer = torch.optim.RMSprop
    elif optim == 'adam':
        # print("Optimizer: Using Adam")
        optimizer = torch.optim.Adam
    elif optim == 'adamax':
        # print("Optimizer: Using Adamax")
        optimizer = torch.optim.Adamax
    elif optim == 'sgd':
        # print("Optimizer: sgd")
        optimizer = torch.optim.SGD
    elif 'bert' in optim:
        optimizer = 'bert'      # The bert optimizer will be bind later.
    else:
        assert False, "Please add your optimizer %s in the list." % optim

    return optimizer


def parse_args():
    parser = argparse.ArgumentParser()

    # Data Splits
    parser.add_argument("--sources", default='mscoco', help="mscoco, cc, vg, vqa, gqa, visual7w")
    parser.add_argument("--train-imgs", default='mscoco_train,mscoco_nominival,vg_nococo')
    parser.add_argument("--valid-imgs", default='mscoco_minival')
    parser.add_argument("--train-langs", default='mscoco',
                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w.'
                             'split by comma')
    parser.add_argument("--valid-langs", default='mscoco',
                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w.'
                             'split by comma')
    parser.add_argument("--test", default=None)
    parser.add_argument("--test-only", action='store_true')

    # Datasets Configuration
    parser.add_argument("--fast", action='store_true')
    parser.add_argument("--tiny", action='store_true')
    parser.add_argument("--max-len", default=20, type=int)

    # Training Hyper-parameters
    parser.add_argument('--batchSize', dest='batch_size', type=int, default=256)
    parser.add_argument('--optim', default='bert')
    parser.add_argument('--lr', type=float, default=1e-4)
    parser.add_argument('--warmup-ratio', type=float, default=0.05)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--dropout', type=float, default=0.1)
    parser.add_argument('--seed', type=int, default=9595, help='random seed')
    parser.add_argument("--fp16", action='store_true')

    # Model Hyper-parameters
    parser.add_argument('--visn', type=str, default='resnext101_32x8d', help='The vision backbone model.')
    parser.add_argument('--lang', type=str, default='bert', help='The language backbone model.')
    parser.add_argument('--lang-layers', type=str, default='-1', help='The language backbone model.')
    parser.add_argument('--dim', type=int, default=64, help='The output dim of the joint emb.')

    # Model Loading
    parser.add_argument('--load', type=str, default=None,
                        help='Load the model (usually the fine-tuned model).')
    parser.add_argument('--lang-finetune', action='store_true', help='finetune the language encoder.')
    parser.add_argument('--visn-finetune', action='store_true', help='finetune the visual encoder.')
    parser.add_argument('--lang-pretrained', action='store_true', help='Use the pre-trained language encoder.')
    parser.add_argument('--visn-pretrained', action='store_true', help='Use the pre-trained visual encoder.')

    # Optimization
    parser.add_argument("--margin", default=0.5, type=float, help='The margin in the hinge losses.')
    parser.add_argument("--loss", dest='loss', default='paired_hinge',
                        type=str)

    # Training configuration
    parser.add_argument("--num-workers", default=0, type=int)
    parser.add_argument('--output', type=str, default='snap/test')

    # Distributed Training Configuration
    parser.add_argument('-n', '--nodes', default=1,
                        type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')

    # Parse the arguments.
    args = parser.parse_args()

    # Bind optimizer class.
    args.optimizer = get_optimizer(args.optim)

    # Set seeds
    torch.manual_seed(args.seed)
    random.seed(args.seed)
    np.random.seed(args.seed)

    return args


# args = parse_args()