Repository: airsplay/vokenization
Branch: master
Commit: 5601b799184e
Files: 70
Total size: 295.5 KB

Directory structure:
vokenization/

├── LICENSE
├── README.md
├── data/
│   ├── lxmert/
│   │   └── .gitignore
│   ├── mscoco/
│   │   └── .gitignore
│   ├── vg/
│   │   └── .gitignore
│   ├── wiki/
│   │   ├── get_data_cased.bash
│   │   ├── get_data_cased_untokenized.bash
│   │   ├── install-tools.sh
│   │   └── tools/
│   │       ├── remove_accent.py
│   │       ├── segment_th.py
│   │       └── tokenize.sh
│   └── wiki103/
│       ├── get_data_cased.sh
│       └── get_data_uncased.sh
├── requirements.txt
├── scripts/
│   ├── base_vlm_wiki.bash
│   ├── base_vlm_wiki_glue.bash
│   ├── base_wiki.bash
│   ├── base_wiki_glue.bash
│   ├── extract_keys.bash
│   ├── mpvokenize_wiki.bash
│   ├── mpvokenize_wiki103.bash
│   ├── run_glue_at_epoch.bash
│   ├── run_glue_epochs.bash
│   ├── run_xmatching.bash
│   ├── small_vlm_wiki103.bash
│   ├── small_vlm_wiki103_glue.bash
│   ├── small_wiki103.bash
│   ├── small_wiki103_glue.bash
│   └── xmatching_benchmark.bash
├── snap/
│   ├── bert/
│   │   └── .gitkeep
│   ├── vlm/
│   │   └── .gitkeep
│   └── xmatching/
│       └── .gitkeep
├── tokenization/
│   ├── to_hdf5.py
│   ├── tokenize_dataset.py
│   ├── tokenize_wiki103_bert.bash
│   ├── tokenize_wiki103_roberta.bash
│   ├── tokenize_wiki_bert.bash
│   └── tokenize_wiki_roberta.bash
├── vlm/
│   ├── __init__.py
│   ├── configs/
│   │   ├── bert-12L-768H.json
│   │   ├── bert-4L-768H.json
│   │   ├── bert-6L-512H.json
│   │   └── bert_base.json
│   ├── data.py
│   ├── model.py
│   ├── param.py
│   ├── run_glue.py
│   ├── run_glue_epochs.py
│   ├── run_lm_distributed.py
│   ├── run_vlm_distributed.py
│   └── show_glue_results_epochs.py
├── vokenization/
│   ├── __init__.py
│   ├── common.py
│   ├── create_image_ids.py
│   ├── evaluate_diversity.py
│   ├── evaluate_retrieval.py
│   ├── extract_vision_keys.py
│   ├── indexing.py
│   ├── revokenization.py
│   ├── revokenize_corpus_mp.py
│   ├── vokenization.py
│   └── vokenize_corpus_mp.py
└── xmatching/
    ├── __init__.py
    ├── data.py
    ├── frozen_batch_norm.py
    ├── loss.py
    ├── main.py
    ├── metric.py
    ├── model.py
    └── param.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Hao Tan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Vokenization

PyTorch code for the EMNLP 2020 paper "[Vokenization: Improving Language Understanding with Contextualized, 
Visual-Grounded Supervision](https://arxiv.org/pdf/2010.06775.pdf)" (Hao Tan and Mohit Bansal).

**Outline**
* [Contextualized Cross-Modal Matching](#contextualized-cross-modal-matching-xmatching)
    * [Downloading Image and Captioning Data](#download-image-and-captioning-data)
    * [Model Training](#training-the-cross-modal-matching-model)
    * [Benchmark (Optional)](#benchmarking-cross-modal-matching-models-optional)
* [Vokenization](#vokenization-vokenization)
    * [Downloading Pure-Language Data](#downloading-and-pre-processing-pure-language-data)
    * [Extracting Visual Feature](#extracting-image-features)
    * [Vokenization Process](#the-vokenization-process)
* [Visually-Supervised Language Model](#visually-supervised-language-model-vlm)
    * [VLM Pre-training](#pre-training-with-vlm)
    * [GLUE Evaluation](#glue-evaluation)
    * [MLM Pre-training (as baselines)](#bert-as-baselines)
    
> Note: I recommend focusing on "Wiki103" first and
> ignoring the code blocks related to "English Wikipedia";
> "Eng Wiki" might take too long to complete.

## Installation
```shell script
pip install -r requirements.txt
```

Requires Python 3.6+ (to support huggingface [transformers](https://github.com/huggingface/transformers)).

## Contextualized Cross-Modal Matching (xmatching)
In this [module](xmatching) (corresponding to Sec 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)), 
we learn a token-image matching model from sentence-image aligned data (i.e., image captioning data).
The model "contextually" measures the relevance between tokens (i.e., words) and images:
the term "contextual" emphasizes that the sentences (the context) are taken into account
when measuring the token-image relevance score.
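
For intuition, here is a minimal sketch of what "contextual" means here: the token representations come from a sentence-level encoder, so the same word can score differently against an image in different sentences. The encoder names and shapes below are illustrative assumptions, not the exact interfaces in [xmatching/model.py](xmatching/model.py).
```python
def relevance(lang_enc, visn_enc, sentence_tokens, image):
    # Hypothetical encoders: lang_enc returns one contextualized vector per
    # token, shape (seq_len, dim); visn_enc returns one image vector, shape (dim,).
    tok_feats = lang_enc(sentence_tokens)
    img_feat = visn_enc(image)
    # One relevance score per token; the scores depend on the whole sentence.
    return tok_feats @ img_feat
```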


### Download Image and Captioning Data
1. Download MS COCO images:
    ```shell script
    # MS COCO (Train 13G, Valid 6G)
    mkdir -p data/mscoco
    wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
    wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
    unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
    unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
    ```
   If you already have the COCO images on disk, save them as:
    ```
    data
      |-- mscoco
            |-- images
                 |-- train2014
                         |-- COCO_train2014_000000000009.jpg
                         |-- COCO_train2014_000000000025.jpg
                         |-- ......
                 |-- val2014
                         |-- COCO_val2014_000000000042.jpg
                         |-- ......
    ```

2. Download captions (split following the LXMERT project):
    ```shell script
    mkdir -p data/lxmert
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
    ```

### Training the Cross-Modal Matching Model
The model is trained on MS COCO with pairwise hinge loss (details in Sec. 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)).
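
As a rough sketch of the idea (the margin value and tensor names here are illustrative assumptions; see [xmatching/loss.py](xmatching/loss.py) for the actual implementation), the hinge loss pushes the score of an aligned token-image pair above that of a mismatched pair by a margin:
```python
import torch

def pairwise_hinge_loss(pos_score, neg_score, margin=0.5):
    # pos_score: relevance of tokens to their aligned image.
    # neg_score: relevance of the same tokens to a mismatched image.
    # Penalize whenever the positive score does not beat the
    # negative score by at least `margin`.
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()
```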

Running Commands:
```bash
# Run the cross-modal matching model with single-machine multi-processing distributed training
# "0,1" indicates using the GPUs 0 and 1.
# "bert_resnext" is the name of this snapshot and would be saved at snap/xmatching/bert_resnext
# "--visn resnext101_32x8d" is the vision backbone
# "--lang bert" is the langaugae backbone
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert
```
The options `--visn` and `--lang` specify the architectures of the two encoders.
Tested options:
```
--visn $VISN_MODEL
VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152, 
            wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...} 
--lang $LANG_MODEL
LANG_MODEL={bert, roberta, xlnet, bert-large, ...}
```
For visual backbones, the models in [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) are mostly supported.
You might need to handle the last FC layer, because it is written differently in different backbones.
The language backbones are initialized from huggingface [transformers](https://github.com/huggingface/transformers).
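
For example, a minimal sketch of adapting a ResNet-style torchvision backbone by replacing its final FC layer with a projection to the joint embedding dimension (64 matches the `--dim 64` in the training script; the actual wiring in [xmatching/model.py](xmatching/model.py) may differ):
```python
import torch.nn as nn
import torchvision.models as models

backbone = models.resnext101_32x8d(pretrained=True)
# ResNet-style models expose the classifier as `.fc`; other families
# (e.g., VGG, DenseNet) name and structure this layer differently.
backbone.fc = nn.Linear(backbone.fc.in_features, 64)
```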

> We found that the results with XLNet are pretty low, but we have not identified 
> the reason. Results of other backbones are similar.

## Vokenization (vokenization)
Vokenization is the bridge between the cross-modality (word-and-image) matching models (xmatching) and 
the visually-supervised language models (vlm).
The final goal is to convert the language tokens to related images 
(we call them **vokens**).
These **vokens** enable the visual supervision of the language model.
We mainly provide pre-processing tools (i.e., feature extraction, tokenization, and vokenization) and
evaluation tools for the cross-modal matching models here.
Here is a diagram of these processes; we next discuss them one by one:
```
Extracting Image Features-----> Benchmarking the Matching Models (Optional) --> Vokenization
Downloading Language Data --> Tokenization -->-->--/
```

### Downloading and Pre-Processing Pure-Language Data 
We provide scripts to get the datasets "wiki103" and "wiki".
We denote them as "XX-cased" or "XX-uncased", where the suffix "cased" / "uncased" only indicates
the property of the raw text.
1. **Wiki103**. The [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset
is a selected subset of English Wikipedia, containing around 100M tokens.
    ```shell script
    bash data/wiki103/get_data_cased.sh
    ```
2. **English Wikipedia**. 
The script to download and process the wiki data is modified from [XLM](https://github.com/facebookresearch/XLM).
It will download a 17G file. 
The speed depends on the network connection, and it usually takes several hours to filter the data.
The process ends with around 2.8B tokens.
    ```shell script
    bash data/wiki/get_data_cased.bash en
    ```
    Note: For *RoBERTa*, an untokenized version of wiki is required (otherwise the results would be much lower), 
    so please use the following command:
    ```shell script
    bash data/wiki/get_data_cased_untokenized.bash en
    ```
   
> Note: I recommend focusing on "Wiki103" first and
> ignoring the code blocks related to "English Wikipedia";
> "Eng Wiki" might take too long to complete.
   
### Tokenization of Language Data
We next tokenize the language corpus.
It saves three files locally: 
"$dataset_name.$tokenizer_name", 
"$dataset_name.$tokenizer_name.hdf5",
and "$dataset_name.$tokenizer_name.line".
Taking the wiki103 dataset and BERT tokenizer as an example, 
we convert the training file into
```
data 
 |-- wiki103-cased 
        |-- wiki.train.raw.bert-base-uncased
        |-- wiki.train.raw.bert-base-uncased.hdf5
        |-- wiki.train.raw.bert-base-uncased.line
```
The txt file `wiki.train.raw.bert-base-uncased` saves the tokens; each line in it contains the tokens of the corresponding line 
in the original file.
The hdf5 file `wiki.train.raw.bert-base-uncased.hdf5` stores all the tokens continuously and uses
`wiki.train.raw.bert-base-uncased.line` to index the starting token of each line.
The ".line" file has `L+1` lines, where `L` is the number of lines in the original file:
line `i` of the original file covers the token range `line[i]` to `line[i+1]` in the hdf5 file.
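
Given this layout, reading back the tokens of line `i` is a simple range lookup. A minimal sketch (the hdf5 dataset is named `tokens`, as created by [tokenization/to_hdf5.py](tokenization/to_hdf5.py)):
```python
import h5py

def read_line_tokens(prefix, i):
    # prefix is e.g. "data/wiki103-cased/wiki.train.raw.bert-base-uncased"
    with open(prefix + '.line') as f:
        starts = [int(s) for s in f]          # L+1 offsets into the token stream
    with h5py.File(prefix + '.hdf5', 'r') as h5:
        return h5['tokens'][starts[i]:starts[i + 1]]
```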

Commands:
1. Wiki103 (around 10 min)
    ```shell script
    bash tokenization/tokenize_wiki103_bert.bash 
    ```
2. English Wikipedia (around 3 hours)
    ```shell script
    bash tokenization/tokenize_wiki_bert.bash 
    ```

### Extracting Image Features
The image pre-processing extracts the image features to build the keys in the vokenization retrieval process.

#### Download the Visual Genome (VG) images
Since the MS COCO images are used in training the cross-modal matching model
(see [xmatching](#contextualized-cross-modal-matching-xmatching)),
we use the [Visual Genome](https://visualgenome.org/) images as 
candidate vokens for retrieval.
We download the images first.
```shell script
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/
unzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip
unzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../
```
If you already have the Visual Genome images on disk, save them as:
```
data
|-- vg
    |-- images
         |-- 1000.jpg
         |-- 1001.jpg
         |-- ......
```
    
#### Build Universal Image Ids
We first build a list of universal image indexes with 
[vokenization/create_image_ids.py](vokenization/create_image_ids.py). 
It unifies the image ids across different experiments,
so that the feature arrays stored in hdf5 can be universally indexed.
The image ids are saved under a shared path `LOCAL_DIR` (default: `data/vokenization`)
defined in [vokenization/common.py](vokenization/common.py),
i.e., under `data/vokenization/images` with the format `{IMAGE_SET}_ids.txt`.
All experiments agree on this meta info,
so we never get different indexing across retrieval experiments.

> Note: The ids created by [create_image_ids.py](vokenization/create_image_ids.py) only fix the order of the images.
> The actual images in the dictionary are provided by `extract_keys.bash` and thus correspond to the 
> `_paths.txt` files, because `extract_keys` filters out all broken and non-existing images.

Commands:
```bash
# Step 1, Build image orders.
python vokenization/create_image_ids.py  
```

#### Extracting Image Features

Extract image features according to the list built above, using the code in 
[vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py). 
The code will first read the image ids saved in `data/vokenization/images/{IMAGE_SET}_ids.txt` and locate the images.
The features will be saved under `snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5`.
It finishes within 1 hour.

Commands:
```bash
# Step 2, Extract features. 
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME 
bash scripts/extract_keys.bash 0 bert_resnext 
```


### Benchmarking Cross-Modal Matching Models (Optional)
> Before evaluating, please make sure that [extracting image features](#extracting-image-features) and [tokenization](#tokenization-of-language-data) are completed.

We benchmark the performance of the cross-modal matching models at large scale.
The evaluation includes two different metrics: diversity and retrieval performance.

Diversity 
(in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py))
checks whether the same [token type](https://arxiv.org/pdf/1902.06006.pdf)
is mapped to diverse images depending on its context (i.e., the sentence).
Retrieval 
(in [vokenization/evaluate_retrieval.py](vokenization/evaluate_retrieval.py)) 
measures the correspondence between the tokens and the retrieved images.
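
As a toy illustration of what the diversity metric is after (not the exact computation in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py)), one could count how many distinct images each token type retrieves across contexts:
```python
from collections import defaultdict

def image_diversity(token_voken_pairs):
    # token_voken_pairs: iterable of (token type, retrieved image id),
    # gathered over many sentences (i.e., many contexts).
    images_per_type = defaultdict(set)
    for token, voken in token_voken_pairs:
        images_per_type[token].add(voken)
    # A token type mapped to many distinct images is contextually diverse.
    return {t: len(imgs) for t, imgs in images_per_type.items()}
```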

We gather these two utilities into one script; the command is:
```bash
bash scripts/xmatching_benchmark.bash 0 bert_resnext
```

### The Vokenization Process
After all these steps, we can start to vokenize the language corpus.
It loads the tokens saved in `dataset_name.tokenizer_name.hdf5` 
and uses the line-split information in `dataset_name.tokenizer_name.line`.

The code is optimized and can be resumed by simply rerunning it.
The vokens will be saved to `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5` by default.
The file `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids` contains the universal image id 
of each voken;
e.g., the image id `vg_nococo/8` corresponds to the 8-th feature
saved in `snap/xmatching/bert_resnext/keys/vg_nococo.hdf5`.
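
A minimal sketch of this mapping, under the file layout above (the name of the dataset inside the keys hdf5 is an assumption here):
```python
import h5py

VOKEN_DIR = 'snap/xmatching/bert_resnext/vokens'
KEY_DIR = 'snap/xmatching/bert_resnext/keys'

with open(f'{VOKEN_DIR}/wiki.train.raw.vg_nococo.ids') as f:
    voken_ids = [line.strip() for line in f]       # e.g., "vg_nococo/8"

img_set, idx = voken_ids[0].rsplit('/', 1)         # -> ("vg_nococo", "8")
with h5py.File(f'{KEY_DIR}/{img_set}.hdf5', 'r') as h5:
    feature = h5['keys'][int(idx)]                 # 'keys' dataset name assumed
```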


> Note: `--tokenizer-name` must be provided in the script.


Commands
1. Wiki103 (around 1 hour on 4 Titan V)
    ```shell script
    # Note: mp is the abbreviation for "multi-processing"
    # bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
    ```
2. English Wikipedia (around 1 day on 4 Titan V)
    ```shell script
    # bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext
    ```

> The script will call
> [vokenization/vokenize_corpus_mp.py](vokenization/vokenize_corpus_mp.py)
> to vokenize a corpus. 
> The vokenization happens in [vokenization/vokenization.py](vokenization/vokenization.py), and
> it uses [vokenization/indexing.py](vokenization/indexing.py) to do nearest-neighbor search
> (based on [faiss](https://github.com/facebookresearch/faiss)).
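
At its core, the retrieval is a maximum-inner-product search of token embeddings against the image keys. A minimal faiss sketch of the idea (an exact flat index over random stand-in data; the actual index setup in [vokenization/indexing.py](vokenization/indexing.py) may differ, and a torch-based fallback exists for setups without faiss):
```python
import faiss
import numpy as np

dim = 64                                               # matches `--dim 64` in training
keys = np.random.rand(50000, dim).astype('float32')    # stand-in image features
queries = np.random.rand(8, dim).astype('float32')     # stand-in token embeddings

index = faiss.IndexFlatIP(dim)          # exact inner-product search
index.add(keys)
scores, ids = index.search(queries, 1)  # top-1 image index per token = its voken
```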


## Visually-Supervised Language Model (vlm)

### Pre-Training with VLM
As discussed in Sec. 2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf),
we use the previously generated vokens to pre-train the model 
with visual supervision.

#### Wiki103 
After the [vokenization process](#the-vokenization-process) of wiki103,
we can run the model with the following command:
```shell script
# bash scripts/small_vlm_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small
```
It will call 
[vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the support of voken supervision.
The snapshot will be saved to `snap/vlm/wiki103_bert_small`.
We recommend running this Wiki103 experiment first since it finishes 
in a reasonable time (20 hours).
The pure BERT pre-training option is also available [later](#bert-as-baselines)
for comparison.

Note: by default, mixed-precision training is not used.
To enable mixed-precision pre-training,
please install the [nvidia/apex](https://github.com/NVIDIA/apex) library with command:
```shell script
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
After that, you can bring back the options `--fp16` and `--fp16_opt_level O2` in 
the script `scripts/small_vlm_wiki103.bash`.
I recommend using `--fp16_opt_level O2`.
Although the option O2 might be [unstable](https://github.com/NVIDIA/apex/issues/818#issuecomment-639012282),
it saves a lot of memory:
the max per-GPU batch size is 32 with O1 but 64 with O2.

#### English Wikipedia
After the [vokenization process](#the-vokenization-process) of English Wikipedia,
we can run the model with the following command:
```shell script
# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base
```
It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia
dataset with the support of voken supervision.
The snapshot will be saved to `snap/vlm/wiki_bert_base`.

It takes around 3-5 days on 4 Titan V / GTX 2080 cards
and around 5-7 days on 4 Titan Pascal / T4 cards
(this estimate is accurate since I have inevitably run experiments on all these servers...).
Titan V / 2080 / T4 natively support mixed-precision training (triggered by the `--fp16` option;
it requires installing [apex](https://github.com/NVIDIA/apex)),
so training is much faster there.
Titan Pascal also saves some memory with the `--fp16` option.


### GLUE Evaluation
By default, we use the [GLUE](https://gluebenchmark.com/) benchmark
(e.g., [SST](https://nlp.stanford.edu/sentiment/index.html),
[MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398),
[QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs),
[MNLI](https://cims.nyu.edu/~sbowman/multinli/),
[QNLI](https://rajpurkar.github.io/SQuAD-explorer/))
as downstream tasks.
Other tasks can be evaluated following the setup [here](https://github.com/huggingface/transformers/tree/28d183c90cbf91e94651cf4a655df91a52ea1033/examples)
by changing the option `--model_name_or_path` to the correct snapshot path, e.g., `snap/bert/wiki103`.

#### Download GLUE dataset
This downloading script is copied from the [huggingface transformers](https://github.com/huggingface/transformers/tree/master/examples/text-classification)
project.
Since [transformers](https://github.com/huggingface/transformers) is still under active
development, API changes might affect the code. 
The code has been upgraded for compatibility with transformers==3.3.
```shell script
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py
python download_glue_data.py --data_dir data/glue --tasks all
```

#### Finetuning on GLUE Tasks
The pre-trained snapshots are evaluated by fine-tuning them on the [GLUE](https://gluebenchmark.com/) 
benchmark.
The code is modified from the huggingface [transformers](https://github.com/huggingface/transformers).

Running GLUE evaluation for snapshots from different epochs:
```bash
# bash scripts/run_glue_epochs.bash $GPUS $SNAP_PATH --snaps $NUM_OF_SNAPS
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7                            
```
It will assess 7 snapshots using GPUs 0,1,2,3. 
Setting `--snaps -1` will assess all checkpoints.
If you just want to evaluate the last (usually the best) snapshot, please use:
```
bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1
```

#### Showing the results
For all results saved under `snap/` (whatever the directory names are),
running the following command will print out all the results.
```bash
python vlm/show_glue_results_epochs.py 
```

It will print results like
```
snap/vlm/test_finetune/glueepoch_checkpoint-epoch0019
     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   54.51   84.72   87.18   52.32   90.02   88.36   87.16   81.92   82.57   78.75
snap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029
     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   58.12   82.76   84.45   26.74   89.56   84.40   86.52   77.56   77.99   74.23
```

### BERT (As baselines)
We also provide pure language-model pre-training as baselines.

#### Wiki103
```shell script
# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small
```
It will call 
[vlm/run_lm_distributed.py](vlm/run_lm_distributed.py)
and run a BERT-6Layers-512Hiddens model on [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset with the masked language model only.
The snapshot will be saved to `snap/bert/bert_small`.

Or you could directly use the script `small_wiki103_glue.bash` to 
run the GLUE evaluation after pre-training finishes.
```shell script
bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small
```

#### English Wikipedia
Command:
```shell script
# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki
```

With GLUE evaluation:
```shell script
bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki
```

## Pre-processed Data and Pre-trained Models
### Data

Wiki103 (100M tokens)
```
mkdir -p data/wiki103-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
```

Wiki (2.8B tokens)
```
mkdir -p data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget  https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased
```

### Models
- Cross-Modal Matching model: [https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip](https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip)
- BERT (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip)
- BERT + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip)
- RoBERTa + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip)

## Reference
If you find our project useful, please cite this paper:
```
@inproceedings{tan2020vokenization,
  title={Vokenization: Improving Language Understanding with Contextualized, 
Visual-Grounded Supervision},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}
```

## Acknowledgement
I thank the support from [Bloomberg Data Science Ph.D. Fellowship](https://www.techatbloomberg.com/bloomberg-data-science-ph-d-fellowship/).
We thank the reviewers and [Yixin Nie](https://easonnie.github.io/) 
and [Jie Lei](https://www.cs.unc.edu/~jielei/)
for their helpful discussions.
Part of the code is built based on huggingface [transformers](https://github.com/huggingface/transformers), 
facebook [XLM](https://github.com/facebookresearch/XLM), and [faiss](https://github.com/facebookresearch/faiss).



================================================
FILE: data/lxmert/.gitignore
================================================
/mscoco_minival.json
/mscoco_nominival.json
/mscoco_train.json
/vgnococo.json


================================================
FILE: data/mscoco/.gitignore
================================================
/images


================================================
FILE: data/vg/.gitignore
================================================
/images


================================================
FILE: data/wiki/get_data_cased.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# Usage: ./get-data-wiki.sh $lg (en)
#

set -e

lg=$1  # input language

# data path
WIKI_PATH=data/wiki-cased
MAIN_PATH=$WIKI_PATH

# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py

# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME

# install tools
data/wiki/install-tools.sh $TOOLS_PATH

# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt

# download Wikipedia dump
echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"

# extract and tokenize Wiki data
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
#python -m $TOOLS_PATH/wikiextractor/wikiextractor/WikiExtractor $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
  | sed "/^\s*\$/d" \
  | grep -v "^<doc id=" \
  | grep -v "</doc>\$" \
  | $TOKENIZE $lg $TOOLS_PATH \
  | python $REMOVE_ACCENT \
  > $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Tokenized ( + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***"

# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
    NLINES=`wc -l $1  | awk -F " " '{print $1}'`;
    NTRAIN=$((NLINES - 10000));
    NVAL=$((NTRAIN + 5000));
    cat $1 | head -$NTRAIN             > $2;
    cat $1 | head -$NVAL | tail -5000  > $3;
    cat $1 | tail -5000                > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw

# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt


================================================
FILE: data/wiki/get_data_cased_untokenized.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

#
# Usage: ./get-data-wiki.sh $lg (en)
#

set -e

lg=$1  # input language

# data path
WIKI_PATH=data/wiki-cased-untokenized
MAIN_PATH=$WIKI_PATH

# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py

# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME

# install tools
data/wiki/install-tools.sh $TOOLS_PATH

# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt

# download Wikipedia dump
if [ ! -f $WIKI_PATH/bz2/enwiki-latest-pages-articles.xml.bz2 ]; then
    echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
    wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
    echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"
fi

# extract and tokenize Wiki data
#cd $MAIN_PATH
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
  | sed "/^\s*\$/d" \
  | grep -v "^<doc id=" \
  | grep -v "</doc>\$" \
  | python $REMOVE_ACCENT \
  > $WIKI_PATH/txt/$lg.all.raw
fi
echo "*** Not Tokenized ( but + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***"

# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
    NLINES=`wc -l $1  | awk -F " " '{print $1}'`;
    NTRAIN=$((NLINES - 10000));
    NVAL=$((NTRAIN + 5000));
    cat $1 | head -$NTRAIN             > $2;
    cat $1 | head -$NVAL | tail -5000  > $3;
    cat $1 | tail -5000                > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw

# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt


================================================
FILE: data/wiki/install-tools.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

set -e

# data path
TOOLS_PATH=$1

# tools
MOSES_DIR=mosesdecoder
FASTBPE_DIR=fastBPE
FASTBPE=fast
WMT16_SCRIPTS=wmt16-scripts

# tools path
mkdir -p $TOOLS_PATH

# Copy the scripts to TOOLS_PATH
cp -r data/wiki/tools/* $TOOLS_PATH


#
# Download and install tools
#

old=$(pwd)
cd $TOOLS_PATH


# Download Moses
if [ ! -d "$MOSES_DIR" ]; then
  echo "Cloning Moses from GitHub repository..."
  git clone https://github.com/moses-smt/mosesdecoder.git
fi

# Download fastBPE
if [ ! -d "$FASTBPE_DIR" ]; then
  echo "Cloning fastBPE from GitHub repository..."
  git clone https://github.com/glample/fastBPE
fi

# Compile fastBPE
if [ ! -f "$FASTBPE_DIR/$FASTBPE" ]; then
  echo "Compiling fastBPE..."
  cd fastBPE
  g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
  cd ..
fi

# Download Sennrich's tools
if [ ! -d "$WMT16_SCRIPTS" ]; then
  echo "Cloning WMT16 preprocessing scripts..."
  git clone https://github.com/rsennrich/wmt16-scripts.git
fi

# Download WikiExtractor
if [ ! -d wikiextractor ]; then
    echo "Cloning WikiExtractor from GitHub repository..."
    git clone https://github.com/attardi/wikiextractor.git
    cd wikiextractor
    git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
    cd ..
fi

cd $old

# # Chinese segmenter
# if ! ls $TOOLS_PATH/stanford-segmenter-* 1> /dev/null 2>&1; then
#   echo "Stanford segmenter not found at $TOOLS_PATH/stanford-segmenter-*"
#   echo "Please install Stanford segmenter in $TOOLS_PATH"
#   exit 1
# fi
# 
# # Thai tokenizer
# if ! python -c 'import pkgutil; exit(not pkgutil.find_loader("pythainlp"))'; then
#   echo "pythainlp package not found in python"
#   echo "Please install pythainlp (pip install pythainlp)"
#   exit 1
# fi
# 


================================================
FILE: data/wiki/tools/remove_accent.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import sys
import unicodedata
import six


def convert_to_unicode(text):
    """
    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.
    """
    # six_ensure_text is copied from https://github.com/benjaminp/six
    def six_ensure_text(s, encoding='utf-8', errors='strict'):
        if isinstance(s, six.binary_type):
            return s.decode(encoding, errors)
        elif isinstance(s, six.text_type):
            return s
        else:
            raise TypeError("not expecting type '%s'" % type(s))

    return six_ensure_text(text, encoding="utf-8", errors="ignore")


def run_strip_accents(text):
    """
    Strips accents from a piece of text.
    """
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
        if cat == "Mn":
            continue
        output.append(char)
    return "".join(output)


for line in sys.stdin:
    line = convert_to_unicode(line.rstrip())
    line = run_strip_accents(line)
    print(u'%s' % line)


================================================
FILE: data/wiki/tools/segment_th.py
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import sys
from pythainlp.tokenize import word_tokenize

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    print(' '.join(word_tokenize(line)))


================================================
FILE: data/wiki/tools/tokenize.sh
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

# Tokenize text data in various languages
# Usage: e.g.   cat wiki.ar | tokenize.sh ar

set -e

N_THREADS=8

lg=$1
TOOLS_PATH=$2

# moses
MOSES=$TOOLS_PATH/mosesdecoder
REPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl
TOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl

# Chinese
if [ "$lg" = "zh" ]; then
  $TOOLS_PATH/stanford-segmenter-*/segment.sh pku /dev/stdin UTF-8 0 | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR
# Thai
elif [ "$lg" = "th" ]; then
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $TOOLS_PATH/segment_th.py
# Japanese
elif [ "$lg" = "ja" ]; then
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | kytea -notags
# other languages
else
  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
fi


================================================
FILE: data/wiki103/get_data_cased.sh
================================================
OUTPUT=data/wiki103-cased
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-raw-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103-raw/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-raw-v1.zip $OUTPUT/wikitext-103-raw


================================================
FILE: data/wiki103/get_data_uncased.sh
================================================
OUTPUT=data/wiki103

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-v1.zip $OUTPUT/wikitext-103


================================================
FILE: requirements.txt
================================================
torch
#==1.4.0
torchvision
#==0.5.0
transformers==3.3.0
tensorboardX

# For GLUE evaluation
sklearn

# Faiss supports fast indexing.
# The code has a torch-implemented GPU indexing, so do not worry if you cannot install faiss.
faiss-gpu>=1.6.3

# Spacy is used for sentence segmentation; the sentences are the input to the cross-modal matching model.
spacy

# A higher h5py version to support h5py.VirtualLayout
h5py>=2.10.0


================================================
FILE: scripts/base_vlm_wiki.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
    --max_steps=200000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \



================================================
FILE: scripts/base_vlm_wiki_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
    --max_steps=200000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4



================================================
FILE: scripts/base_wiki.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
    --max_steps 220000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


================================================
FILE: scripts/base_wiki_glue.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash

export TRAIN_FILE=data/wiki-cased/en.train.raw
export TEST_FILE=data/wiki-cased/en.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-12L-768H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
    --max_steps 220000 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=5000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

#--shuffle \
# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps -1


================================================
FILE: scripts/extract_keys.bash
================================================
CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \
    --image-sets vg_nococo,coco_minival,coco_nominival,coco_train,cc_valid \
    --load-dir snap/xmatching/$2


================================================
FILE: scripts/mpvokenize_wiki.bash
================================================
GPU=$1

LOAD=snap/xmatching/$2
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased

for DATA_NAME in en.valid.raw en.test.raw en.train.raw
do 
    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
        --load $LOAD \
        --corpus=$DATA_DIR/$DATA_NAME \
        --tokenizer-name $TOKENIZER \
        --image-sets vg_nococo \
        --max-img-num 50000 
done



================================================
FILE: scripts/mpvokenize_wiki103.bash
================================================
GPU=$1

LOAD=snap/xmatching/$2
WIKI_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased

for DATA_NAME in wiki.valid.raw wiki.test.raw wiki.train.raw
do 
    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \
        --load $LOAD \
        --corpus=$WIKI_DIR/$DATA_NAME \
        --tokenizer-name $TOKENIZER \
        --image-sets vg_nococo \
        --max-img-num 50000
done



================================================
FILE: scripts/run_glue_at_epoch.bash
================================================
export GLUE_DIR=data/glue/
EPOCHS=$2
MODEL=$3
CKPT=$4

for TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI
do
    CUDA_VISIBLE_DEVICES=$1 python vlm/run_glue.py \
        --model_type bert \
        --tokenizer_name=bert-base-uncased \
        --model_name_or_path $MODEL/$CKPT \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --do_lower_case \
        --data_dir $GLUE_DIR/$TASK_NAME \
        --save_steps -1 \
        --max_seq_length 126 \
        --per_gpu_eval_batch_size=32   \
        --per_gpu_train_batch_size=32   \
        --learning_rate 1e-4 \
        --warmup_steps 0.1 \
        --num_train_epochs $EPOCHS.0 \
        --output_dir $MODEL/glueepoch_$CKPT/$TASK_NAME
done

        #--overwrite_output_dir \


================================================
FILE: scripts/run_glue_epochs.bash
================================================
GPUS=$1
MODEL=$2
 
python vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \
    ${@:3}



================================================
FILE: scripts/run_xmatching.bash
================================================
GPUS=$1
# The name of experiment
NAME=$2

# Create dirs and make backup
output=snap/xmatching/$NAME
mkdir -p $output/src/
cp -r xmatching $output/src/
cp $0 $output/run.bash

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python xmatching/main.py \
    --train-imgs mscoco_train,mscoco_nominival --valid-imgs mscoco_minival \
    --train-langs mscoco --valid-langs mscoco \
    --max-len 20 --dim 64 \
    --lang-layers 4,3,2,1 \
    --lang-pretrained --visn-pretrained \
    --num-workers 8 --batchSize 256 --optim adam --lr 1e-3 --epochs 20 \
    --nodes 1 --nr 0 \
    --output $output ${@:3} | tee $output/log.log

#--visn resnext101_32x8d --lang bert \


================================================
FILE: scripts/small_vlm_wiki103.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
	--num_train_epochs=40 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \



================================================
FILE: scripts/small_vlm_wiki103_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/vlm/$NAME
mkdir -p $output/src
cp -r vlm $output/src/
cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash
cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash 
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=32 \
    --per_gpu_eval_batch_size=32 \
	--gradient_accumulation_steps=2 \
	--num_train_epochs=40 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --mlm_probability 0.15 \
    --mlm_ratio 1.0 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --do_voken_cls \
    --voken_labels all \
    --voken_dir snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4



================================================
FILE: scripts/small_wiki103.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
	--num_train_epochs=44 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \




================================================
FILE: scripts/small_wiki103_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
	--overwrite_output_dir \
	--config_name=vlm/configs/bert-6L-512H.json \
	--tokenizer_name=bert-base-uncased \
    --model_type=bert \
	--block_size=126 \
	--per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
	--gradient_accumulation_steps=1 \
	--num_train_epochs=44 \
	--learning_rate=2e-4 \
	--weight_decay=0.01 \
	--warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
	#--fp16_opt_level O2 \


# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4


================================================
FILE: scripts/xmatching_benchmark.bash
================================================
# Benchmarking the cross-modal matching model with
#     1. Retrieval scores.
#     2. Voken diversity w.r.t. words in a specific language corpus.
# Please run this after image key extraction and tokenization,
#    i.e., step 1 and step 2 in the readme.

MODEL=$2
MODELPATH=snap/xmatching/$MODEL
rm -rf $MODELPATH/analysis.log

# Retrieval scores
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_retrieval.py \
    --load $MODELPATH \
    --image-sets coco_minival,cc_valid \
    | tee -a $MODELPATH/analysis.log

# Diversity
# Test diversity of vision-and-language (captioning) datasets
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
    --load $MODELPATH \
    --image-sets vg_nococo \
    --corpus coco_minival,cc_valid \
    | tee -a $MODELPATH/analysis.log

# Test diversity of pure-language corpus
CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \
    --load $MODELPATH \
    --image-sets vg_nococo \
    --corpus data/wiki103-cased/wiki.valid.raw \
    --maxsents 95000 \
    | tee -a $MODELPATH/analysis.log


================================================
FILE: snap/bert/.gitkeep
================================================


================================================
FILE: snap/vlm/.gitkeep
================================================


================================================
FILE: snap/xmatching/.gitkeep
================================================
/*


================================================
FILE: tokenization/to_hdf5.py
================================================
import h5py
import numpy as np
import tqdm

from transformers import AutoTokenizer


def validate_hdf5(fname, tokenizer_name):
    print("--------------------------------------------")
    print("Start to valid the hdf5 file", fname + '.' + tokenizer_name + '.hdf5')

    with open(fname) as f:
        lines = []
        for line in f:
            if 'wiki' in fname:
                # Wiki103: remove document title
                if line.startswith(' = '):
                    continue
                # Full Wiki: remove lines that are too short.
                if len(line.strip().split(' ')) < 5:
                    continue

            if len(line.strip()) == 0:
                # Always drop empty line
                continue
            lines.append(line)

    # Use the slow tokenizer to validate the results of the fast tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'r')
    tokens = h5_file['tokens']

    print("Start to check the first 10 lines:")
    ids = []
    for line in lines[:10]:
        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
    ids = np.array(ids)
    first_tokens = np.array(tokens[:len(ids)])
    if np.array_equal(ids, first_tokens):
        print("PASS")
    else:
        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
        print()
        print(' '.join(tokenizer.convert_ids_to_tokens(first_tokens)))
        assert False, "FAIL"

    print("Start to check the last 10 lines:")
    ids = []
    for line in lines[-10:]:
        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))
    ids = np.array(ids)
    last_tokens = np.array(tokens[-len(ids):])
    if np.array_equal(ids, last_tokens):
        print("PASS")
    else:
        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))
        print(' '.join(tokenizer.convert_ids_to_tokens(last_tokens)))
        assert False, "FAIL"
    print("--------------------------------------------")


def to_hdf5(fname, tokenizer_name, validate=True):
    print("Process %s" % fname)

    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'w')
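    # Create a growable 1-D token dataset: maxshape=(None,) allows the
    # resize() calls below as tokens are streamed in chunk by chunk.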
    dset = h5_file.create_dataset("tokens",
                                  (0,),
                                  maxshape=(None,),
                                  dtype='int32')

    dump_interval = 1000000
    dump_iter = 0
    with open('%s.%s' % (fname, tokenizer_name)) as f:
        lines = 0
        tokens = []
        for line in tqdm.tqdm(f):
            for token in map(int, line.split(' ')):
                tokens.append(token)
            if len(tokens) >= dump_interval:
                dset.resize((dump_iter + len(tokens),))
                dset[dump_iter: dump_iter + len(tokens)] = tokens
                dump_iter += len(tokens)
                tokens = []
            lines += 1

        dset.resize((dump_iter + len(tokens),))
        dset[dump_iter: dump_iter + len(tokens)] = tokens
        dump_iter += len(tokens)

    assert len(dset) == dump_iter
    h5_file.close()

    if validate:
        validate_hdf5(fname, tokenizer_name)

    print()



================================================
FILE: tokenization/tokenize_dataset.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.

import argparse
from pathlib import Path

from transformers import AutoTokenizer
import time

from to_hdf5 import to_hdf5

def tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=False):
    data_path = Path(data_dir)

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

    f = open(data_path / fname)
    g = open((data_path / ('%s.%s' % (fname, tokenizer_name))), 'w')

    # Statistics
    dcmt_cnt = 0
    token_cnt = 0
    line_cnt = 0
    line_starts = []

    # Logging and dumping hyper-parameters
    cache = ''
    log_interval = log_iter = 1000000
    dump_interval = dump_iter = 100000
    start_time = time.time()

    for i, line in enumerate(f):
        # Identify the start of documents, ignore it.
        if 'wiki103' in data_dir:
            if line.startswith(' = '):
                dcmt_cnt += 1
                continue
        elif 'wiki' in data_dir:
            if len(line.strip().split(' ')) == 1:
                dcmt_cnt += 1
                continue

        if 'wiki' in data_dir:
            # Remove lines that are too short (BookCorpus does not need this).
            if len(line.strip().split(' ')) < 5:
                continue

        # Drop empty line (1)
        if len(line.strip()) == 0:
            continue

        tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
        # tokenized_line = tokenizer.encode(line, add_special_tokens=False)
        if len(tokenized_line) == 0:    # Drop empty line (2)
            continue

        line_cnt += 1
        line_starts.append(token_cnt)
        if i < 5:
            print()
            print('Line:', line)
            print('Tokens:', ' '.join(tokenizer.convert_ids_to_tokens(tokenized_line)))
        token_cnt += len(tokenized_line)
        cache += ' '.join(map(str, tokenized_line)) + '\n'

        if (token_cnt + 1) > dump_iter:
            g.write(cache)
            cache = ''
            dump_iter += dump_interval

        if (token_cnt + 1) > log_iter:
            used_time = time.time() - start_time
            print("Process %d tokens in %d seconds, %0.4f tokens per second." % (
                token_cnt, used_time, token_cnt / used_time))
            log_iter += log_interval

    # Deal with the last remaining tokens.
    line_starts.append(token_cnt)
    g.write(cache)

    # Dump Line starts
    identifier = 'sent' if lines_are_sents else 'line'
    # Use a separate name for the offsets file so the later f.close() still
    # closes the input data file opened above.
    with open(data_path / ('%s.%s.%s' % (fname, tokenizer_name, identifier)), 'w') as ls_f:
        for line_start in line_starts:
            ls_f.write(str(line_start) + "\n")

    f.close()
    g.close()
    print(f"Documents: {dcmt_cnt}, Lines: {line_cnt}, Words: {token_cnt} in dataset {fname}")

    to_hdf5(str(data_path / fname), tokenizer_name)
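
# Illustrative only (a sketch, not part of the original pipeline): how the
# "<fname>.<tokenizer_name>.line" offsets written above pair with the HDF5
# token stream to recover the token span of a single kept input line.
def _example_line_span(data_dir, fname, tokenizer_name, i):
    import h5py  # local import; the tokenization script itself does not need h5py
    data_path = Path(data_dir)
    starts = [int(x) for x in open(data_path / ('%s.%s.line' % (fname, tokenizer_name)))]
    with h5py.File(str(data_path / fname) + '.' + tokenizer_name + '.hdf5', 'r') as h5:
        return h5['tokens'][starts[i]: starts[i + 1]]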


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "datadir", default=None, type=str, help="The directory that contains the input data file."
    )
    parser.add_argument(
        "fname", default=None, type=str, help="The name of the input data file (a text file) inside datadir."
    )
    parser.add_argument(
        "tokenizer_name", default=None, type=str, help="The tokenizer to use, e.g., bert-base-uncased."
    )
    parser.add_argument(
        "--lines-are-sents", action='store_true',
        help="Add this if the lines are already segmented into sentences instead of paragraphs."
    )

    param = parser.parse_args()

    tokenize_dataset(
        param.datadir,
        param.fname,
        param.tokenizer_name,
        param.lines_are_sents,
    )



================================================
FILE: tokenization/tokenize_wiki103_bert.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki103_roberta.bash
================================================
DATA_DIR=data/wiki103-cased
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki_bert.bash
================================================
DATA_DIR=data/wiki-cased
TOKENIZER=bert-base-uncased
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER


================================================
FILE: tokenization/tokenize_wiki_roberta.bash
================================================
DATA_DIR=data/wiki-cased-untokenized/
TOKENIZER=roberta-base
python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER
python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER


================================================
FILE: vlm/__init__.py
================================================
import data


================================================
FILE: vlm/configs/bert-12L-768H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert-4L-768H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert-6L-512H.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 512,
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/configs/bert_base.json
================================================
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}


================================================
FILE: vlm/data.py
================================================
import copy
import os
import random

import h5py
import torch
from torch.utils.data import DataLoader, Dataset
import tqdm


class CoLDataset(Dataset):
    IGNORE_ID = -100
    sent_strategy = 'first'

    def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512,
                 split_sent=False, voken_dir=None, suffix=None, verbose=False,
                 voken_ablation=None):

        # Open token's hdf5
        token_path = file_path + '.' + tokenizer_name + '.hdf5'
        assert os.path.isfile(token_path)
        if verbose:
            print("-------- Load Data -------")
            print("Load tokens from", token_path)
        self.token_hdf5 = h5py.File(token_path, 'r')
        self.tokenizer = tokenizer
        self.tokens = self.token_hdf5['tokens']
        self.verbose = verbose
        self.voken_ablation = voken_ablation
        self._iter_cnt = 0

        # Open voken's hdf5 and load voken ids
        if voken_dir is not None:
            assert suffix is not None, 'Please provide the suffix of the vokens, e.g., vg_nococo.5000.'
            self.sent_level = 'sent' in voken_dir
            dset_fname = os.path.split(file_path)[-1]
            voken_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.hdf5")
            voken_ids_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.ids")
            if verbose:
                print("Load vokens from", voken_path)
            self.voken_hdf5 = h5py.File(voken_path, 'r')
            self.vokens = self.voken_hdf5['vokens']
            assert len(self.vokens) == len(self.tokens)
            self._voken_ids = list(
                map(lambda x: x.strip(),
                    open(voken_ids_path).readlines())
            )
            if verbose:
                print("\t with voken size", self.voken_size)
                print("\t top 5 voken ids are:", self._voken_ids[:5])
        else:
            self.vokens = None

        # Split into blocks of block_size tokens.
        # A final partial block (shorter than block_size) is dropped.
        num_tokens = len(self.tokens)
        self.starts = list(range(0, num_tokens, block_size))
        self.batches = list(zip(self.starts[:-1], self.starts[1:]))

        manual_filtered = False
        if "en.train.raw" in file_path and tokenizer_name == "bert-base-uncased":
            self.batches = manual_filter(self.batches)
            if verbose:
                print("Data: Mannually filter the range for counties.")
            manual_filtered = True

        # batch_info
        if verbose:
            print("Split sent with block size", block_size)
            print(f"Total batches: {len(self.batches)}")
            print(f"Total tokens: {len(self.tokens)}")
            if voken_dir is not None:
                print(f"Total vokens: {len(self.vokens)}")
            if voken_ablation is not None:
                print("The model will process voken ablation strategy:", voken_ablation)
            print()

        block_check(self.batches, block_size, fixed_size=True, manual_filtered=manual_filtered)
        if self.voken_ablation == 'token':
            self._voken_ids = list(range(30522))

    @property
    def voken_size(self):
        return len(self._voken_ids)

    @property
    def voken_ids(self):
        return copy.copy(self._voken_ids)

    def assert_equal_vokens(self, dataset):
        assert self.voken_size == dataset.voken_size
        for vid, vid1 in zip(self.voken_ids, dataset.voken_ids):
            assert vid == vid1

    def __len__(self):
        return len(self.batches) - 1

    def __getitem__(self, item):
        token_start, token_end = self.batches[item]
        if self._iter_cnt < 5 and self.verbose:
            print(f"Data Loader: data iteration {self._iter_cnt}, with range {token_start} to {token_end}.")
            self._iter_cnt += 1
        tokens = list(self.tokens[token_start: token_end])
        token_tensor = torch.tensor(
            self.tokenizer.build_inputs_with_special_tokens(tokens),
            dtype=torch.long)
        if self.vokens is not None:
            vokens = list(self.vokens[token_start: token_end])

            vokens = self.maybe_do_sent_level(vokens)
            vokens = self.maybe_do_ablation_study(vokens, tokens)

            voken_tensor = torch.tensor(
                [self.IGNORE_ID] + vokens + [self.IGNORE_ID],
                dtype=torch.long
            )

            return token_tensor, voken_tensor
        else:
            return token_tensor

    def maybe_do_sent_level(self, vokens):
        if not self.sent_level:
            return vokens
        else:
            if self.sent_strategy == 'all':
                vokens = [
                    (-voken-1 if voken < 0 else voken)
                    for voken in vokens
                ]
            elif self.sent_strategy == 'first':
                vokens = [
                    (self.IGNORE_ID if voken < 0 else voken)
                    for voken in vokens
                ]
            return vokens

    def maybe_do_ablation_study(self, vokens, tokens):
        if self.voken_ablation is None:
            return vokens
        else:
            if self._iter_cnt < 5 and self.verbose:
                print("Before voken ablation: ", vokens)
            if self.voken_ablation == 'random':
                vokens = [random.randint(0, self.voken_size - 1)
                          for _ in range(len(vokens))]
            elif self.voken_ablation == 'shuffle':
                random.shuffle(vokens)
            elif self.voken_ablation == 'reverse':
                vokens = vokens[::-1]
            elif self.voken_ablation == 'token':
                vokens = tokens
            if self._iter_cnt < 5 and self.verbose:
                print("After voken ablation: ", vokens)
            return vokens

    def get_item_info(self, item):
        # self.batches stores (start, end) pairs, so unpack directly;
        # indexing self.batches[item + 1] would return the wrong pair.
        token_start, token_end = self.batches[item]
        return token_start, token_end

    def __del__(self):
        self.token_hdf5.close()
        if self.vokens is not None:
            self.voken_hdf5.close()
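
# Illustrative only (assumed file locations and tokenizer): a minimal sketch of
# wiring CoLDataset into a DataLoader for pure language modeling (no vokens).
def _example_loader(file_path='data/wiki103-cased/wiki.valid.raw',
                    tokenizer_name='bert-base-uncased'):
    from transformers import AutoTokenizer  # local import to keep module deps unchanged
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # 126 tokens per block + 2 special tokens = 128-token model inputs.
    dataset = CoLDataset(file_path, tokenizer_name, tokenizer,
                         block_size=126, verbose=True)
    return DataLoader(dataset, batch_size=4, shuffle=True)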


FORBIDDEN_RANGE = (
    119314944,      # Start of iter 3700
    187053048       # End of iter 5800
)


def intersect(x, y):
    """Whether the half-open ranges x = [x1, x2) and y = [y1, y2) overlap."""
    x1, x2 = x
    y1, y2 = y
    if x2 <= y1 or x1 >= y2:
        # Case 1: [   x    )[   y    )
        # Case 2: [   y    )[   x    )
        return False
    return True


def manual_filter(batches):
    batches = list(filter(
        lambda x: not intersect(x, FORBIDDEN_RANGE),
        batches
    ))
    return batches


def block_check(batches, block_size, fixed_size=False, manual_filtered=False):
    """
    Check whether the batches satisfy following requirements.
        1. Monotonic
        2. Mutually exclusive
        3. Range < block_size
    """
    last_end = 0
    for start_token, end_token in batches:
        assert last_end <= start_token
        if fixed_size:
            assert (end_token - start_token) == block_size, 'len([%d, %d)) != %d' % (start_token, end_token, block_size)
        else:
            assert (end_token - start_token) <= block_size, 'len([%d, %d)) > %d' % (start_token, end_token, block_size)
        if manual_filtered:
            assert not intersect((start_token, end_token), FORBIDDEN_RANGE)
        last_end = end_token


def get_voken_feats(dataset: CoLDataset, feat_dir: str):
    """
    Load the pre-extracted visual features for the image ids of the vokens.
    """
    set2id2feat = {}
    voken_feats = []
    for voken_id in dataset.voken_ids:
        voken_img_set, voken_img_id = voken_id.split('/')
        if voken_img_set not in set2id2feat:
            img_ids = list(map(
                lambda x: x.rstrip(),
                open(os.path.join(feat_dir, f"{voken_img_set}.ids"))
            ))
            img_feats = h5py.File(
                os.path.join(feat_dir, f"{voken_img_set}.hdf5"), 'r'
            )['keys'][:]
            id2feat = {}
            assert len(img_ids) == len(img_feats)
            for img_id, img_feat in zip(img_ids, img_feats):
                id2feat[img_id] = img_feat
            set2id2feat[voken_img_set] = id2feat
        voken_feats.append(set2id2feat[voken_img_set][voken_img_id])
    return voken_feats
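
# Expected layout of `feat_dir` (inferred from the loop above): for every image
# set S that occurs in voken ids of the form "S/img_id", a pair of files
#   S.ids   -- one image id per line, and
#   S.hdf5  -- an HDF5 file whose 'keys' dataset is aligned with S.ids.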





================================================
FILE: vlm/model.py
================================================
import math

import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss
from torch import nn
from transformers import (
    BertConfig,
    BertForMaskedLM,
)

from transformers.modeling_bert import BertOnlyMLMHead


BertLayerNorm = torch.nn.LayerNorm


# The gelu function below is copied from huggingface transformers:
# https://github.com/huggingface/transformers/blob/c6acd246ec90857b70f449dcbcb1543f150821fc/src/transformers/activations.py
def _gelu_python(x):
    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


if torch.__version__ < "1.4.0":
    gelu = _gelu_python
else:
    gelu = F.gelu


class CoLBertConfig(BertConfig):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.voken_size = None
        self.voken_dim = None
        self.do_voken_cls = False
        self.do_voken_reg = False
        self.do_voken_ctr = False
        self.shared_head = False
        self.verbose = False


class BertSharedHead(BertOnlyMLMHead):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_ctr = config.do_voken_ctr

        assert int(self.do_voken_cls) + int(self.do_voken_ctr) == 1
        if self.do_voken_cls:
            self.visn_decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)

        if self.do_voken_ctr:
            self.visn_decoder = nn.Linear(config.voken_dim, config.hidden_size, bias=True)

    def forward(self, features, **kwargs):
        """
        :param features: [batch, length, dim]
        :return: lang_scores [batch, length, vocab_size],
                 visn_scores [batch, length, voken_size]
        """
        x = self.predictions.transform(features)    # batch_size, length, dim

        lang_scores = self.predictions.decoder(x) + self.predictions.bias

        if self.do_voken_cls:
            visn_scores = self.visn_decoder(x)
        elif self.do_voken_ctr:
            voken_feats = kwargs['voken_feats']
            y = self.visn_decoder(voken_feats)  # voken_size, dim
            visn_scores = torch.einsum('bik,jk->bij', x, y)
        else:
            assert False

        return lang_scores, visn_scores


class BertVLMClassificationHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
        # self.decoder = nn.Sequential(
        #     nn.Linear(config.hidden_size, 256, bias=True),
        #     nn.Linear(256, config.voken_size, bias=True),
        # )
        if config.verbose:
            print(f"VLM Classification Head: Build model with voken_size {config.voken_size}")

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder(x)

        return x


class BertVLMContrastiveHeadNew(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.joint_dim = 512
        print(f"Contrastive Head: Using joint dim {self.joint_dim}")
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, self.joint_dim)
        self.layer_norm_x = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)

        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)
        self.layer_norm_y = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm_x(x)

        # Process the pre-trained voken feats.
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 512]
        y = self.layer_norm_y(y)

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size

        return score


class BertVLMContrastiveHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.joint_dim = 64
        self.decoder_bert_output = nn.Linear(config.hidden_size, self.joint_dim, bias=False)
        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder_bert_output(x)                   # [b, l, f] --> [b, l, 64]

        # Process the pre-trained voken feats.
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 64]

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size

        return score
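
# Illustrative only (hypothetical sizes): a quick shape check for the
# contrastive heads; the scores come back as [batch, length, voken_size].
def _example_ctr_shapes():
    config = CoLBertConfig(hidden_size=768)
    config.voken_size, config.voken_dim = 100, 64
    head = BertVLMContrastiveHeadNew(config)
    scores = head(torch.zeros(2, 8, 768), torch.zeros(100, 64))
    assert scores.shape == (2, 8, 100)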


class BertVLMRegressionHead(nn.Module):
    """Bert Head for masked language modeling."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        self.decoder = nn.Linear(config.hidden_size, config.voken_dim, bias=True)

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)

        # project to the voken feature dimension (with bias)
        x = self.decoder(x)

        return x


class CoLwithBert(BertForMaskedLM):
    config_class = CoLBertConfig

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_reg = config.do_voken_reg
        self.do_voken_ctr = config.do_voken_ctr
        self.shared_head = config.shared_head
        self.verbose = config.verbose

        if self.verbose:
            print(f"Model: do voken cls -- {self.do_voken_cls}, do_voken_reg -- {self.do_voken_reg},"
                  f" do voken ctr -- {self.do_voken_ctr}")

        self.token_cls_loss_fct = CrossEntropyLoss()

        if self.shared_head:
            if self.verbose:
                print("Model: Using shared head for Voken and Token predictions.")
            self.cls = BertSharedHead(config)
            # Reinit the weight of the new head.
            self.init_weights()
        else:
            # Voken Classification
            if config.do_voken_cls:
                self.visual_cls_head = BertVLMClassificationHead(config)

            # Voken Regression
            if config.do_voken_reg:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_reg_head = BertVLMRegressionHead(config)

            # Voken Contrastive
            if config.do_voken_ctr:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_ctr_head = BertVLMContrastiveHeadNew(config)

        # Build voken features embeddings if needed.
        if self.do_voken_ctr or self.do_voken_reg:
            # The voken emb will be preloaded by func "init_voken_feat_emb"
            self.voken_feat_emb = nn.Embedding(
                config.voken_size,
                config.voken_dim
            )
            # Freeze this embedding
            for p in self.voken_feat_emb.parameters():
                p.requires_grad = False

        # Build Loss functions
        if config.do_voken_cls:
            # Voken Classification
            self.voken_cls_loss_fct = CrossEntropyLoss()
        if config.do_voken_reg:
            # Voken Regression
            self.voken_reg_loss_fct = SmoothL1Loss(reduction='none')
            # self.voken_reg_loss_fct = torch.nn.L1Loss(reduction='none')
        if config.do_voken_ctr:
            # Voken Contrastive
            self.voken_ctr_loss_fct = CrossEntropyLoss()

    def init_voken_feat_emb(self, feats):
        if self.verbose:
            print(f"Model: load the voken features with shape {feats.shape}")
            print("\tBefore Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
        assert feats.shape == (self.config.voken_size, self.config.voken_dim)
        self.voken_feat_emb.weight.data[:] = torch.Tensor(feats)
        self.original_voken_feats = torch.Tensor(feats).clone()
        self.original_voken_feats = self.original_voken_feats.half()
        if self.verbose:
            print("\tAfter Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())
            print("\tThe 1st, 2nd, and last voken feats are: ")
            print("\t", self.voken_feat_emb.weight[0])
            print("\t", self.voken_feat_emb.weight[1])
            print("\t", self.voken_feat_emb.weight[-1])
        assert not self.voken_feat_emb.weight.requires_grad
        # print(self.voken_feat_emb.weight.dtype)
        # assert torch.all(torch.eq(self.voken_feat_emb.weight.cuda(),
        #                           self.original_voken_feats)), "The voken feats have been updated during training."

    def to(self, *args, **kwargs):
        # Keep the frozen reference copy of the voken feats on the same device/dtype.
        if self.do_voken_ctr or self.do_voken_reg:
            self.original_voken_feats = self.original_voken_feats.to(*args, **kwargs)
        return super().to(*args, **kwargs)

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            masked_lm_labels=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            lm_labels=None,
            voken_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        sequence_output = outputs[0]

        if not self.shared_head:
            voken_loss = 0.
            if self.do_voken_cls:
                assert voken_labels is not None
                voken_scores = self.visual_cls_head(sequence_output)
                voken_cls_loss = self.voken_cls_loss_fct(voken_scores.view(-1, self.config.voken_size), voken_labels.view(-1))
                voken_loss += voken_cls_loss

            if self.do_voken_reg:
                assert voken_labels is not None
                voken_prediction = self.visual_reg_head(sequence_output)

                # Get the mask and pre-trained features
                voken_label_mask = (voken_labels != -100)               # Get a mask of [0, 1, 1, ...., 1, 0], [b, len]
                safe_voken_labels = voken_labels.clone()
                safe_voken_labels[~voken_label_mask] = 0
                voken_feats = self.voken_feat_emb(safe_voken_labels)         # [b, len] --> [b, len, f]

                # Loss
                voken_reg_loss = self.voken_reg_loss_fct(voken_prediction, voken_feats)   # [b, len, f]

                # [b, l, f] * ([b,l] --> [b, l, 1]) = [b, l, f]
                voken_reg_loss = (voken_reg_loss * voken_label_mask.float().unsqueeze(-1))

                # [b, l, f] --sum-> [b, l] --mean-> [1,]
                voken_reg_loss = voken_reg_loss.sum(-1).mean()

                voken_loss += voken_reg_loss

            if self.do_voken_ctr:
                assert torch.all(torch.eq(self.voken_feat_emb.weight,
                                          self.original_voken_feats)), "The voken feats have been updated during training."

                voken_scores = self.visual_ctr_head(
                    sequence_output, self.voken_feat_emb.weight
                )
                voken_ctr_loss = self.voken_ctr_loss_fct(
                    voken_scores.view(-1, self.config.voken_size),
                    voken_labels.view(-1)
                )
                voken_loss += voken_ctr_loss

            if masked_lm_labels is not None:
                prediction_scores = self.cls(sequence_output)
                token_loss = self.token_cls_loss_fct(
                    prediction_scores.view(-1, self.config.vocab_size),
                    masked_lm_labels.view(-1))
            else:
                token_loss = torch.tensor(0.)
        else:
            voken_loss, token_loss = self.calculate_shared_loss(
                sequence_output,
                masked_lm_labels,
                voken_labels,
            )

        return voken_loss, token_loss

    def calculate_shared_loss(self, sequence_output, masked_lm_labels, voken_labels):
        if self.do_voken_cls:
            lang_scores, visn_scores = self.cls(sequence_output)
        else:
            lang_scores, visn_scores = self.cls(
                sequence_output,
                voken_feats=self.voken_feat_emb.weight
            )

        assert voken_labels is not None

        voken_loss_func = self.voken_cls_loss_fct if self.do_voken_cls else self.voken_ctr_loss_fct
        voken_loss = voken_loss_func(
            visn_scores.view(-1, self.config.voken_size),
            voken_labels.view(-1)
        )

        if masked_lm_labels is not None:
            token_loss = self.token_cls_loss_fct(
                lang_scores.view(-1, self.config.vocab_size),
                masked_lm_labels.view(-1)
            )
        else:
            token_loss = torch.tensor(0.)

        return voken_loss, token_loss
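
# Illustrative only (toy config, random data): a minimal sketch of a CoLwithBert
# forward pass under the voken-classification setting used in the paper.
def _example_col_forward():
    config = CoLBertConfig(hidden_size=128, num_hidden_layers=2,
                           num_attention_heads=2, intermediate_size=256)
    config.do_voken_cls = True
    config.voken_size = 50
    model = CoLwithBert(config)
    input_ids = torch.randint(0, config.vocab_size, (2, 16))
    masked_lm_labels = torch.randint(0, config.vocab_size, (2, 16))
    voken_labels = torch.randint(0, config.voken_size, (2, 16))
    voken_loss, token_loss = model(input_ids=input_ids,
                                   masked_lm_labels=masked_lm_labels,
                                   voken_labels=voken_labels)
    return voken_loss, token_loss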


class SimpleBertForMaskedLM(BertForMaskedLM):

    def __init__(self, config):
        super().__init__(config)

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            masked_lm_labels=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            lm_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        sequence_output = outputs[0]

        prediction_scores = self.cls(sequence_output)
        loss_fct = CrossEntropyLoss()
        token_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))

        return token_loss,


================================================
FILE: vlm/param.py
================================================
import argparse


def process_args():
    parser = argparse.ArgumentParser()

    # Datasets
    parser.add_argument(
        "--train_data_file", default=None, type=str,
        help="The input training data file (a text file).")
    parser.add_argument(
        "--eval_data_file", default=None, type=str,
        help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")

    # Data loader
    parser.add_argument("--col_data", action="store_true", help="Using the specific dataset object in data.py")
    parser.add_argument("--split_sent", action="store_true", help="Overwrite the cached training and evaluation sets")
    parser.add_argument("--shuffle", action="store_true", help="Shuffle the training dataset")
    parser.add_argument(
        "--block_size", default=-1, type=int,
        help="Optional input sequence length after tokenization."
             "The training dataset will be truncated in block of this size for training."
             "Default to the model max input length for single sentence inputs (take into account special tokens).",
    )

    # Logging and Saving
    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument(
        "--output_dir", type=str,
        help="The output directory where the model predictions and checkpoints will be written.",)
    parser.add_argument(
        "--overwrite_output_dir", action="store_true",
        help="Overwrite the content of the output directory")

    # Model types
    parser.add_argument(
        "--model_type", type=str, help="The model architecture to be trained or fine-tuned.",)
    parser.add_argument(
        "--should_continue", action="store_true", help="Whether to continue from latest checkpoint in output_dir")
    parser.add_argument(
        "--model_name_or_path", default=None, type=str,
        help="The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.",)
    parser.add_argument(
        "--config_name", default=None, type=str,
        help="Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.",)
    parser.add_argument(
        "--tokenizer_name", default=None, type=str,
        help="Optional pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.",)
    parser.add_argument(
        "--cache_dir", default=None, type=str,
        help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)",)
    parser.add_argument(
        "--overwrite_cache", action="store_true",
        help="Overwrite the cached training and evaluation sets")

    # MLM tasks
    parser.add_argument(
        "--mlm", action="store_true", help="Train with masked-language modeling loss instead of language modeling.")
    parser.add_argument(
        "--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss")
    parser.add_argument(
        "--mlm_ratio", type=float, default=1., help="The ratio of mlm loss in the total loss.")

    # VLM related params
    parser.add_argument("--voken_dir", type=str, default='snap1/coco_hinge05_dim64_resxt101_robertal4/vokens',
                        help='Where the vokens are saved')
    parser.add_argument("--voken_suffix", type=str, default='vg_nococo.10000',
                        help='The suffix after the voken file, e.g., en.train.raw.{suffix} where suffix==vgcoco.1000')
    parser.add_argument("--voken_labels", type=str, default='all',
                        help='all: Calculate voken loss for all tokens;'
                             'mask: Calculate voken loss for masked tokens.'
                             'nonmask: Calculate voken loss for non-masked tokens.')
    parser.add_argument("--voken_feat_dir", type=str, default=None,
                        help='Where the vokens are saved')
    parser.add_argument("--do_voken_cls", action='store_true', help='Will do voken classification task')
    parser.add_argument("--do_voken_reg", action='store_true', help='Will do voken regression task (not used in this paper)')
    parser.add_argument("--do_voken_ctr", action='store_true', help='Will do voken contrastive task (not used in this paper)')
    parser.add_argument("--shared_head", action='store_true', help='Share the head if more than one tasks (e.g., cls, reg, ctr) are used (not used in this paper)')

    # Batch Size and Training Steps
    parser.add_argument("--seed", type=int, default=95, help="random seed for initialization")
    parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, help="Batch size per GPU/CPU for evaluation.")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",)
    parser.add_argument("--num_train_epochs", default=1.0, type=float, help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1, type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",)

    # Optimizer
    parser.add_argument("--lamb", action="store_true", help='Use the LAMB optimizer in apex')
    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
    parser.add_argument("--warmup_ratio", default=0., type=float, help="Linear warmup over warmup_steps.")
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")

    # Distributed Training
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--nodes", type=int, default=1)
    parser.add_argument("--nr", type=int, default=0)

    # Half Precision
    parser.add_argument(
        "--fp16", action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",)
    parser.add_argument(
        "--fp16_opt_level", type=str, default="O1",
        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
             "See details at https://nvidia.github.io/apex/amp.html",)

    # Ablation Study
    parser.add_argument("--voken_ablation", default=None,
                        help="random, shuffle, reverse, token")


    args = parser.parse_args()
    return args
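
# Illustrative only: a sketch of how these flags combine in a typical
# voken-classification run (flag names are from the parser above; the entry
# point that consumes process_args() is presumably vlm/run_vlm_distributed.py):
#
#   python vlm/run_vlm_distributed.py --do_train --mlm --col_data \
#       --do_voken_cls --voken_dir <voken_dir> --voken_suffix vg_nococo.10000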


================================================
FILE: vlm/run_glue.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa)."""


import argparse
import glob
import json
import logging
import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from transformers import (
    MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    glue_compute_metrics as compute_metrics,
    glue_convert_examples_to_features as convert_examples_to_features,
    glue_output_modes as output_modes,
    glue_processors as processors,
)
# from transformers import glue_compute_metrics as compute_metrics
# from transformers import glue_convert_examples_to_features as convert_examples_to_features
# from transformers import glue_output_modes as output_modes
# from transformers import glue_processors as processors


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)

#MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys())
#MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

#ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def train(args, train_dataset, model, tokenizer):
    """ Train the model """
    # if args.local_rank in [-1, 0]:
    #    tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    num_warmup_steps = int(t_total * args.warmup_steps)  # NOTE: warmup_steps is used as a ratio of t_total here
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    #if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
        #os.path.join(args.model_name_or_path, "scheduler.pt")
    #):
        ## Load in optimizer and scheduler states
        #optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        #scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    #if os.path.exists(args.model_name_or_path):
        # set global_step to global_step of last saved checkpoint from model path
        #try:
            #global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
        #except ValueError:
            #global_step = 0
        #epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
        #steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

        #logger.info("  Continuing training from checkpoint, will skip to saved global_step")
        #logger.info("  Continuing training from epoch %d", epochs_trained)
        #logger.info("  Continuing training from global step %d", global_step)
        #logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)

    tr_loss, logging_loss = 0.0, 0.0
    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            model.train()
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
            if args.model_type != "distilbert":
                inputs["token_type_ids"] = (
                    batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
                )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0 or (
                # last step in epoch but step is always smaller than gradient_accumulation_steps
                len(epoch_iterator) <= args.gradient_accumulation_steps
                and (step + 1) == len(epoch_iterator)
            ):
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    logs = {}
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            eval_key = "eval_{}".format(key)
                            logs[eval_key] = value

                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
                    learning_rate_scalar = scheduler.get_lr()[0]
                    logs["learning_rate"] = learning_rate_scalar
                    logs["loss"] = loss_scalar
                    logging_loss = tr_loss

                    #for key, value in logs.items():
                        #tb_writer.add_scalar(key, value, global_step)
                    print(json.dumps({**logs, **{"step": global_step}}))

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    #if args.local_rank in [-1, 0]:
        #tb_writer.close()

    return global_step, tr_loss / global_step


def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
                if args.model_type != "distilbert":
                    inputs["token_type_ids"] = (
                        batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None
                    )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        print(eval_output_dir, prefix)
        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        "cached_{}_{}_{}_{}".format(
            "dev" if evaluate else "train",
            #list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.tokenizer_name,
            str(args.max_seq_length),
            str(task),
        ),
    )
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = (
            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        )
        features = convert_examples_to_features(
            examples,
            tokenizer,
            label_list=label_list,
            max_length=args.max_seq_length,
            output_mode=output_mode,
            # pad_on_left=bool(args.model_type in ["xlnet"]),  # pad on the left for xlnet
            # pad_token=tokenizer.pad_token_id,
            # pad_token_segment_id=tokenizer.pad_token_type_id,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)
    for i in range(3):
        print('ids:', features[i].input_ids)
        print('tokens:', tokenizer.convert_ids_to_tokens(features[i].input_ids))
        print('att:', features[i].attention_mask)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset


def main():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
    )
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        #help="Model type selected in the list: " + ", ".join(MODEL_TYPES),
    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        #help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
    )
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    # Other parameters
    parser.add_argument(
        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
    )
    parser.add_argument(
        "--tokenizer_name",
        default="",
        type=str,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--cache_dir",
        default="",
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.",
    )
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument(
        "--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
    )
    parser.add_argument(
        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
    )

    parser.add_argument(
        "--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
    )
    parser.add_argument(
        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps", default=0, type=float, help="Linear warmup over warmup_steps.")

    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--eval_all_checkpoints",
        action="store_true",
        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
    )
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument("--from_scratch", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument(
        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
    )
    parser.add_argument(
        "--nopooler", action="store_true", help="Do not load the pooler",
    )
    parser.add_argument("--seed", type=int, default=9595, help="random seed for initialization")

    parser.add_argument(
        "--fp16",
        action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
    )
    parser.add_argument(
        "--fp16_opt_level",
        type=str,
        default="O1",
        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
        "See details at https://nvidia.github.io/apex/amp.html",
    )
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
    parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
    args = parser.parse_args()

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup distant debugging if needed
    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd

        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    # Setup CUDA, GPU & distributed training
    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        torch.distributed.init_process_group(backend="nccl")
        args.n_gpu = 1
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    # Prepare GLUE task
    args.task_name = args.task_name.lower()
    if args.task_name not in processors:
        raise ValueError("Task not found: %s" % (args.task_name))
    processor = processors[args.task_name]()
    args.output_mode = output_modes[args.task_name]
    label_list = processor.get_labels()
    num_labels = len(label_list)

    # Load pretrained model and tokenizer
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    args.model_type = args.model_type.lower()
    config = AutoConfig.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )

    if args.nopooler:
        model.bert.pooler.apply(model._init_weights)

    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        # Create output directory if needed
        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(args.output_dir)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
            prefix = prefix if 'checkpoint' in prefix else ''

            model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results


if __name__ == "__main__":
    main()


================================================
FILE: vlm/run_glue_epochs.py
================================================
import argparse
import math
import os
from pathlib import Path
from pprint import pprint
import subprocess
import threading
import time

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--load", default=None, type=str,
    help="The model loaded, e.g., snap/vlm/wiki103_small"
)
parser.add_argument(
    "--gpus", default=None, type=str,
    help="The list of GPU ids, separated by comma, e.g., '2,3'"
)
parser.add_argument(
    "--snaps", default=1, type=int,
    help="The number of snaps evaluated with GLUE benchmark. "
         "-1 means all."
)
parser.add_argument(
    "--start-from", default=0, type=int
)
args = parser.parse_args()

if args.gpus is None:
    # Get all gpus available in this server.
    num_gpus = torch.cuda.device_count()
    # Device ids are labeled from 0 to num_gpus - 1.
    available_gpus = list(range(num_gpus))
else:
    available_gpus = [int(gpu_id) for gpu_id in args.gpus.split(",")]
    num_gpus = len(available_gpus)

resource = threading.Semaphore(num_gpus)
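# Each worker thread holds one semaphore slot (i.e., one GPU) while a GLUE run is in flight.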


def get_snap_paths(load):
    load_path = Path(load)
    paths = []
    for dir_path in load_path.iterdir():
        if dir_path.name.startswith("checkpoint-"):
            paths.append(dir_path)
    return paths


def sorted_paths(paths):
    pathXkey = []
    for path in paths:
        name = path.name
        identifier = name[len("checkpoint-"):]
        if identifier == 'last':
            continue
        if 'epoch' in identifier:
            key = identifier
        else:
            key = int(identifier)
        pathXkey.append((path, key))
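    # Note (assumption): a snapshot dir is expected to contain either all-numeric
    # or all-"epoch" identifiers; mixing int and str keys would make this sort
    # raise a TypeError under Python 3.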
    pathXkey = sorted(pathXkey, key=lambda x: x[1])
    paths = list(map(lambda x: x[0], pathXkey))
    return paths


def get_test_paths(paths, snaps):
    """
    Return $snaps paths to be tested on GLUE
    """
    if snaps == -1:
        return paths
    interval = len(paths) * 1. / snaps
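    # Illustrative example: with len(paths) == 10 and snaps == 3, interval == 10/3
    # and the loop below picks indices ceil(10/3 * i) - 1 = 3, 6, 9, i.e. evenly
    # spaced snapshots ending at the last one.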
    test_paths = []
    for i in range(1, snaps+1):
        idx = int(math.ceil(interval * i)) - 1
        test_paths.append(paths[idx])
    return test_paths


# Get all paths that need to be processed
paths = get_snap_paths(args.load)
paths = sorted_paths(paths)
paths = paths[args.start_from:]
paths = get_test_paths(paths, args.snaps)
paths = paths[::-1]         # Run the last epochs first.
path_lock = threading.Lock()


def run_glue():
    while True:
        # Only one atomic operation (list.pop) happens here, so no lock is needed;
        # a semaphore is enough to control the resources.
        resource.acquire()
        gpu_id = available_gpus.pop(0)

        # This involves multiple atomic operations (list.__len__, list.pop),
        # so a lock is introduced here.
        path_lock.acquire()
        if len(paths) > 0:
            path = paths.pop(0)
        else:
            path_lock.release()
            break
        path_lock.release()

        model = path.parent
        ckpt = path.name
        print(gpu_id, model, ckpt)
        process = subprocess.Popen(
            ['bash',
             'scripts/run_glue_at_epoch.bash',
             str(gpu_id),    # Use GPU
             '3',            # Number of epochs
             model,
             ckpt
             ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()

        available_gpus.append(gpu_id)
        resource.release()

        # Sleeping here allows the script (run_glue_at_epoch.bash) to fully finish,
        # so that all GPU memory it used is released.
        time.sleep(5)
    return


# Allocate as many threads as GPUs.
threads = []
for _ in range(num_gpus):
    threads.append(
        threading.Thread(target=run_glue)
    )
for thread in threads:
    thread.start()

# Join the worker threads so the main thread waits for all of them to finish.
for thread in threads:
    thread.join()


================================================
FILE: vlm/run_lm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""


import argparse
import glob
import json
import logging
import os
import pickle
import random
import re
import shutil
import sys
from typing import Dict, List, Tuple
from datetime import datetime

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
    WEIGHTS_NAME,
    AdamW,
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    CamembertConfig,
    CamembertForMaskedLM,
    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForMaskedLM,
    DistilBertTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

sys.path.append(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
from vlm.data import CoLDataset
from vlm.param import process_args
from vlm.model import SimpleBertForMaskedLM


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)


MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (BertConfig, SimpleBertForMaskedLM, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}


class TextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
        assert os.path.isfile(file_path)

        block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
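        # Reserve room for the special tokens added later: for BERT,
        # max_len (512) - max_len_single_sentence (510) == 2, leaving space
        # for [CLS] and [SEP] in each block.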

        directory, filename = os.path.split(file_path)
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            with open(file_path, encoding="utf-8") as f:
                text = f.read()

            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

            for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
                self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))
            # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
            # If your dataset is small, first you should look for a bigger one :-) and second you
            # can change this behavior by adding (model specific) padding.
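            # Worked example (illustrative): with block_size == 510 and a corpus of
            # 1200 tokens, the loop yields blocks [0, 510) and [510, 1020); the final
            # 180 tokens are dropped.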

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)


class LineByLineTextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)


def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.col_data:
        return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
                          split_sent=args.split_sent,
                          verbose=(args.gpu == 0))
    elif args.line_by_line:
        return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)


def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:
    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """

    if tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = inputs.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, which defaults to 0.15 as in BERT/RoBERTa)
    probability_matrix = torch.full(labels.shape, args.mlm_probability)
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    if tokenizer._pad_token is not None:
        padding_mask = labels.eq(tokenizer.pad_token_id)
        probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
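    # (0.5 of the remaining 20% of masked positions = 10% overall, hence the 0.5 below)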
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels


def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    set_seed(args)  # Added here for reproducibility

    if args.gpu == 0:
        current_time = datetime.now().strftime('%b%d_%H-%M-%S')
        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)

    args.train_batch_size = args.per_gpu_train_batch_size

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    if args.shuffle:
        logger.info(f"Shuffle the dataset in training,"
                       f"GPU: {args.gpu},"
                       f"Rank: {args.rank},"
                       f"Total: {args.world_size}")
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=args.rank,
        shuffle=args.shuffle,
    )
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
        batch_size=args.train_batch_size, collate_fn=collate, pin_memory=True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters,
                      # betas=(0.9, 0.98),
                      lr=args.learning_rate,
                      eps=args.adam_epsilon)
    if args.warmup_ratio > 0.:
        assert args.warmup_steps == 0
        args.warmup_steps = int(t_total * args.warmup_ratio)
    if args.gpu == 0:
        print("Optimized with lr %f, steps %d, warmup steps %d, and use beta, epsilon %0.8f." % (
            args.learning_rate, t_total, args.warmup_steps, optimizer.defaults['eps']
        ), optimizer.defaults['betas'])
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level,
                                          verbosity=0)
        from apex.parallel import DistributedDataParallel as DDP
        model = DDP(model)
    else:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.gpu], find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * args.world_size
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    # Check if continuing training from a checkpoint
    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):
    #     try:
    #         # set global_step to global_step of last saved checkpoint from model path
    #         checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
    #         logger.info("  Continuing training from checkpoint, will skip to saved global_step")
    #         logger.info("  Continuing training from epoch %d", epochs_trained)
    #     except ValueError:
    #         logger.info("  Do not load model from %s, restart training" % args.model_name_or_path)

    # model_to_resize = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    # model_to_resize.resize_token_embeddings(len(tokenizer))

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
    )
    for epoch in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
        tr_loss, logging_loss = 0.0, 0.0
        model.zero_grad()       # Support of accumulating gradients
        for step, batch in enumerate(epoch_iterator):
            inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            # If some of the input is padded, then the attention mask is needed
            attention_mask = (inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0
            if attention_mask.all():
                attention_mask = None
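            # With attention_mask=None, the model internally assumes an all-ones mask, so nothing is lost when there is no padding.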

            if epoch == 0 and step < 3 and args.gpu == 0:
                print(inputs.shape)
                print(inputs[0])
                print(tokenizer.convert_ids_to_tokens(inputs[0].cpu().numpy()))
                print(labels[0])
                print(attention_mask)

            model.train()
            outputs = model(inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.max_grad_norm > 0.:
                    if args.fp16:
                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                    else:
                        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
                    # Log metrics
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    if args.fp16:
                        try:
                            from apex.amp import _amp_state
                            tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
                            tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
                        except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state, "
                                           "would not state the loss_scale in the log")
                    if args.max_grad_norm > 0.:  # Only clip the grad when it is valid
                        tb_writer.add_scalar("grad_norm", total_norm, global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

            if args.max_steps > 0 and global_step >= args.max_steps:
                break

        # Save it each epoch
        if args.gpu == 0:
            # Save checkpoints
            checkpoint_name = "checkpoint-epoch%04d" % epoch
            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)
            last_path = os.path.join(args.output_dir, 'checkpoint-last')
            # if os.path.exists(last_path):
            #     print(last_path)
            #     os.remove(last_path)
            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)

            # Evaluate the model
            logger.info(" Training loss of Epoch %d: %0.4f" % (epoch, tr_loss / step))
            logger.info(" Evaluation Results of Epoch %d: " % epoch)
            results = evaluate(args, model, tokenizer)
            for key, value in results.items():
                tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                logger.info("\t %s: %0.4f" % (key, value))
            output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
            json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)

        if args.max_steps > 0 and global_step >= args.max_steps:
            epoch_iterator.close()
            train_iterator.close()
            break

    if args.gpu == 0:
        tb_writer.close()


def save_model(args, name, model, tokenizer, optimizer, scheduler):
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, name)
    os.makedirs(output_dir, exist_ok=True)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)

    # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
    # logger.info("Saving optimizer and scheduler states to %s", output_dir)


def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)

    args.eval_batch_size = args.per_gpu_eval_batch_size
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate
    )

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)
        # If some of the input is padded, then the attention mask is needed
        attention_mask = (inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0
        if attention_mask.all():
            attention_mask = None

        with torch.no_grad():
            outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss)).item()
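    # Perplexity is the exponential of the mean (masked-)LM cross-entropy over the eval set.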

    result = {"perplexity": perplexity}

    return result


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    args = process_args()
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
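    # Note: probing with connect_ex is inherently racy (the port could be taken
    # between this check and init_process_group), but it suffices for single-host runs.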
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Use all available GPUs for multi-process distributed training.
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
    mp.spawn(setup, nprocs=args.gpus, args=(args,))
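    # mp.spawn launches args.gpus worker processes; each calls setup(gpu, args) with its local index gpu in [0, args.gpus).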


def setup(gpu, args):
    if args.should_continue:
        args.model_name_or_path = 'checkpoint-last'

    # Setup CUDA, GPU & distributed training
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
    args.gpu = gpu                                  # Local device id.
    args.device = device                            # Local device object.
    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.
    torch.distributed.init_process_group(
        backend="nccl",
        init_method='env://',
        world_size=args.world_size,
        rank=args.rank
    )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.gpu == 0 else logging.WARN,
    )
    logger.warning(
        "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
        args.gpu, args.gpus, args.fp16,
    )

    # Set seed
    set_seed(args)

    # Load pretrained model and tokenizer.
    # Barrier to make sure only the first process in distributed training
    # downloads the model & vocab.
    if gpu != 0:
        torch.distributed.barrier()

    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    # Get Config
    if args.config_name:
        config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    elif args.model_name_or_path:
        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "Why do you want the default config?? Please use --config_name or --model_name_or_path"
        )

    # Get Tokenizer
    if args.tokenizer_name:
        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
        # This setup expects an uncased tokenizer (lower-cased tokens).
        assert tokenizer.init_kwargs.get("do_lower_case", False)
    elif args.model_name_or_path:
        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it,"
            "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
        )

    assert args.block_size <= tokenizer.max_len

    if args.model_name_or_path:
        model = model_class.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
            cache_dir=args.cache_dir,
        )
    else:
        logger.info("Training new model from scratch")
        model = model_class(config=config)

    model.to(args.device)

    # End of barrier to make sure only the first process waiting other processes
    if gpu == 0:
        torch.distributed.barrier()

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        # Barrier to make sure only the first process in distributed training process the dataset,
        # and the others will use the cache
        if gpu != 0:
            torch.distributed.barrier()
        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
        if gpu == 0:
            torch.distributed.barrier()

        train(args, train_dataset, model, tokenizer)

    # Evaluation
    if args.do_eval and gpu == 0:
        result = evaluate(args, model, tokenizer)


if __name__ == "__main__":
    main()


================================================
FILE: vlm/run_vlm_distributed.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

from datetime import datetime
import json
import logging
import os
import random
import sys
import time
from typing import Dict, List, Tuple

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    CamembertConfig,
    CamembertForMaskedLM,
    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForMaskedLM,
    DistilBertTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizer,
    get_linear_schedule_with_warmup,
)

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from vlm.data import CoLDataset, get_voken_feats
from vlm.param import process_args
from vlm.model import CoLBertConfig, CoLwithBert


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter


logger = logging.getLogger(__name__)


MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (CoLBertConfig, CoLwithBert, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}


def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,
                      split_sent=args.split_sent, voken_dir=args.voken_dir,
                      suffix=args.voken_suffix,
                      verbose=(args.gpu == 0),
                      voken_ablation=args.voken_ablation)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)


def mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: PreTrainedTokenizer, args) \
        -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """ Notice that this function would have a side affect of manipulating the Tensor tokens.
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """

    if tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = tokens.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, which defaults to 0.15 as in BERT/RoBERTa)
    probability_matrix = torch.full(labels.shape, args.mlm_probability)
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    if tokenizer._pad_token is not None:
        padding_mask = labels.eq(tokenizer.pad_token_id)
        probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    if args.voken_labels == 'mask':
        vokens[~masked_indices] = -100
    elif args.voken_labels == 'nonmask':
        vokens[masked_indices] = -100
    elif args.voken_labels == 'all':
        pass
    else:
        raise ValueError("Do not support the voken loss of type %s" % args.voken_labels)

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    tokens[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    tokens[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return tokens, labels, vokens


def train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset,
          model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    set_seed(args)  # Added here for reproducibility

    if args.gpu == 0:
        current_time = datetime.now().strftime('%b%d_%H-%M-%S')
        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)

    args.train_batch_size = args.per_gpu_train_batch_size

    def col_collate(examples):
        tokens, vokens = zip(*examples)
        if tokenizer._pad_token is None:
            tokens = pad_sequence(tokens, batch_first=True)
        else:
            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
        return tokens, vokens

    if args.shuffle:
        logger.info(f"Shuffle the dataset in training,"
                       f"GPU: {args.gpu},"
                       f"Rank: {args.rank},"
                       f"Total: {args.world_size}")
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=args.rank,
        shuffle=args.shuffle,
    )
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,
        batch_size=args.train_batch_size, collate_fn=col_collate, pin_memory=True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
        # args.num_train_epochs = 9595
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    if args.lamb:
        no_decay = ['bias', 'gamma', 'beta', 'LayerNorm']
    else:
        no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    if args.lamb:
        logger.info(f"Using LAMB Optimizer with max grad norm {args.max_grad_norm}")
        import apex
        optimizer = apex.optimizers.FusedLAMB(
            optimizer_grouped_parameters,
            lr=args.learning_rate,
            eps=args.adam_epsilon,
            max_grad_norm=args.max_grad_norm
        )
    else:
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=args.learning_rate,
                          #betas=(0.9, 0.98),
                          eps=args.adam_epsilon)
    if args.gpu == 0:
        print(f"Optimized with lr: {optimizer.defaults['lr']}, total steps: {t_total},"
              f" warmup steps: {args.warmup_steps}, epsilon {optimizer.defaults['eps']},"
              f" beta: {optimizer.defaults['betas']}, weight decay {args.weight_decay}.")
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
        from apex.parallel import DistributedDataParallel as DDP
        model = DDP(model)
    else:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.gpu], find_unused_parameters=True
        )

    # Allow not calculating the lm heads.
    if args.mlm_ratio == 0.:
        model.lm_head = None
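        # Note: at this point `model` is already the DDP wrapper, so this assignment
        # adds an attribute on the wrapper rather than removing a head inside the
        # wrapped module.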


    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * args.world_size
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    # Check if continuing training from a checkpoint
    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):
    #     try:
    #         # set global_step to global_step of last saved checkpoint from model path
    #         checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
    #         logger.info("  Continuing training from checkpoint, will skip to saved global_step")
    #         logger.info("  Continuing training from epoch %d", epochs_trained)
    #     except ValueError:
    #         logger.info("  Do not load model from %s, restart training" % args.model_name_or_path)

    model_to_resize = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    assert model_to_resize.config.vocab_size == len(tokenizer)
    # model_to_resize.resize_token_embeddings(len(tokenizer))

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0
    )
    set_seed(args)  # Added here for reproducibility
    LOSS_NAMES = ['token_loss', 'voken_loss', 'total_loss']
    for epoch in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0)
        tr_loss, logging_loss = np.zeros(len(LOSS_NAMES)), 0.0
        model.zero_grad()
        for step, (tokens, vokens) in enumerate(epoch_iterator):
            token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
            token_inputs = token_inputs.to(args.device)
            token_labels = token_labels.to(args.device) if args.mlm_ratio != 0. else None
            voken_labels = voken_labels.to(args.device)
            # If some of the input is padded, then the attention mask is needed
            attention_mask = (token_inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0
            if attention_mask.all():
                attention_mask = None

            if epoch == 0 and step < 3 and args.gpu == 0:
                print()
                print("Token inputs:", token_inputs.shape, token_inputs[0])
                print("Token inputs (in str): ", tokenizer.convert_ids_to_tokens(token_inputs[0].cpu().numpy()))
                print("Attention Mask:", attention_mask)
                print("Token Labels: ", token_labels[0] if token_labels is not None else token_labels)
                print("Token Labels (in str): ", tokenizer.convert_ids_to_tokens(token_labels[0].cpu().numpy()) if token_labels is not None else token_labels)
                print("Voken Labels: ", voken_labels[0])
                print()

            model.train()
            outputs = model(token_inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=token_labels,
                            voken_labels=voken_labels)
            voken_loss = outputs[0]
            token_loss = outputs[1]

            if args.mlm_ratio == 0.:
                loss = voken_loss
            else:
                loss = voken_loss + args.mlm_ratio * token_loss
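            # The MLM term is weighted by --mlm_ratio; mlm_ratio == 0. trains on voken classification alone (see the lm_head removal above).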

            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            # print(f"GPU: {args.gpu}, Global Step: {global_step + 1}, "
            #       f"Step: {step}, "
            #       f"Range: {train_dataset.get_item_info(step * args.world_size + args.gpu)}, "
            #       f"Loss: {loss.item()}, "
            #       f"Scaled Loss: {scaled_loss.item()}")

            tr_loss += np.array((token_loss.item() / args.gradient_accumulation_steps,
                                 voken_loss.item() / args.gradient_accumulation_steps,
                                 loss.item()))

            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.max_grad_norm > 0. and not args.lamb:
                    # Only clip the grad when it is valid and not using the LAMB optimizer,
                    # because the LAMB optimizer already applies grad clipping
                    if args.fp16:
                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                    else:
                        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                elif args.max_grad_norm <= 0. and step <= args.gradient_accumulation_steps:
                    logger.warning("Have not clipped the gradient because "
                                   "the max_grad_norm is set to %0.2f" % args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:
                    # Log metrics
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    if args.fp16:
                        try:
                            from apex.amp import _amp_state
                            tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step)
                            tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step)
                        except ImportError:
                            logger.warning("Cannot import apex.amp._amp_state, "
                                           "would not state the loss_scale in the log")
                    if args.max_grad_norm > 0. and not args.lamb:  # Only clip the grad when it is valid
                        tb_writer.add_scalar("grad_norm", total_norm, global_step)
                    interval_loss = (tr_loss - logging_loss) / args.logging_steps
                    for loss_idx, loss_name in enumerate(LOSS_NAMES):
                        tb_writer.add_scalar(loss_name, interval_loss[loss_idx], global_step)
                    logging_loss = tr_loss.copy()

            if args.max_steps > 0 and global_step >= args.max_steps:
                break

            # if step == 200:
            #     break
            #
        # Save it each epoch
        if args.gpu == 0:
            # Save checkpoints
            checkpoint_name = "checkpoint-epoch%04d" % epoch
            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)

            # last_path = os.path.join(args.output_dir, 'checkpoint-last')
            # if os.path.exists(last_path):
            #     os.remove(last_path)
            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)

            # Evaluate the model
            for loss_idx, loss_name in enumerate(LOSS_NAMES):
                logger.info(" Training %s of Epoch %d: %0.4f" % (
                    loss_name, epoch, tr_loss[loss_idx] / len(train_dataloader)))

            if args.do_eval:
                logger.info(" Evaluation Results of Epoch %d: " % epoch)
                old_eval_batch_size = args.per_gpu_eval_batch_size
                while args.per_gpu_eval_batch_size > 0:
                    try:
                        results = evaluate(args, valid_dataset, model, tokenizer)
                        break
                    except RuntimeError as e:
                        args.per_gpu_eval_batch_size = int(args.per_gpu_eval_batch_size / 2)
                        print("HALVE THE BATCH SIZE in EVAL.")
                        if args.per_gpu_eval_batch_size == 0:
                            raise e
                        time.sleep(5)
                args.per_gpu_eval_batch_size = old_eval_batch_size

                for key, value in results.items():
                    tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    logger.info("\t %s: %0.4f" % (key, value))
                tb_writer.add_scalar("epoch", epoch, global_step)
                output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json")
                json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)
            # Currently, only GPU 0 is responsible for the evaluation.
            # torch.cuda.empty_cache()
            # torch.distributed.barrier()
        else:
            pass
            # torch.cuda.empty_cache()
            # torch.distributed.barrier()

        if args.max_steps > 0 and global_step >= args.max_steps:
            epoch_iterator.close()
            train_iterator.close()
            break

    if args.gpu == 0:
        tb_writer.close()


def save_model(args, name, model, tokenizer, optimizer, scheduler):
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, name)
    os.makedirs(output_dir, exist_ok=True)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)

    # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
    # logger.info("Saving optimizer and scheduler states to %s", output_dir)


def evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
    torch.cuda.empty_cache() 
    # # Loop to handle MNLI double evaluation (matched, mis-matched)
    # eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)

    args.eval_batch_size = args.per_gpu_eval_batch_size
    # A SequentialSampler is used for evaluation since DistributedSampler samples randomly

    def col_collate(examples):
        tokens, vokens = zip(*examples)
        if tokenizer._pad_token is None:
            tokens = pad_sequence(tokens, batch_first=True)
        else:
            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)
        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)
        return tokens, vokens
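
    # Vokens are padded with -100, the conventional ignore_index of PyTorch's
    # cross-entropy losses, so padded positions are meant to be ignored by the
    # voken loss.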

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=col_collate
    )

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    total_token_loss = 0.0
    total_voken_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for tokens, vokens in tqdm(eval_dataloader, desc="Evaluating"):
        token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)
        token_inputs = token_inputs.to(args.device)
        token_labels = token_labels.to(args.device) if args.mlm_ratio != 0 else None
        voken_labels = voken_labels.to(args.device)
        # If some of the input is padded, then the attention mask is needed
        attention_mask = (token_inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0
        if attention_mask.all():
            attention_mask = None

        with torch.no_grad():
            outputs = model(token_inputs,
                            attention_mask=attention_mask,
                            masked_lm_labels=token_labels,
                            voken_labels=voken_labels)
            voken_loss = outputs[0]
            token_loss = outputs[1]

            total_voken_loss += voken_loss.item()
            total_token_loss += token_loss.item()

        nb_eval_steps += 1

    total_token_loss = total_token_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(total_token_loss)).item()
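    # i.e., standard LM perplexity: exp of the mean token-level cross-entropy
    # over all evaluation batches.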

    result = {"perplexity": perplexity,
              "voken_loss": total_voken_loss / nb_eval_steps}
    torch.cuda.empty_cache() 

    return result


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    args = process_args()
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Using all available gpus for multi-processing distributed
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
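    # mp.spawn launches one process per local GPU and passes the local rank
    # as the first positional argument of setup(gpu, args).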
    mp.spawn(setup, nprocs=args.gpus, args=(args,))


def setup(gpu, args):
    if args.should_continue:
        args.model_name_or_path = 'checkpoint-last'

    # Setup CUDA, GPU & distributed training
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
    args.gpu = gpu                                  # Local device id.
    args.device = device                            # Local device object.
    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.
    torch.distributed.init_process_group(
        backend="nccl",
        init_method='env://',
        world_size=args.world_size,
        rank=args.rank
    )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.gpu == 0 else logging.WARN,
    )
    logger.warning(
        "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s",
        args.gpu, args.gpus, args.fp16,
    )

    # Set seed
    set_seed(args)

    # Load pretrained model and tokenizer.
    # Barrier to make sure only the first process in distributed training
    # downloads the model & vocab.
    if gpu != 0:
        torch.distributed.barrier()
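        # Non-zero ranks block here and resume only once rank 0 reaches a
        # matching barrier below, so only one process downloads files.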

    # Use self-defined models, thus avoiding Auto***.
    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    # Next, we will initialize the training process in the following order:
    #   1. tokenizer --> 2. dataset --> 3. config --> 4. model.
    # because A) dataset relies on the tokenizer.special_tokens.
    #         B) config relies on the dataset.voken_size.

    # Get Tokenizer
    if args.tokenizer_name:
        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    elif args.model_name_or_path:
        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "You are instantiating a new {} tokenizer. This is not supported, "
            "but you can do it from another script, save it,"
            "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__)
        )

    assert args.block_size <= tokenizer.max_len

    # Barrier to make sure only the first process in distributed training
    # processes the dataset; the others will use the cache.
    if gpu != 0:
        torch.distributed.barrier()
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
    valid_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
    if gpu == 0:
        torch.distributed.barrier()

    # Assert that the vokens in the train and valid datasets are equal.
    valid_dataset.assert_equal_vokens(train_dataset)

    config_kwargs = {}
    if args.do_voken_reg or args.do_voken_ctr:
        assert args.voken_feat_dir is not None
        voken_feats = get_voken_feats(train_dataset, args.voken_feat_dir)
        config_kwargs['voken_dim'] = len(voken_feats[0])
        if gpu == 0:
            logger.info(f"Load voken feats from {args.voken_feat_dir}"
                        f"with {len(voken_feats)} features and dimension {len(voken_feats[0])}")

    # Get Config
    if args.config_name:
        config = config_class.from_pretrained(
            args.config_name,
            cache_dir=args.cache_dir,
            voken_size=train_dataset.voken_size,
            do_voken_cls=args.do_voken_cls,
            do_voken_reg=args.do_voken_reg,
            do_voken_ctr=args.do_voken_ctr,
            shared_head=args.shared_head,
            verbose=(args.gpu == 0),
            **config_kwargs
        )
    elif args.model_name_or_path:
        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
            "Why do you want the default config?? Please use --config_name or --model_name_or_path"
        )

    if args.model_name_or_path:
        logger.info(f"Training model from the weight {args.model_name_or_path}.")
        model = model_class.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
            cache_dir=args.cache_dir,
        )
    else:
        logger.info("Training new model from scratch")
        model = model_class(config=config)

    if args.do_voken_reg or args.do_voken_ctr:
        voken_feats = torch.tensor(voken_feats)
        model.init_voken_feat_emb(voken_feats)

    model.to(args.device)

    # End of barrier to make sure only the first process waiting other processes
    if gpu == 0:
        torch.distributed.barrier()

    if args.model_name_or_path:
        if gpu == 0:
            logger.info("Evaluate the performance of the loaded model.")
            results = evaluate(args, valid_dataset, model, tokenizer)
            for key, value in results.items():
                logger.info("\t %s: %0.4f" % (key, value))
            torch.distributed.barrier()
        else:
            torch.distributed.barrier()

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train(args, train_dataset, valid_dataset, model, tokenizer)

    # Evaluation
    if args.do_eval and gpu == 0:
        results = evaluate(args, valid_dataset, model, tokenizer)
        for key, value in results.items():
            logger.info("\t %s: %0.4f" % (key, value))


if __name__ == "__main__":
    main()


================================================
FILE: vlm/show_glue_results_epochs.py
================================================
import os
from pathlib import Path

root = Path(
    'snap'
)

task2major = {
    'QQP': 'acc_and_f1',
    'STS-B': 'corr',
    'MRPC': 'acc_and_f1',
}

# The tasks sorted by the amount of data
all_tasks = [
    # 'WNLI',
    'RTE',
    'MRPC',
    'STS-B',
    'CoLA',
    'SST-2',
    'QNLI',
    'QQP',
    'MNLI',
    'MNLI-MM',
]


def print_result(glue_dir):
    print(glue_dir)
    results = {}
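    # Each task directory may hold an eval_results.txt of "metric = value"
    # lines; for tasks listed in task2major, only that major metric is kept.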
    for task in glue_dir.iterdir():
        if task.is_dir():
            eval_fpath = task / 'eval_results.txt'
            task_name = task.name
            if eval_fpath.exists():
                with eval_fpath.open() as f:
                    for line in f:
                        metric, value = line.split('=')
                        metric = metric.strip()
                        value = float(value.strip())
                        if task_name in task2major:
                            if metric == task2major[task_name]:
                                results[task_name] = value
                        else:
                            results[task_name] = value
    if len(results) > 0:
        # sorted_keys = sorted(list(results.keys()))
        # for key in sorted_keys:
        #     print("%8s" % key, end='')
        # print("%8s" % 'GLUE', end='')
        # print()
        # for key in sorted_keys:
        #     print("%8.2f" % (results[key] * 100.), end='')
        # print("%8.2f" % (sum(results.values()) * 100. / len(results)), end='')
        # print()
        for task in all_tasks:
            print("%8s" % task, end='')
        print("%8s" % 'GLUE', end='')
        print()
        for task in all_tasks:
            if task in results:
                result = results[task]
                print("%8.2f" % (result * 100), end='')
            else:
                print(" " * 8, end='')
        mean = lambda x: sum(x) / max(len(x), 1)
        avg_result = mean([value for key, value in results.items() if key in all_tasks])
        print("%8.2f" % (avg_result * 100.), end='')
        print()


def search(path):
    def sorted_key(path):
        try:
            return path.stat().st_mtime
        except Exception:
            return 0.
    path_list = sorted(
        path.iterdir(),
        key=sorted_key
        # x.name
    )
    for subdir in path_list:
        if subdir.is_dir():
            if 'glueepoch_' in subdir.name:
                print_result(subdir)
            else:
                search(subdir)

search(root)


================================================
FILE: vokenization/__init__.py
================================================


================================================
FILE: vokenization/common.py
================================================
import os

# Name of image sets
IMAGE_SETS = [
    'coco_train',
    'coco_nominival',
    'coco_minival',
    'vg_nococo',
    'cc_train',
    'cc_valid',
]

# Root of each dataset
# CC_ROOT, COCO_ROOT, VG_ROOT should contain the `images` folder
# CC_ROOT -- images
#              |-- training
#              |-- training_00009486    # JPEG files without the .jpg extension.
#                      |-- ....
#              |-- validation
#                      |-- validation_00009486
#                      |-- ...
# CC_ROOT = os.getenv('CC_ROOT', 'data/cc')
# COCO_ROOT = os.getenv('COCO_ROOT', 'data/mscoco')
# VG_ROOT = os.getenv('VG_ROOT', 'data/vg')
# LXRT_ROOT = os.getenv('LXRT_ROOT', 'data/lxrt')
CC_ROOT = 'data/cc'
COCO_ROOT = 'data/mscoco'
VG_ROOT = 'data/vg'
LXRT_ROOT = 'data/lxmert'

# The local directory to save essential image info
#       (e.g., image ids for the vokenizer, image paths on this server)
# LOCAL_DIR
#   |- images
#         |- coco_train.ids
#         |- coco_train_paths.txt
#         |- cc_train.ids
#         |- cc_train_paths.txt
#         |- ..............
# Running create_image_ids.py will build the *.ids files
# Running extract_vision_keys.py will build *_paths.txt
LOCAL_DIR = 'data/vokenization'
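

# --- Illustrative helper (a sketch added for clarity; not part of the repo) ---
# It shows how scripts consume the layout above: the ordered image ids of an
# image set live at LOCAL_DIR/images/<img_set>.ids, one id per line.
def read_image_ids(img_set):
    """Hypothetical example: return the ordered image ids for `img_set`."""
    ids_path = os.path.join(LOCAL_DIR, 'images', img_set + '.ids')
    with open(ids_path) as f:
        return [line.strip() for line in f]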



================================================
FILE: vokenization/create_image_ids.py
================================================
import json
import os
from pathlib import Path
import sys

# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import common

imgset2lxrtfname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
}

imgset2ccfname = {
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv'
}


def write_ids(img_set, img_ids):
    """
    Write the indexed image ids 'img_ids' for image set 'img_set' to
    the local file.
    """
    info_dir = os.path.join(common.LOCAL_DIR, 'images')
    os.makedirs(info_dir, exist_ok=True)
    print("Write %d image ids for image set %s to %s." % (
        len(img_ids), img_set, os.path.join(info_dir, img_set + '.ids')))
    ids_path = os.path.join(info_dir, img_set + '.ids')
    if os.path.exists(ids_path):
        # If there is an existing ids_path, make sure that they are the same.
        print(f"Already exist the image ids for image set {img_set} at path {ids_path}.")
        print("Now, we want to make sure that they are equal:")
        with open(ids_path, 'r') as f:
            exist_img_ids = list(map(lambda x: x.strip(), f.readlines()))
        success = True
        for i, (exist_img_id, img_id) in enumerate(zip(exist_img_ids, img_ids)):
            if exist_img_id != img_id:
                print(f"The image id at line {i} is different:")
                print(f"\tIn the file: {exist_img_id}, In this script: {img_id}")
                success = False
        if success:
            print("PASS!")
    else:
        with open(ids_path, 'w') as f:
            for img_id in img_ids:
                f.write(img_id + '\n')


for img_set in common.IMAGE_SETS:
    if img_set in imgset2lxrtfname:
        lxrt_path = Path(common.LXRT_ROOT)
        img_ids = []
        fname = imgset2lxrtfname[img_set]
        for datum in json.load((lxrt_path / fname).open()):
            img_id = datum['img_id']
            img_ids.append(img_id)

        write_ids(img_set, img_ids)

    if img_set in imgset2ccfname:
        cc_path = Path(common.CC_ROOT)
        img_ids = []
        fname = imgset2ccfname[img_set]
        if not (cc_path / fname).exists():
            print("No such file", cc_path / fname)
            continue
        for i, line in enumerate((cc_path / fname).open()):
            sent, img_id = line.split('\t')
            img_ids.append(img_id.strip())

        write_ids(img_set, img_ids)


================================================
FILE: vokenization/evaluate_diversity.py
================================================
import argparse
from collections import defaultdict
import json
import os
import sys

import numpy as np
import tqdm

from vokenization import Vokenizer, load_model_and_tokenizer
import common

imgset2fname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}

tokenizer_name = 'bert-base-uncased'


def load_lang_data(corpus_name, topk=10000):
    """
    Load {topk} sentences from the corpus named by {corpus_name}.
    """
    fpath = corpus_name + '.' + tokenizer_name
    tokens = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            tokens.append(list(map(int, line.split(' '))))
            if (i + 1) == topk:
                break
    print("Read %d sentences from the corpus %s located at %s." % (
        len(tokens), corpus_name, fpath
    ))
    return tokens


def load_cc_data(img_set):
    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
    sents = []
    with open(fname) as f:
        for line in f:
            sent, _ = line.split('\t')
            sents.append(sent)
    print("Load the %d sentences for image set %s from %s" % (
        len(sents), img_set, fname))
    return sents


def load_lxrt_data(img_set):
    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
    sents = []
    with open(fname) as f:
        data = json.load(f)
        for datum in data:
            sents.extend(datum['sentf']['mscoco'])
    print("Load the %d sentences for image set %s from %s" % (
        len(sents), img_set, fname))
    return sents


def analyze(token2info):
    """
    :param token2info: token2info: token --> (img_id --> cnt)
    :return:
    """
    names = ['Num Images', 'Max Cnt', 'Avg Cnt', 'Std Cnt']
    results = np.zeros(4)
    num_tokens = 0
    for token in token2info:
        img2cnt = token2info[token]
        cnts = np.array(list(img2cnt.values()))
        num_imgs = len(cnts)
        max_cnt = cnts.max()
        avg_cnt = cnts.mean()
        std_cnt = cnts.std()
        results += (num_imgs, max_cnt, avg_cnt, std_cnt)
        num_tokens += 1
    print("With %d tokens, " % num_tokens)
    results /= num_tokens
    for name, result in zip(names, results):
        print("Average of %s is %0.2f" % (name, result))

    corpus_info = defaultdict(lambda: 0)
    for info in token2info.values():
        for img, cnt in info.items():
            corpus_info[img] += cnt
    print("Cover %d images" % len(corpus_info))

# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory that saved the model (containing '
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
                    help='The splits of images to be extracted')
parser.add_argument('--corpus', type=str, default='wiki103',
                    help='Evaluated corpus')
parser.add_argument('--maxsents', type=int, default=10000,
                    help='The maximum sentences to be evaluated in the corpus')
args = parser.parse_args()

keys_path = os.path.join(args.load, 'keys')

print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')
vokenizer = Vokenizer(model, tokenizer, keys_path, img_sets)

corpus_list = args.corpus.split(',')
for corpus in corpus_list:
    corpus = corpus.strip()
    print("\nProcessing corpus %s for diversity test:" % corpus)
    # token2info: token --> (img_id --> cnt)
    token2info = defaultdict(lambda: defaultdict(lambda: 0))

    if corpus in imgset2fname:
        if 'cc' in corpus:
            sents = load_cc_data(corpus)
        else:
            sents = load_lxrt_data(corpus)
        batch_size = 32
        for start_id in tqdm.tqdm(range(0, len(sents), batch_size)):
            batch_sents = sents[start_id: start_id + batch_size]
            scores, ids, tokens, paths = vokenizer.vokenize_sents(batch_sents, topk=None)
            for i in range(len(paths)):
                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
                    token2info[token][path] += 1
    else:
        tokens_list = load_lang_data(corpus, args.maxsents)
        batch_size = 16
        for start_id in tqdm.tqdm(range(0, len(tokens_list), batch_size)):
            batch_tokens = tokens_list[start_id: start_id + batch_size]
            scores, ids, tokens, paths = vokenizer.vokenize_ids(batch_tokens, topk=None)
            for i in range(len(paths)):
                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):
                    token2info[token][path] += 1

    analyze(token2info)






================================================
FILE: vokenization/evaluate_retrieval.py
================================================
import argparse
from collections import defaultdict
import json
import os

import tqdm

from vokenization import Vokenizer, load_model_and_tokenizer
import common

imgset2fname = {
    'coco_train': 'mscoco_train.json',
    'coco_nominival': 'mscoco_nominival.json',
    'coco_minival': 'mscoco_minival.json',
    'vg_nococo': 'vgnococo.json',
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}


def load_cc_data(img_set):
    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])
    sentXimgname = []
    with open(fname) as f:
        for line in f:
            sent, gt_img_name = line.split('\t')
            gt_img_name = gt_img_name.strip()
            sentXimgname.append((sent, gt_img_name))
    print("Load the %d (img, sent) pairs for image set %s from %s" % (
        len(sentXimgname), img_set, fname))
    return sentXimgname


def load_lxrt_data(img_set):
    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])
    sentXimgname = []
    with open(fname) as f:
        data = json.load(f)
        for datum in data:
            gt_img_name = datum['img_id'] + '.jpg'
            sents = datum['sentf']['mscoco']
            for sent in sents:
                sentXimgname.append((sent, gt_img_name))
    print("Load the %d (img, sent) pairs for image set %s from %s" % (
        len(sentXimgname), img_set, fname))
    return sentXimgname


# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'
parser = argparse.ArgumentParser()
parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',
                    help='The directory that saved the model (containing '
                         'BEST.pth.model).')
parser.add_argument('--image-sets', type=str, default='coco_minival',
                    help='The splits of images to be extracted')
args = parser.parse_args()

keys_path = os.path.join(args.load, 'keys')

print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets))
model, tokenizer = load_model_and_tokenizer(args.load)
img_sets = args.image_sets.split(',')

sent_level = 'sent' in args.load

for img_set in img_sets:
    vokenizer = Vokenizer(model, tokenizer, keys_path, [img_set],
                          sent_level=sent_level)
    if 'cc' in img_set:
        sentXimgname = load_cc_data(img_set)
    else:
        sentXimgname = load_lxrt_data(img_set)

    topks = [1, 5, 10]
    print("\nEvaluate image set", img_set, "for topk retrieval:", topks)
    total = 0
    arg_topk = None if max(topks) == 1 else max(topks)
    results = defaultdict(lambda: 0)
    batch_size = 32
    for start_id in tqdm.tqdm(range(0, len(sentXimgname), batch_size)):
        batch_sentXimg = sentXimgname[start_id: start_id + batch_size]
        sents, gt_img_names = zip(*batch_sentXimg)
        sents = list(sents)

        scores, ids, tokens, paths_list = vokenizer.vokenize_sents(sents, topk=arg_topk)
        if sent_level:
            paths_list = [x[:3] for x in paths_list]     # Only evaluate the first vokens.
        if arg_topk is None:
            paths_list = [[[img_id] for img_id in sent] for sent in paths_list]
        for paths, gt_img_name in zip(paths_list, gt_img_names):                # for each sent in batch
            for topk_paths in paths[1:-1]:      # for each token in sent
                for k, kth_path in enumerate(topk_paths):     # for each img_path in topk image paths of a token
                    img_name = os.path.split(kth_path)[-1]
                    if img_name == gt_img_name:
                        results[k + 1] += 1
        total += sum(map(lambda x: len(x) - 2, paths_list))

    accumulate = 0
    for i in range(1, max(topks)+1):
        accumulate += results[i]
        if i in topks:
            print("R%d: %0.2f%%, (Random: %0.4f%%)" % (
                i,
                accumulate / total * 100.,
                i / vokenizer.img_num * 100.
            ))

    del vokenizer
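
    # Worked example of the bookkeeping above (hypothetical counts): with
    # results == {1: 50, 2: 10, 5: 4} and total == 100 token queries,
    # R@1 = 50.00% and R@5 = (50 + 10 + 0 + 0 + 4) / 100 = 64.00%; the
    # random baseline for R@k over N candidate images is k / N.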






================================================
FILE: vokenization/extract_vision_keys.py
================================================
# In this file, we extract the vision features as the keys in retrieval.
import argparse
import os
import pickle
import shutil
import sys

import h5py
import torch
from torchvision import transforms
from torchvision.datasets.folder import default_loader
import tqdm
from transformers import BertTokenizer
from PIL import Image

import common

# Load all images
Image.MAX_IMAGE_PIXELS = None


def get_img_path(img_set, img_id):
    """
    Get the paths regarding the img_set and img_id.
    THIS FUNCTION MIGHT NEED TO BE MODIFIED.
    """
    source, tag = img_set.split('_')
    if source == 'cc':
        split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (common.CC_ROOT, split_tag, img_id)
    elif 'COCO' in img_id:
        _, split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (common.COCO_ROOT, split_tag, img_id + '.jpg')
    else:   # VG images
        return "%s/images/%s.jpg" % (common.VG_ROOT, img_id)


def get_img_paths_and_ids(img_set):
    """
    Return a list of images paths and image ids in this 'img_set'.
    """

    # Load the image ids from the common local dir,
    # thus making sure that the order of the images is the same.
    info_dir = os.path.join(common.LOCAL_DIR, 'images')
    img_paths = []
    with open(os.path.join(info_dir, img_set + '.ids')) as f:
        img_ids = list(map(lambda x: x.strip(), f.readlines()))
    for img_id in img_ids:
        img_paths.append(get_img_path(img_set, img_id))
    return img_paths, img_ids


def save_img_paths_and_ids(img_set, img_paths, img_ids, output):
    info_dir = os.path.join(common.LOCAL_DIR, 'images')

    # Save
================================================
SYMBOL INDEX (190 symbols across 27 files)
================================================

FILE: data/wiki/tools/remove_accent.py
  function convert_to_unicode (line 13) | def convert_to_unicode(text):
  function run_strip_accents (line 29) | def run_strip_accents(text):

FILE: tokenization/to_hdf5.py
  function validate_hdf5 (line 8) | def validate_hdf5(fname, tokenizer_name):
  function to_hdf5 (line 63) | def to_hdf5(fname, tokenizer_name, validate=True):

FILE: tokenization/tokenize_dataset.py
  function tokenize_dataset (line 12) | def tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=Fa...

FILE: vlm/data.py
  class CoLDataset (line 11) | class CoLDataset(Dataset):
    method __init__ (line 15) | def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512,
    method voken_size (line 83) | def voken_size(self):
    method voken_ids (line 87) | def voken_ids(self):
    method assert_equal_vokens (line 90) | def assert_equal_vokens(self, dataset):
    method __len__ (line 95) | def __len__(self):
    method __getitem__ (line 98) | def __getitem__(self, item):
    method maybe_do_sent_level (line 122) | def maybe_do_sent_level(self, vokens):
    method maybe_do_ablation_study (line 138) | def maybe_do_ablation_study(self, vokens, tokens):
    method get_item_info (line 157) | def get_item_info(self, item):
    method __del__ (line 162) | def __del__(self):
  function intersect (line 174) | def intersect(x, y):
  function manual_filter (line 184) | def manual_filter(batches):
  function block_check (line 192) | def block_check(batches, block_size, fixed_size=False, manual_filtered=F...
  function get_voken_feats (line 211) | def get_voken_feats(dataset: CoLDataset, feat_dir: str):

FILE: vlm/model.py
  function _gelu_python (line 20) | def _gelu_python(x):
  class CoLBertConfig (line 35) | class CoLBertConfig(BertConfig):
    method __init__ (line 36) | def __init__(self, *args, **kwargs):
  class BertSharedHead (line 47) | class BertSharedHead(BertOnlyMLMHead):
    method __init__ (line 50) | def __init__(self, config):
    method forward (line 62) | def forward(self, features, **kwargs):
  class BertVLMClassificationHead (line 84) | class BertVLMClassificationHead(nn.Module):
    method __init__ (line 87) | def __init__(self, config):
    method forward (line 100) | def forward(self, features, **kwargs):
  class BertVLMContrastiveHeadNew (line 109) | class BertVLMContrastiveHeadNew(nn.Module):
    method __init__ (line 112) | def __init__(self, config):
    method forward (line 123) | def forward(self, bert_output, voken_feats, **kwargs):
  class BertVLMContrastiveHead (line 139) | class BertVLMContrastiveHead(nn.Module):
    method __init__ (line 142) | def __init__(self, config):
    method forward (line 152) | def forward(self, bert_output, voken_feats, **kwargs):
  class BertVLMRegressionHead (line 168) | class BertVLMRegressionHead(nn.Module):
    method __init__ (line 171) | def __init__(self, config):
    method forward (line 178) | def forward(self, features, **kwargs):
  class CoLwithBert (line 189) | class CoLwithBert(BertForMaskedLM):
    method __init__ (line 192) | def __init__(self, config):
    method init_voken_feat_emb (line 250) | def init_voken_feat_emb(self, feats):
    method to (line 269) | def to(self, *args):
    method forward (line 274) | def forward(
    method calculate_shared_loss (line 358) | def calculate_shared_loss(self, sequence_output, masked_lm_labels, vok...
  class SimpleBertForMaskedLM (line 386) | class SimpleBertForMaskedLM(BertForMaskedLM):
    method __init__ (line 388) | def __init__(self, config):
    method forward (line 391) | def forward(

FILE: vlm/param.py
  function process_args (line 4) | def process_args():

FILE: vlm/run_glue.py
  function set_seed (line 65) | def set_seed(args):
  function train (line 73) | def train(args, train_dataset, model, tokenizer):
  function evaluate (line 264) | def evaluate(args, model, tokenizer, prefix=""):
  function load_and_cache_examples (line 334) | def load_and_cache_examples(args, task, tokenizer, evaluate=False):
  function main (line 397) | def main():

FILE: vlm/run_glue_epochs.py
  function get_snap_paths (line 43) | def get_snap_paths(load):
  function sorted_paths (line 52) | def sorted_paths(paths):
  function get_test_paths (line 69) | def get_test_paths(paths, snaps):
  function run_glue (line 92) | def run_glue():

FILE: vlm/run_lm_distributed.py
  class TextDataset (line 96) | class TextDataset(Dataset):
    method __init__ (line 97) | def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: st...
    method __len__ (line 130) | def __len__(self):
    method __getitem__ (line 133) | def __getitem__(self, item):
  class LineByLineTextDataset (line 137) | class LineByLineTextDataset(Dataset):
    method __init__ (line 138) | def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: st...
    method __len__ (line 150) | def __len__(self):
    method __getitem__ (line 153) | def __getitem__(self, i):
  function load_and_cache_examples (line 157) | def load_and_cache_examples(args, tokenizer, evaluate=False):
  function set_seed (line 169) | def set_seed(args):
  function mask_tokens (line 175) | def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, ar...
  function train (line 209) | def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTra...
  function save_model (line 425) | def save_model(args, name, model, tokenizer, optimizer, scheduler):
  function evaluate (line 443) | def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenize...
  function is_port_in_use (line 491) | def is_port_in_use(port):
  function main (line 497) | def main():
  function setup (line 513) | def setup(gpu, args):

FILE: vlm/run_vlm_distributed.py
  function load_and_cache_examples (line 93) | def load_and_cache_examples(args, tokenizer, evaluate=False):
  function set_seed (line 102) | def set_seed(args):
  function mask_tokens (line 108) | def mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: P...
  function train (line 153) | def train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset,
  function save_model (line 449) | def save_model(args, name, model, tokenizer, optimizer, scheduler):
  function evaluate (line 467) | def evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tok...
  function is_port_in_use (line 531) | def is_port_in_use(port):
  function main (line 537) | def main():
  function setup (line 553) | def setup(gpu, args):

FILE: vlm/show_glue_results_epochs.py
  function print_result (line 29) | def print_result(glue_dir):
  function search (line 73) | def search(path):

FILE: vokenization/create_image_ids.py
  function write_ids (line 22) | def write_ids(img_set, img_ids):

FILE: vokenization/evaluate_diversity.py
  function load_lang_data (line 25) | def load_lang_data(corpus_name, topk=10000):
  function load_cc_data (line 42) | def load_cc_data(img_set):
  function load_lxrt_data (line 54) | def load_lxrt_data(img_set):
  function analyze (line 66) | def analyze(token2info):

FILE: vokenization/evaluate_retrieval.py
  function load_cc_data (line 21) | def load_cc_data(img_set):
  function load_lxrt_data (line 34) | def load_lxrt_data(img_set):

FILE: vokenization/extract_vision_keys.py
  function get_img_path (line 22) | def get_img_path(img_set, img_id):
  function get_img_paths_and_ids (line 38) | def get_img_paths_and_ids(img_set):
  function save_img_paths_and_ids (line 54) | def save_img_paths_and_ids(img_set, img_paths, img_ids, output):
  function extract_vision_feature_keys (line 82) | def extract_vision_feature_keys(model, img_transform, img_sets, output, ...
  function get_visn_arch (line 160) | def get_visn_arch(arch):
  class VisnModel (line 172) | class VisnModel(nn.Module):
    method __init__ (line 173) | def __init__(self, arch='resnet50', pretrained=True):
    method forward (line 188) | def forward(self, img):
  function img_transform_func (line 234) | def img_transform_func(img):

FILE: vokenization/indexing.py
  class GPUIndexer (line 6) | class GPUIndexer(object):
    method __init__ (line 7) | def __init__(self, keys, gpus=(0,), fp16=False):
    method topk (line 14) | def topk(self, query, topk: int = 1):
    method batch_topk (line 17) | def batch_topk(self, query, topk: int = 1):
    method batch_top1 (line 20) | def batch_top1(self, query):
  class TorchGPUIndexer (line 24) | class TorchGPUIndexer(GPUIndexer):
    method __init__ (line 25) | def __init__(self, keys, gpus=(0,), fp16=False):
    method topk (line 33) | def topk(self, query, topk: int = 1):
    method batch_topk (line 43) | def batch_topk(self, query, topk: int = 1):
    method batch_top1 (line 53) | def batch_top1(self, query):
    method batch_top1_l2 (line 63) | def batch_top1_l2(self, query):
  class FaissGPUIndexer (line 76) | class FaissGPUIndexer(GPUIndexer):
    method __init__ (line 77) | def __init__(self, keys, gpus=(0,), fp16=False):
    method batch_topk (line 92) | def batch_topk(self, query, topk: int = 1):
    method batch_top1 (line 102) | def batch_top1(self, query):

FILE: vokenization/revokenization.py
  class ReVokenizer (line 6) | class ReVokenizer:
    method __init__ (line 10) | def __init__(self, forward_tokenizer_name, backward_tokenizer_name, vo...
    method vokenize_sent (line 23) | def vokenize_sent(self, sents, topk=None):
    method vokenize_ids (line 26) | def vokenize_ids(self, input_ids, topk=None, verbose=False):
    method show_alignments (line 50) | def show_alignments(self, sents, forward_inputs, backward_inputs, alig...
    method show_input (line 75) | def show_input(self, sents, forward_inputs, backward_inputs, input_ids):
    method backward_decode (line 94) | def backward_decode(self, input_id):
    method process (line 103) | def process(self, input_ids):
    method _safe_guard (line 143) | def _safe_guard(inputs):
    method _remove_special_tokens (line 150) | def _remove_special_tokens(inputs):
    method _fix_nouns (line 158) | def _fix_nouns(backward_input):
    method _fix_length (line 168) | def _fix_length(backward_input, input_ids):
    method _calibrate_backward_offset (line 185) | def _calibrate_backward_offset(self, backward_input):
    method prepare_for_unicode (line 215) | def prepare_for_unicode(self):
    method show (line 242) | def show(self, ids_list):
    method batch_map_back (line 248) | def batch_map_back(results, alignments):
    method batch_calculate_alignment (line 263) | def batch_calculate_alignment(batch_forward_offsets, batch_backward_of...
  function IoU (line 289) | def IoU(a, b):

FILE: vokenization/revokenize_corpus_mp.py
  function processer (line 30) | def processer(args, input_queue, output_queue):
  function reducer (line 88) | def reducer(output_fname, output_queue, total_tokens):
  function setup_mp (line 141) | def setup_mp(args, tokens, sent_ranges, vokens_path):
  function segment_sent (line 184) | def segment_sent(

FILE: vokenization/vokenization.py
  class Vokenizer (line 22) | class Vokenizer:
    method __init__ (line 23) | def __init__(self, model, tokenizer, keys_dir, img_sets=('coco_minival...
    method img_num (line 82) | def img_num(self):
    method dump_img_ids (line 85) | def dump_img_ids(self, fname):
    method __len__ (line 94) | def __len__(self):
    method indexing (line 97) | def indexing(self):
    method vokenize_sents (line 134) | def vokenize_sents(self, sents, topk=None):
    method vokenize_ids (line 145) | def vokenize_ids(self, input_ids, attention_mask=None, topk=None):
  function memory_safe_apply (line 300) | def memory_safe_apply(func, *args):
  function load_model_and_tokenizer (line 329) | def load_model_and_tokenizer(load, cpu=False):

FILE: vokenization/vokenize_corpus_mp.py
  function processer (line 28) | def processer(args, input_queue, output_queue):
  function reducer (line 86) | def reducer(output_fname, output_queue, total_tokens):
  function setup_mp (line 139) | def setup_mp(args, tokens, sent_ranges, vokens_path):
  function segment_sent (line 182) | def segment_sent(

FILE: xmatching/data.py
  function make_uid (line 38) | def make_uid(img_id, source, sent_id):
  function get_img_path (line 45) | def get_img_path(source, img_id):
  function make_datum (line 56) | def make_datum(source: str, img_id: str, sent_id: int, sent: str):
  class ImgSentDataset (line 75) | class ImgSentDataset:
    method __init__ (line 76) | def __init__(self, img_splits: str, lang_splits: str, tiny=False, fast...
    method __len__ (line 129) | def __len__(self):
    method __getitem__ (line 132) | def __getitem__(self, item):
    method shuffle (line 135) | def shuffle(self):
  class ImgSentTorchDataset (line 140) | class ImgSentTorchDataset(Dataset):
    method __init__ (line 141) | def __init__(self,
    method __len__ (line 152) | def __len__(self):
    method __getitem__ (line 155) | def __getitem__(self, item: int):

FILE: xmatching/frozen_batch_norm.py
  class FrozenBatchNorm2d (line 11) | class FrozenBatchNorm2d(nn.Module):
    method __init__ (line 29) | def __init__(self, num_features, eps=1e-5):
    method forward (line 38) | def forward(self, x):
    method __repr__ (line 60) | def __repr__(self):
    method convert_frozen_batchnorm (line 64) | def convert_frozen_batchnorm(cls, module):

FILE: xmatching/loss.py
  function hinge (line 4) | def hinge(x):
  function paired_hinge_rank_loss (line 8) | def paired_hinge_rank_loss(
  function batchwise_hinge_rank_loss (line 53) | def batchwise_hinge_rank_loss(

FILE: xmatching/main.py
  function is_port_in_use (line 22) | def is_port_in_use(port):
  function main (line 28) | def main():
  function train (line 46) | def train(gpu, args):
  function valid (line 285) | def valid(args, model, criterion, valid_loader, use_tqdm=True):

FILE: xmatching/metric.py
  function batchwise_accuracy (line 4) | def batchwise_accuracy(lang_output, visn_output, lang_mask):
  function batchwise_recall (line 37) | def batchwise_recall(lang_output, visn_output, lang_mask, recalls=(1,)):

FILE: xmatching/model.py
  function get_visn_arch (line 24) | def get_visn_arch(arch):
  class VisnModel (line 32) | class VisnModel(nn.Module):
    method __init__ (line 33) | def __init__(self, dim, arch='resnet50', pretrained=True, finetuning=F...
    method forward (line 76) | def forward(self, img):
  class LangModel (line 92) | class LangModel(nn.Module):
    method __init__ (line 93) | def __init__(self, dim, arch='BERT', layers=(-1,), pretrained=True, fi...
    method forward (line 132) | def forward(self, input_ids, attention_mask, token_type_ids=None):
  class JointModel (line 175) | class JointModel(nn.Module):
    method __init__ (line 176) | def __init__(self, lang_model, visn_model):
    method forward (line 181) | def forward(self, lang_input, visn_input):

FILE: xmatching/param.py
  function get_optimizer (line 12) | def get_optimizer(optim):
  function parse_args (line 34) | def parse_args():
    "chars": 13460,
    "preview": "# Copyleft 2020 project COL.\n\nfrom transformers import AutoTokenizer\n\n\nclass ReVokenizer:\n    \"\"\"\n    Convert a\n    \"\"\"\n"
  },
  {
    "path": "vokenization/revokenize_corpus_mp.py",
    "chars": 13486,
    "preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimpo"
  },
  {
    "path": "vokenization/vokenization.py",
    "chars": 15438,
    "preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nfrom collections import defaultdict\nimport math\nimport pickle\nimport os\nimp"
  },
  {
    "path": "vokenization/vokenize_corpus_mp.py",
    "chars": 13348,
    "preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimpo"
  },
  {
    "path": "xmatching/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "xmatching/data.py",
    "chars": 6069,
    "preview": "# coding=utf-8\nimport json\nfrom pathlib import Path\nimport random\n\nfrom torch.utils.data import Dataset\nfrom torchvision"
  },
  {
    "path": "xmatching/frozen_batch_norm.py",
    "chars": 3880,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\n# Note: This file is copied from https://github."
  },
  {
    "path": "xmatching/loss.py",
    "chars": 4453,
    "preview": "import torch\n\n\ndef hinge(x):\n    return torch.clamp(x, min=0.)\n\n\ndef paired_hinge_rank_loss(\n        lang_output: torch."
  },
  {
    "path": "xmatching/main.py",
    "chars": 12349,
    "preview": "import collections\nimport os\nimport pickle\nimport sys\n\nimport torch\nimport torch.multiprocessing as mp\nimport torchvisio"
  },
  {
    "path": "xmatching/metric.py",
    "chars": 3289,
    "preview": "import torch\n\n\ndef batchwise_accuracy(lang_output, visn_output, lang_mask):\n    \"\"\"\n    Calculate the accuracy of contex"
  },
  {
    "path": "xmatching/model.py",
    "chars": 6639,
    "preview": "import torch\nfrom torch import nn\nimport torchvision.models as models\nfrom transformers import *\n\nfrom .frozen_batch_nor"
  },
  {
    "path": "xmatching/param.py",
    "chars": 4417,
    "preview": "# coding=utf-8\n# Copyleft 2020 project COL.\n# Copyleft 2019 project LXRT.\n\nimport argparse\nimport random\n\nimport numpy a"
  }
]

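The listing above is a machine-readable manifest: a JSON array in which each record carries a file's "path", its size in characters ("chars"), and a short "preview" of its contents. As a minimal sketch of how to consume it programmatically, assuming the array was saved to manifest.json (a hypothetical filename), the following Python re-derives the summary figures reported below:

import json
from collections import Counter

# Load the per-file records ({"path", "chars", "preview"}).
with open("manifest.json", encoding="utf-8") as f:
    entries = json.load(f)

# Re-derive the headline numbers: file count and total source size.
total_chars = sum(e["chars"] for e in entries)
print(f"{len(entries)} files, {total_chars / 1024:.1f} KB")

# Group files by top-level directory to get a feel for the repo layout.
by_topdir = Counter(e["path"].split("/")[0] for e in entries)
for topdir, count in by_topdir.most_common():
    print(f"  {topdir}: {count} file(s)")
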
About this extraction

This page contains the full source code of the airsplay/vokenization GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 70 files (295.5 KB), approximately 75.4k tokens, and a symbol index with 190 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
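Because model context windows are finite, it can help to verify the token count before pasting the whole extraction into an LLM. A minimal sketch, assuming the text was downloaded as vokenization.txt (a hypothetical filename) and using the bert-base-uncased tokenizer from transformers, which this repository itself depends on; the ~75.4k figure above was presumably produced by the extraction tool's own tokenizer, so exact counts will differ:

from transformers import AutoTokenizer

# Any tokenizer gives a usable estimate; BERT's is chosen here only
# because the repository already lists transformers as a dependency.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with open("vokenization.txt", encoding="utf-8") as f:
    text = f.read()

# add_special_tokens=False avoids counting [CLS]/[SEP] markers.
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(f"approximately {len(token_ids):,} tokens")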

Extracted by GitExtract, a free GitHub-repository-to-text converter for AI, built by Nikandr Surkov.
