Repository: airsplay/vokenization
Branch: master
Commit: 5601b799184e
Files: 70
Total size: 295.5 KB

Directory structure:
gitextract_6h_zq3l4/
├── LICENSE
├── README.md
├── data/
│   ├── lxmert/
│   │   └── .gitignore
│   ├── mscoco/
│   │   └── .gitignore
│   ├── vg/
│   │   └── .gitignore
│   ├── wiki/
│   │   ├── get_data_cased.bash
│   │   ├── get_data_cased_untokenized.bash
│   │   ├── install-tools.sh
│   │   └── tools/
│   │       ├── remove_accent.py
│   │       ├── segment_th.py
│   │       └── tokenize.sh
│   └── wiki103/
│       ├── get_data_cased.sh
│       └── get_data_uncased.sh
├── requirements.txt
├── scripts/
│   ├── base_vlm_wiki.bash
│   ├── base_vlm_wiki_glue.bash
│   ├── base_wiki.bash
│   ├── base_wiki_glue.bash
│   ├── extract_keys.bash
│   ├── mpvokenize_wiki.bash
│   ├── mpvokenize_wiki103.bash
│   ├── run_glue_at_epoch.bash
│   ├── run_glue_epochs.bash
│   ├── run_xmatching.bash
│   ├── small_vlm_wiki103.bash
│   ├── small_vlm_wiki103_glue.bash
│   ├── small_wiki103.bash
│   ├── small_wiki103_glue.bash
│   └── xmatching_benchmark.bash
├── snap/
│   ├── bert/
│   │   └── .gitkeep
│   ├── vlm/
│   │   └── .gitkeep
│   └── xmatching/
│       └── .gitkeep
├── tokenization/
│   ├── to_hdf5.py
│   ├── tokenize_dataset.py
│   ├── tokenize_wiki103_bert.bash
│   ├── tokenize_wiki103_roberta.bash
│   ├── tokenize_wiki_bert.bash
│   └── tokenize_wiki_roberta.bash
├── vlm/
│   ├── __init__.py
│   ├── configs/
│   │   ├── bert-12L-768H.json
│   │   ├── bert-4L-768H.json
│   │   ├── bert-6L-512H.json
│   │   └── bert_base.json
│   ├── data.py
│   ├── model.py
│   ├── param.py
│   ├── run_glue.py
│   ├── run_glue_epochs.py
│   ├── run_lm_distributed.py
│   ├── run_vlm_distributed.py
│   └── show_glue_results_epochs.py
├── vokenization/
│   ├── __init__.py
│   ├── common.py
│   ├── create_image_ids.py
│   ├── evaluate_diversity.py
│   ├── evaluate_retrieval.py
│   ├── extract_vision_keys.py
│   ├── indexing.py
│   ├── revokenization.py
│   ├── revokenize_corpus_mp.py
│   ├── vokenization.py
│   └── vokenize_corpus_mp.py
└── xmatching/
    ├── __init__.py
    ├── data.py
    ├── frozen_batch_norm.py
    ├── loss.py
    ├── main.py
    ├── metric.py
    ├── model.py
    └── param.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Hao Tan

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
================================================
FILE: README.md
================================================
# Vokenization

PyTorch code for the EMNLP 2020 paper "[Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision](https://arxiv.org/pdf/2010.06775.pdf)" (Hao Tan and Mohit Bansal).

**Outline**

* [Contextualized Cross-Modal Matching](#contextualized-cross-modal-matching-xmatching)
    * [Downloading Image and Captioning Data](#download-image-and-captioning-data)
    * [Model Training](#training-the-cross-modal-matching-model)
    * [Benchmark (Optional)](#benchmarking-cross-modal-matching-models-optional)
* [Vokenization](#vokenization-vokenization)
    * [Downloading Pure-Language Data](#downloading-and-pre-processing-pure-language-data)
    * [Extracting Visual Feature](#extracting-image-features)
    * [Vokenization Process](#the-vokenization-process)
* [Visually-Supervised Language Model](#visually-supervised-language-model-vlm)
    * [VLM Pre-training](#pre-training-with-vlm)
    * [GLUE Evaluation](#glue-evaluation)
    * [MLM Pre-training (as baselines)](#bert-as-baselines)

> Note: I recommend focusing on "Wiki103" first and
> ignoring the code blocks related to "English Wikipedia".
> "Eng Wiki" might take too long to complete.

## Installation

```shell script
pip install -r requirements.txt
```

Requires Python 3.6+ (to support huggingface [transformers](https://github.com/huggingface/transformers)).

## Contextualized Cross-Modal Matching (xmatching)

In this [module](xmatching) (corresponding to Sec 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)), we want to learn a token-image matching model from sentence-image aligned data (i.e., image captioning data). The model "contextually" measures the relevance between tokens (i.e., words) and images. The term "contextual" emphasizes that the sentences (the context) are considered when measuring the token-image relevance score.

### Download Image and Captioning Data

1. Download MS COCO images:
    ```shell script
    # MS COCO (Train 13G, Valid 6G)
    mkdir -p data/mscoco
    wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
    wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
    unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
    unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
    ```
    If you already have the COCO images on disk, save them as
    ```
    data
     |-- mscoco
         |-- images
             |-- train2014
                 |-- COCO_train2014_000000000009.jpg
                 |-- COCO_train2014_000000000025.jpg
                 |-- ......
             |-- val2014
                 |-- COCO_val2014_000000000042.jpg
                 |-- ......
    ```
2. Download captions (split following the LXMERT project):
    ```shell script
    mkdir -p data/lxmert
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
    ```

### Training the Cross-Modal Matching Model

The model is trained on MS COCO with a pairwise hinge loss (details in Sec. 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)); a minimal sketch of this loss is given below.
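The sketch assumes token and image embeddings already projected into the joint space and L2-normalized; the names `lang_emb` / `visn_emb` and the simple in-batch negative sampling here are illustrative, and the actual implementation lives in [xmatching/loss.py](xmatching/loss.py).

```python
# Illustrative sketch of a token-image pairwise hinge loss (cf. Sec. 3.2).
import torch

def paired_hinge_loss(lang_emb, visn_emb, margin=0.5):
    """
    lang_emb: (batch, length, dim) contextual token embeddings.
    visn_emb: (batch, dim) image embeddings.
    """
    # Relevance of each token to its paired ("positive") image.
    pos_score = (lang_emb * visn_emb.unsqueeze(1)).sum(-1)       # (batch, length)
    # A simple in-batch negative: pair each sentence with the next image.
    neg_visn_emb = torch.roll(visn_emb, shifts=1, dims=0)
    neg_score = (lang_emb * neg_visn_emb.unsqueeze(1)).sum(-1)   # (batch, length)
    # Hinge: positive scores should exceed negative ones by at least `margin`.
    return torch.clamp(margin + neg_score - pos_score, min=0.).mean()
```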
# "bert_resnext" is the name of this snapshot and would be saved at snap/xmatching/bert_resnext # "--visn resnext101_32x8d" is the vision backbone # "--lang bert" is the langaugae backbone # Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default. bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert ``` The options `--visn` and `--lang` specify the architecture of the encoder. Tested options ``` --visn $VISN_MODEL VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152, wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...} --lang $LANG_MODEL LANG_MODEL={bert, roberta, xlnet, bert-large, ...} ``` For visual backbones, the models in [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) are mostly supported. You might need to handle the last FC layer, because it is written differently in different backbones. The language backbones are initialized from huggingface [transformers](https://github.com/huggingface/transformers). > We found that the results with XLNet is pretty low but have not identified > the reason. Results of other backbones are similar. ## Vokenization (vokenization) The vokenization is a bridge between the cross-modality (words-and-image) matching models (xmatching) and visually-supervised lagnauge models (vlm). The final goal is to convert the language tokens to related images (we called them **vokens**). These **vokens** enable the visual supervision of the language model. We mainly provide pr-eprocessing tools (i.e., feature extraction, tokenization, and vokenization) and evaluation tools of previous cross-modal matching models here. Here is a diagram of these processes and we next discuss them one-by-one: ``` Extracting Image Features-----> Benchmakring the Matching Models (Optional) --> Vokenization Downloading Language Data --> Tokenization -->-->--/ ``` ### Downloading and Pre-Processing Pure-Language Data We provide scripts to get the datasets "wiki103" and "wiki". We would note them as "XX-cased" or "XX-uncased" where the suffix "cased" / "uncased" only indicates the property of the raw text. 1. **Wiki103**. The [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset is a seleted subset of English Wikipedia, containing around 100M tokens. ```shell script bash data/wiki103/get_data_cased.sh ``` 2. **English Wikipedia**. The script to download and process wiki data are modified from [XLM](https://github.com/facebookresearch/XLM). It will download a 17G file. The speed depends on the networking and it usually takes several hours to filter the data. The process ends with around 2.8B tokens. ```shell script bash data/wiki/get_data_cased.bash en ``` Note: For *RoBERTa*, it requires an untokenized version of wiki (o.w. the results would be much lower), so please use the following command: ```shell script bash data/wiki/get_data_cased_untokenized.bash en ``` > Note: I recommend to focus on "Wiki103" first and > ingore the code blocks related to "English Wikipedia". > "Eng Wiki" might take too long to complete. ### Tokenization of Language Data We next tokenize the language corpus. It would locally save three files: "$dataset_name.$tokenizer_name", "$dataset_name.$tokenizer_name.hdf5", and "$dataset_name.$tokenizer_name.line". 
Commands:
1. Wiki103 (around 10 min)
    ```shell script
    bash tokenization/tokenize_wiki103_bert.bash
    ```
2. English Wikipedia (around 3 hours)
    ```shell script
    bash tokenization/tokenize_wiki_bert.bash
    ```

### Extracting Image Features

The image pre-processing extracts the image features that build the keys in the vokenization retrieval process.

#### Download the Visual Genome (VG) images

Since the MS COCO images are used to train the cross-modal matching model (see [xmatching](#contextualized-cross-modal-matching-xmatching)), we use the [Visual Genome](https://visualgenome.org/) images as candidate vokens for retrieval. We download the images first.
```shell script
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/
unzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip
unzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../
```
If you already have the Visual Genome images on disk, save them as
```
data
 |-- vg
     |-- images
         |-- 1000.jpg
         |-- 1001.jpg
         |-- ......
```

#### Build Universal Image Ids

We first build a list of universal image indexes with [vokenization/create_image_ids.py](vokenization/create_image_ids.py). It unifies the image ids across different experiments, so that the feature arrays stored in hdf5 can be universally indexed. The image ids are saved under a shared path `LOCAL_DIR` (defaulting to `data/vokenization`) defined in [vokenization/common.py](vokenization/common.py), specifically under `data/vokenization/images` with the format `{IMAGE_SET}_ids.txt`. All experiments agree on this meta info, so we never get different indexings in different retrieval experiments; a sketch of this indexing contract follows the command below.

> Note: The ids created by [create_image_ids.py](vokenization/create_image_ids.py) only fix the order of the images.
> The actual images in the dictionary are provided by `extract_keys.bash` and thus correspond to the
> `_paths.txt` files, because `extract_keys` filters out all broken and non-existing images.

Commands:
```bash
# Step 1, Build image orders.
python vokenization/create_image_ids.py
```
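The contract is simply that row `i` of every per-image array corresponds to line `i` of `{IMAGE_SET}_ids.txt`. A minimal sketch (the hdf5 dataset name `keys` is an assumption; see [vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py) for the actual layout; the feature file itself is produced by Step 2 below):

```python
# Row i of the feature file corresponds to line i of {IMAGE_SET}_ids.txt.
import h5py

image_set = "vg_nococo"
ids = [line.strip() for line in open(f"data/vokenization/images/{image_set}_ids.txt")]
keys = h5py.File(f"snap/xmatching/bert_resnext/keys/{image_set}.hdf5", "r")["keys"]

img_id = ids[8]      # the image that vokens refer to as "vg_nococo/8"
feature = keys[8]    # its visual feature ("key"), extracted in Step 2
```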
#### Extracting Image Features

Extract image features according to the list built above, using the code in [vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py). The code first reads the image ids saved in `data/vokenization/images/{IMAGE_SET}_ids.txt` and locates the images. The features will be saved under `snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5`. It finishes within 1 hour.

Commands:
```bash
# Step 2, Extract features.
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME
bash scripts/extract_keys.bash 0 bert_resnext
```

### Benchmarking Cross-Modal Matching Models (Optional)

> Before evaluating, please make sure that the `extracting_image_features` and `tokenization` steps are completed.

We benchmark the performance of the cross-modal matching models at a large scale. The evaluation includes two different metrics: diversity and retrieval performance. Diversity (in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py)) checks that the same [token type](https://arxiv.org/pdf/1902.06006.pdf) is mapped to diverse images depending on its context (i.e., the sentence). Retrieval (in [vokenization/evaluate_retrieval.py](vokenization/evaluate_retrieval.py)) measures the correspondence between the tokens and the retrieved images. We gather these two utilities into one script with the command:
```bash
bash scripts/xmatching_benchmark.bash 0 bert_resnext
```

### The Vokenization Process

After all these steps, we can start to vokenize the language corpus. It loads the tokens saved in `dataset_name.tokenizer_name.hdf5` and uses the line-split information in `dataset_name.tokenizer_name.line`. The code is optimized and can be resumed by simply rerunning it. The vokens will be saved in `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5` by default. The file `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids` contains the universal image ids for each voken, e.g., the image id `vg_nococo/8` corresponds to the 8-th feature saved in `snap/xmatching/bert_resnext/keys/vg_nococo.hdf5`.

> Note: `--tokenizer-name` must be provided in the script.

Commands:
1. Wiki103 (around 1 hour on 4 Titan V)
    ```shell script
    # Note: mp is the abbreviation for "multi-processing"
    # bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
    ```
2. English Wikipedia (around 1 day on 4 Titan V)
    ```shell script
    # bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext
    ```

> The script will call
> [vokenization/vokenize_corpus_mp.py](vokenization/vokenize_corpus_mp.py)
> to vokenize a corpus.
> The vokenization happens in [vokenization/vokenization.py](vokenization/vokenization.py), and
> it uses [vokenization/indexing.py](vokenization/indexing.py) to do the nearest-neighbor search
> (based on [faiss](https://github.com/facebookresearch/faiss)); a sketch of this search follows.
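At its core, this search is a maximum-inner-product lookup of each contextual token embedding against the image keys. A minimal sketch with faiss (the array sizes and names are illustrative; the actual wrapper is [vokenization/indexing.py](vokenization/indexing.py)):

```python
# Retrieve the nearest image ("voken") for each contextual token embedding.
import numpy as np
import faiss

dim = 64                                                  # joint-space dimension
keys = np.random.rand(50000, dim).astype('float32')       # image keys (features)
index = faiss.IndexFlatIP(dim)                            # inner-product index
index.add(keys)

token_embs = np.random.rand(128, dim).astype('float32')   # one batch of token embeddings
_, voken_ids = index.search(token_embs, 1)                 # top-1 image id per token
```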
## Visually-Supervised Language Model (vlm)

### Pre-Training with VLM

As discussed in Sec. 2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf), we use the previously generated vokens to pre-train the model with visual supervision (a sketch of this voken-classification objective is given at the end of this section).

#### Wiki103

After the [vokenization process](#the-vokenization-process) of wiki103, we can run the model with the command:
```shell script
# bash scripts/small_vlm_wiki103_glue.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small
```
It will call [vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py) and run a BERT-6Layers-512Hiddens model on the [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset with the support of voken supervision. The snapshot will be saved to `snap/vlm/wiki103_bert_small`. We recommend running this Wiki103 experiment first since it finishes in a reasonable time (20 hours). The pure BERT pre-training option is also available [later](#bert-as-baselines) for comparison.

Note: by default, mixed-precision training is not used. To support mixed-precision pre-training, please install the [nvidia/apex](https://github.com/NVIDIA/apex) library with the commands:
```shell script
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
After that, you can bring back the options `--fp16` and `--fp16_opt_level O2` in the script `scripts/small_vlm_wiki103.bash`. I recommend using `--fp16_opt_level O2`. Although the option O2 might be [unstable](https://github.com/NVIDIA/apex/issues/818#issuecomment-639012282), it saves a lot of memory: the max per-gpu batch size is 32 with O1 but 64 with O2.

#### English Wikipedia

After the [vokenization process](#the-vokenization-process) of English Wikipedia, we can run the model with the command:
```shell script
# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base
```
It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia dataset with the support of voken supervision. The snapshot will be saved to `snap/vlm/wiki_bert_base`. It takes around 3-5 days on 4 Titan V / GTX 2080 cards and around 5-7 days on 4 Titan Pascal / T4 cards. (This estimate is accurate since I inevitably ran experiments on all these servers...) Titan V / 2080 / T4 have native support for mixed-precision training (triggered by the `--fp16` option; requires installing [apex](https://github.com/NVIDIA/apex)), which makes training much faster. Titan Pascal would also save some memory with the `--fp16` option.
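To make the objective concrete, the sketch below combines the standard masked-LM loss with a per-token classification over the voken "vocabulary". The names (`voken_head`, `vlm_loss`) and the weighting are illustrative; the actual implementation is in [vlm/model.py](vlm/model.py) and [vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py).

```python
import torch.nn as nn

IGNORE_ID = -100        # positions without a voken label are skipped
VOKEN_SIZE = 50000      # number of candidate vokens (cf. --max-img-num)
HIDDEN_DIM = 512        # hidden size of the small BERT config

voken_head = nn.Linear(HIDDEN_DIM, VOKEN_SIZE)       # voken-classification head
ce = nn.CrossEntropyLoss(ignore_index=IGNORE_ID)

def vlm_loss(hidden, mlm_logits, mlm_labels, voken_labels, mlm_ratio=1.0):
    # hidden:       (batch, len, HIDDEN_DIM) final transformer hidden states
    # mlm_logits:   (batch, len, vocab_size) masked-LM predictions
    # voken_labels: (batch, len) retrieved voken ids (IGNORE_ID where absent)
    mlm_loss = ce(mlm_logits.flatten(0, 1), mlm_labels.flatten())
    voken_loss = ce(voken_head(hidden).flatten(0, 1), voken_labels.flatten())
    return mlm_ratio * mlm_loss + voken_loss
```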
### GLUE Evaluation

By default, we use the [GLUE](https://gluebenchmark.com/) benchmark (e.g., [SST](https://nlp.stanford.edu/sentiment/index.html), [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398), [QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), [MNLI](https://cims.nyu.edu/~sbowman/multinli/), [QNLI](https://rajpurkar.github.io/SQuAD-explorer/)) as the downstream tasks. Other tasks can be evaluated following the setup [here](https://github.com/huggingface/transformers/tree/28d183c90cbf91e94651cf4a655df91a52ea1033/examples) by changing the option `--model_name_or_path` to the correct snapshot path, e.g., `snap/bert/wiki103`.

#### Download GLUE dataset

This downloading script is copied from the huggingface [transformers](https://github.com/huggingface/transformers/tree/master/examples/text-classification) project. Since [transformers](https://github.com/huggingface/transformers) is still under active development, API changes might affect the code; I have upgraded the code for compatibility with transformers==3.3.
```shell script
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py
python download_glue_data.py --data_dir data/glue --tasks all
```

#### Finetuning on GLUE Tasks

The pre-trained snapshots are evaluated by fine-tuning them on the [GLUE](https://gluebenchmark.com/) benchmark. The code is modified from huggingface [transformers](https://github.com/huggingface/transformers).

Running GLUE evaluation for snapshots from different epochs:
```bash
# bash scripts/run_glue_epochs.bash $GPUS $SNAP_PATH --snaps $NUM_OF_SNAPS
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7
```
It will assess 7 snapshots using GPUs 0,1,2,3. Setting `--snaps -1` will assess all checkpoints. If you just want to evaluate the last (usually the best) snapshot, please use:
```
bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1
```

#### Showing the results

For all results saved under `snap/` (whatever the dir names are), running the following command will print out all the results.
```bash
python vlm/show_glue_results_epochs.py
```
It will print results like
```
snap/vlm/test_finetune/glueepoch_checkpoint-epoch0019
RTE     MRPC    STS-B   CoLA    SST-2   QNLI    QQP     MNLI    MNLI-MM GLUE
54.51   84.72   87.18   52.32   90.02   88.36   87.16   81.92   82.57   78.75
snap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029
RTE     MRPC    STS-B   CoLA    SST-2   QNLI    QQP     MNLI    MNLI-MM GLUE
58.12   82.76   84.45   26.74   89.56   84.40   86.52   77.56   77.99   74.23
```

### BERT (As baselines)

We also provide pure language-model pre-training as baselines.

#### Wiki103

```shell script
# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small
```
It will call [vlm/run_lm_distributed.py](vlm/run_lm_distributed.py) and run a BERT-6Layers-512Hiddens model on the [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset with the masked language model only. The snapshot will be saved to `snap/bert/bert_small`. Or you could directly use the script `small_wiki103_glue.bash` to enable GLUE evaluation after pre-training finishes.
```shell script
bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small
```

#### English Wikipedia

Command:
```shell script
# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki
```
With GLUE evaluation:
```shell script
bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki
```

## Pre-processed Data and Pre-trained Models

### Data

Wiki103 (100M tokens)
```
mkdir -p data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
```
Wiki (2.8B tokens)
```
mkdir -p data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased
```

### Models

- Cross-Modal Matching model: [https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip](https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip)
- BERT (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip)
- BERT + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip)
- RoBERTa + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip)
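These snapshots can be loaded with huggingface transformers. A minimal sketch, assuming the zip unpacks to a standard snapshot directory (containing `config.json` and `pytorch_model.bin`; the path below is illustrative):

```python
from transformers import AutoTokenizer, BertForMaskedLM

# Unzipped snapshot directory (illustrative path).
snapshot = "snap/vlm/vlm_12L_768H_wiki"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained(snapshot)
```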
## Reference

If you find our project useful, please cite this paper:
```
@inproceedings{tan2020vokenization,
  title={Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}
```

## Acknowledgement

I thank the support from the [Bloomberg Data Science Ph.D. Fellowship](https://www.techatbloomberg.com/bloomberg-data-science-ph-d-fellowship/). We thank the reviewers and [Yixin Nie](https://easonnie.github.io/) and [Jie Lei](https://www.cs.unc.edu/~jielei/) for their helpful discussions. Parts of the code are built based on huggingface [transformers](https://github.com/huggingface/transformers), facebook [xlm](https://github.com/facebookresearch/XLM), and [faiss](https://github.com/facebookresearch/faiss).

================================================
FILE: data/lxmert/.gitignore
================================================
/mscoco_minival.json
/mscoco_nominival.json
/mscoco_train.json
/vgnococo.json

================================================
FILE: data/mscoco/.gitignore
================================================
/images

================================================
FILE: data/vg/.gitignore
================================================
/images

================================================
FILE: data/wiki/get_data_cased.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
#
# Usage: ./get-data-wiki.sh $lg (en)
#

set -e

lg=$1  # input language

# data path
WIKI_PATH=data/wiki-cased
MAIN_PATH=$WIKI_PATH

# tools paths
TOOLS_PATH=$MAIN_PATH/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py

# Wiki data
WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME

# install tools
data/wiki/install-tools.sh $TOOLS_PATH

# create Wiki paths
mkdir -p $WIKI_PATH/bz2
mkdir -p $WIKI_PATH/txt

# download Wikipedia dump
echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..."
wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/
echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME"

# extract and tokenize Wiki data
echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***"
#python -m $TOOLS_PATH/wikiextractor/wikiextractor/WikiExtractor $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then
  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \
  | sed "/^\s*\$/d" \
  | grep -v "^\$" \
  | $TOKENIZE $lg $TOOLS_PATH \
  | python $REMOVE_ACCENT \
  > $WIKI_PATH/txt/$lg.all.raw
fi

echo "*** Tokenized ( + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***"

# split into train / valid / test
echo "*** Split into train / valid / test ***"
split_data() {
    NLINES=`wc -l $1 | awk -F " " '{print $1}'`;
    NTRAIN=$((NLINES - 10000));
    NVAL=$((NTRAIN + 5000));
    cat $1 | head -$NTRAIN > $2;
    cat $1 | head -$NVAL | tail -5000 > $3;
    cat $1 | tail -5000 > $4;
}
split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw

# File structure
mv $WIKI_PATH/txt/* $WIKI_PATH/
rm -rf $WIKI_PATH/bz2
rm -rf $WIKI_PATH/txt

================================================
FILE: data/wiki/get_data_cased_untokenized.bash
================================================
# Copyright (c) 2019-present, Facebook, Inc.
# Copied from https://github.com/facebookresearch/XLM
# All rights reserved.
# # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # # # Usage: ./get-data-wiki.sh $lg (en) # set -e lg=$1 # input language # data path WIKI_PATH=data/wiki-cased-untokenized MAIN_PATH=$WIKI_PATH # tools paths TOOLS_PATH=$MAIN_PATH/tools TOKENIZE=$TOOLS_PATH/tokenize.sh REMOVE_ACCENT=$TOOLS_PATH/remove_accent.py # Wiki data WIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2 WIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME # install tools data/wiki/install-tools.sh $TOOLS_PATH # create Wiki paths mkdir -p $WIKI_PATH/bz2 mkdir -p $WIKI_PATH/txt # download Wikipedia dump if [ ! -f $WIKI_PATH/bz2/enwiki-latest-pages-articles.xml.bz2 ]; then echo "Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ..." wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/ echo "Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME" fi # extract and tokenize Wiki data #cd $MAIN_PATH echo "*** Cleaning and tokenizing $lg Wikipedia dump ... ***" if [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \ | sed "/^\s*\$/d" \ | grep -v "^\$" \ | python $REMOVE_ACCENT \ > $WIKI_PATH/txt/$lg.all.raw fi echo "*** Not Tokenized ( but + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***" # split into train / valid / test echo "*** Split into train / valid / test ***" split_data() { NLINES=`wc -l $1 | awk -F " " '{print $1}'`; NTRAIN=$((NLINES - 10000)); NVAL=$((NTRAIN + 5000)); cat $1 | head -$NTRAIN > $2; cat $1 | head -$NVAL | tail -5000 > $3; cat $1 | tail -5000 > $4; } split_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw # File structure mv $WIKI_PATH/txt/* $WIKI_PATH/ rm -rf $WIKI_PATH/bz2 rm -rf $WIKI_PATH/txt ================================================ FILE: data/wiki/install-tools.sh ================================================ # Copyright (c) 2019-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # set -e # data path TOOLS_PATH=$1 # tools MOSES_DIR=mosesdecoder FASTBPE_DIR=fastBPE FASTBPE=fast WMT16_SCRIPTS=wmt16-scripts # tools path mkdir -p $TOOLS_PATH # Copy the scripts to TOOLS_PATH cp -r data/wiki/tools/* $TOOLS_PATH # # Download and install tools # old=$(pwd) cd $TOOLS_PATH # Download Moses if [ ! -d "$MOSES_DIR" ]; then echo "Cloning Moses from GitHub repository..." git clone https://github.com/moses-smt/mosesdecoder.git fi # Download fastBPE if [ ! -d "$FASTBPE_DIR" ]; then echo "Cloning fastBPE from GitHub repository..." git clone https://github.com/glample/fastBPE fi # Compile fastBPE if [ ! -f "$FASTBPE_DIR/$FASTBPE" ]; then echo "Compiling fastBPE..." cd fastBPE g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast cd .. fi # Download Sennrich's tools if [ ! -d "$WMT16_SCRIPTS" ]; then echo "Cloning WMT16 preprocessing scripts..." git clone https://github.com/rsennrich/wmt16-scripts.git fi # Download WikiExtractor if [ ! -d wikiextractor ]; then echo "Cloning WikiExtractor from GitHub repository..." git clone https://github.com/attardi/wikiextractor.git cd wikiextractor git checkout e4abb4cbd019b0257824ee47c23dd163919b731b cd .. fi cd $old # # Chinese segmenter # if ! 
ls $TOOLS_PATH/stanford-segmenter-* 1> /dev/null 2>&1; then # echo "Stanford segmenter not found at $TOOLS_PATH/stanford-segmenter-*" # echo "Please install Stanford segmenter in $TOOLS_PATH" # exit 1 # fi # # # Thai tokenizer # if ! python -c 'import pkgutil; exit(not pkgutil.find_loader("pythainlp"))'; then # echo "pythainlp package not found in python" # echo "Please install pythainlp (pip install pythainlp)" # exit 1 # fi # ================================================ FILE: data/wiki/tools/remove_accent.py ================================================ # Copyright (c) 2019-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # import sys import unicodedata import six def convert_to_unicode(text): """ Converts `text` to Unicode (if it's not already), assuming UTF-8 input. """ # six_ensure_text is copied from https://github.com/benjaminp/six def six_ensure_text(s, encoding='utf-8', errors='strict'): if isinstance(s, six.binary_type): return s.decode(encoding, errors) elif isinstance(s, six.text_type): return s else: raise TypeError("not expecting type '%s'" % type(s)) return six_ensure_text(text, encoding="utf-8", errors="ignore") def run_strip_accents(text): """ Strips accents from a piece of text. """ text = unicodedata.normalize("NFD", text) output = [] for char in text: cat = unicodedata.category(char) if cat == "Mn": continue output.append(char) return "".join(output) for line in sys.stdin: line = convert_to_unicode(line.rstrip()) line = run_strip_accents(line) print(u'%s' % line) ================================================ FILE: data/wiki/tools/segment_th.py ================================================ # Copyright (c) 2019-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # import sys from pythainlp.tokenize import word_tokenize for line in sys.stdin.readlines(): line = line.rstrip('\n') print(' '.join(word_tokenize(line))) ================================================ FILE: data/wiki/tools/tokenize.sh ================================================ # Copyright (c) 2019-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. # # Tokenize text data in various languages # Usage: e.g. 
cat wiki.ar | tokenize.sh ar

set -e

N_THREADS=8

lg=$1
TOOLS_PATH=$2

# moses
MOSES=$TOOLS_PATH/mosesdecoder
REPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl
TOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl

# Chinese
if [ "$lg" = "zh" ]; then
    $TOOLS_PATH/stanford-segmenter-*/segment.sh pku /dev/stdin UTF-8 0 | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR
# Thai
elif [ "$lg" = "th" ]; then
    cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $TOOLS_PATH/segment_th.py
# Japanese
elif [ "$lg" = "ja" ]; then
    cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | kytea -notags
# other languages
else
    cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
fi

================================================
FILE: data/wiki103/get_data_cased.sh
================================================
OUTPUT=data/wiki103-cased
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-raw-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103-raw/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-raw-v1.zip $OUTPUT/wikitext-103-raw

================================================
FILE: data/wiki103/get_data_uncased.sh
================================================
OUTPUT=data/wiki103
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip -P $OUTPUT/
unzip $OUTPUT/wikitext-103-v1.zip -d $OUTPUT
mv $OUTPUT/wikitext-103/* $OUTPUT
rm -rf $OUTPUT/wikitext-103-v1.zip $OUTPUT/wikitext-103

================================================
FILE: requirements.txt
================================================
torch #==1.4.0
torchvision #==0.5.0
transformers==3.3.0
tensorboardX

# For GLUE evaluation
sklearn

# Faiss supports fast indexing.
# The code also has a torch-implemented GPU indexing, so do not worry if you cannot install faiss.
faiss-gpu>=1.6.3

# Spacy is used in sentence segmentation, where the sentences are the input to the cross-modality matching model.
spacy # A higher h5py version to support h5py.VirtualLayout h5py>=2.10.0 ================================================ FILE: scripts/base_vlm_wiki.bash ================================================ # The name of experiment GPUS=$1 NAME=$2 # Create dirs and make backup output=snap/bert/$NAME mkdir -p $output/src cp -r vlm $output/src/ cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash cp $0 $output/run.bash export TRAIN_FILE=data/wiki-cased/en.train.raw export TEST_FILE=data/wiki-cased/en.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-12L-768H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=32 \ --per_gpu_eval_batch_size=32 \ --gradient_accumulation_steps=2 \ --max_steps=200000 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=5000 \ --mlm_probability 0.15 \ --mlm_ratio 1.0 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --do_voken_cls \ --voken_labels all \ --voken_dir snap/xmatching/bert_resnext/vokens \ --voken_suffix vg_nococo \ --mlm ${@:3} | tee $output/log.log #--fp16 \ #--fp16_opt_level O2 \ ================================================ FILE: scripts/base_vlm_wiki_glue.bash ================================================ # The name of experiment GPUS=$1 NAME=$2 # Create dirs and make backup output=snap/bert/$NAME mkdir -p $output/src cp -r vlm $output/src/ cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash cp $0 $output/run.bash export TRAIN_FILE=data/wiki-cased/en.train.raw export TEST_FILE=data/wiki-cased/en.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-12L-768H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=32 \ --per_gpu_eval_batch_size=32 \ --gradient_accumulation_steps=2 \ --max_steps=200000 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=5000 \ --mlm_probability 0.15 \ --mlm_ratio 1.0 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --do_voken_cls \ --voken_labels all \ --voken_dir snap/xmatching/bert_resnext/vokens \ --voken_suffix vg_nococo \ --mlm ${@:3} | tee $output/log.log #--fp16 \ #--fp16_opt_level O2 \ # Wait for clearing the GPU cache sleep 30 bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4 ================================================ FILE: scripts/base_wiki.bash ================================================ GPUS=$1 # The name of experiment NAME=$2 # Create dirs and make backup output=snap/bert/$NAME mkdir -p $output/src cp -r vlm/*.py $output/src/ cp $0 $output/run.bash cp run_glue_epochs.bash $output/run_glue_epochs.bash cp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash export TRAIN_FILE=data/wiki-cased/en.train.raw export TEST_FILE=data/wiki-cased/en.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-12L-768H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=64 \ 
--per_gpu_eval_batch_size=64 \ --gradient_accumulation_steps=1 \ --max_steps 220000 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=5000 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --mlm ${@:3} | tee $output/log.log #--fp16 \ #--fp16_opt_level O2 \ ================================================ FILE: scripts/base_wiki_glue.bash ================================================ GPUS=$1 # The name of experiment NAME=$2 # Create dirs and make backup output=snap/bert/$NAME mkdir -p $output/src cp -r vlm/*.py $output/src/ cp $0 $output/run.bash cp run_glue_epochs.bash $output/run_glue_epochs.bash cp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash export TRAIN_FILE=data/wiki-cased/en.train.raw export TEST_FILE=data/wiki-cased/en.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-12L-768H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=64 \ --per_gpu_eval_batch_size=64 \ --gradient_accumulation_steps=1 \ --max_steps 220000 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=5000 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --mlm ${@:3} | tee $output/log.log #--fp16 \ #--fp16_opt_level O2 \ #--shuffle \ # Wait for clearing the GPU cache sleep 30 bash scripts/run_glue_epochs.bash $GPUS $output --snaps -1 ================================================ FILE: scripts/extract_keys.bash ================================================ CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \ --image-sets vg_nococo,coco_minival,coco_nominival,coco_train,cc_valid \ --load-dir snap/xmatching/$2 ================================================ FILE: scripts/mpvokenize_wiki.bash ================================================ GPU=$1 LOAD=snap/xmatching/$2 DATA_DIR=data/wiki-cased TOKENIZER=bert-base-uncased for DATA_NAME in en.valid.raw en.test.raw en.train.raw do CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \ --load $LOAD \ --corpus=$DATA_DIR/$DATA_NAME \ --tokenizer-name $TOKENIZER \ --image-sets vg_nococo \ --max-img-num 50000 done ================================================ FILE: scripts/mpvokenize_wiki103.bash ================================================ GPU=$1 LOAD=snap/xmatching/$2 WIKI_DIR=data/wiki103-cased TOKENIZER=bert-base-uncased for DATA_NAME in wiki.valid.raw wiki.test.raw wiki.train.raw do CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \ --load $LOAD \ --corpus=$WIKI_DIR/$DATA_NAME \ --tokenizer-name $TOKENIZER \ --image-sets vg_nococo \ --max-img-num 50000 done ================================================ FILE: scripts/run_glue_at_epoch.bash ================================================ export GLUE_DIR=data/glue/ EPOCHS=$2 MODEL=$3 CKPT=$4 for TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI do CUDA_VISIBLE_DEVICES=$1 python vlm/run_glue.py \ --model_type bert \ --tokenizer_name=bert-base-uncased \ --model_name_or_path $MODEL/$CKPT \ --task_name $TASK_NAME \ --do_train \ --do_eval \ --do_lower_case \ --data_dir $GLUE_DIR/$TASK_NAME \ --save_steps -1 \ --max_seq_length 126 \ --per_gpu_eval_batch_size=32 \ --per_gpu_train_batch_size=32 \ --learning_rate 1e-4 \ --warmup_steps 0.1 \ --num_train_epochs $EPOCHS.0 \ --output_dir 
$MODEL/glueepoch_$CKPT/$TASK_NAME done #--overwrite_output_dir \ ================================================ FILE: scripts/run_glue_epochs.bash ================================================ GPUS=$1 MODEL=$2 python vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \ ${@:3} ================================================ FILE: scripts/run_xmatching.bash ================================================ GPUS=$1 # The name of experiment NAME=$2 # Create dirs and make backup output=snap/xmatching/$NAME mkdir -p $output/src/ cp -r xmatching $output/src/ cp $0 $output/run.bash # Pre-training CUDA_VISIBLE_DEVICES=$GPUS unbuffer python xmatching/main.py \ --train-imgs mscoco_train,mscoco_nominival --valid-imgs mscoco_minival \ --train-langs mscoco --valid-langs mscoco \ --max-len 20 --dim 64 \ --lang-layers 4,3,2,1 \ --lang-pretrained --visn-pretrained \ --num-workers 8 --batchSize 256 --optim adam --lr 1e-3 --epochs 20 \ --nodes 1 --nr 0 \ --output $output ${@:3} | tee $output/log.log #--visn resnext101_32x8d --lang bert \ ================================================ FILE: scripts/small_vlm_wiki103.bash ================================================ # The name of experiment GPUS=$1 NAME=$2 # Create dirs and make backup output=snap/vlm/$NAME mkdir -p $output/src cp -r vlm $output/src/ cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash cp $0 $output/run.bash export TRAIN_FILE=data/wiki103-cased/wiki.train.raw export TEST_FILE=data/wiki103-cased/wiki.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-6L-512H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=32 \ --per_gpu_eval_batch_size=32 \ --gradient_accumulation_steps=2 \ --num_train_epochs=40 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=10000 \ --mlm_probability 0.15 \ --mlm_ratio 1.0 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --do_voken_cls \ --voken_labels all \ --voken_dir snap/xmatching/bert_resnext/vokens \ --voken_suffix vg_nococo \ --mlm ${@:3} | tee $output/log.log #--fp16 \ #--fp16_opt_level O2 \ ================================================ FILE: scripts/small_vlm_wiki103_glue.bash ================================================ # The name of experiment GPUS=$1 NAME=$2 # Create dirs and make backup output=snap/vlm/$NAME mkdir -p $output/src cp -r vlm $output/src/ cp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash cp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash cp $0 $output/run.bash export TRAIN_FILE=data/wiki103-cased/wiki.train.raw export TEST_FILE=data/wiki103-cased/wiki.valid.raw # Pre-training CUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \ --output_dir=$output \ --overwrite_output_dir \ --config_name=vlm/configs/bert-6L-512H.json \ --tokenizer_name=bert-base-uncased \ --model_type=bert \ --block_size=126 \ --per_gpu_train_batch_size=32 \ --per_gpu_eval_batch_size=32 \ --gradient_accumulation_steps=2 \ --num_train_epochs=40 \ --learning_rate=2e-4 \ --weight_decay=0.01 \ --warmup_steps=10000 \ --mlm_probability 0.15 \ --mlm_ratio 1.0 \ --do_train \ --train_data_file=$TRAIN_FILE \ --do_eval \ --eval_data_file=$TEST_FILE \ --col_data \ --split_sent \ --do_voken_cls \ --voken_labels all \ --voken_dir 
snap/xmatching/bert_resnext/vokens \
    --voken_suffix vg_nococo \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
    #--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4

================================================
FILE: scripts/small_wiki103.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
    --overwrite_output_dir \
    --config_name=vlm/configs/bert-6L-512H.json \
    --tokenizer_name=bert-base-uncased \
    --model_type=bert \
    --block_size=126 \
    --per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
    --gradient_accumulation_steps=1 \
    --num_train_epochs=44 \
    --learning_rate=2e-4 \
    --weight_decay=0.01 \
    --warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
    #--fp16_opt_level O2 \

================================================
FILE: scripts/small_wiki103_glue.bash
================================================
# The name of experiment
GPUS=$1
NAME=$2

# Create dirs and make backup
output=snap/bert/$NAME
mkdir -p $output/src
cp -r vlm/*.py $output/src/
cp $0 $output/run.bash

export TRAIN_FILE=data/wiki103-cased/wiki.train.raw
export TEST_FILE=data/wiki103-cased/wiki.valid.raw

# Pre-training
CUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \
    --output_dir=$output \
    --overwrite_output_dir \
    --config_name=vlm/configs/bert-6L-512H.json \
    --tokenizer_name=bert-base-uncased \
    --model_type=bert \
    --block_size=126 \
    --per_gpu_train_batch_size=64 \
    --per_gpu_eval_batch_size=64 \
    --gradient_accumulation_steps=1 \
    --num_train_epochs=44 \
    --learning_rate=2e-4 \
    --weight_decay=0.01 \
    --warmup_steps=10000 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --col_data \
    --split_sent \
    --shuffle \
    --mlm ${@:3} | tee $output/log.log

    #--fp16 \
    #--fp16_opt_level O2 \

# Wait for clearing the GPU cache
sleep 30
bash scripts/run_glue_epochs.bash $GPUS $output --snaps 4

================================================
FILE: scripts/xmatching_benchmark.bash
================================================
# Benchmarking the cross-modal matching model with
# 1. Retrieval scores.
# 2. Voken diversity w.r.t. words in a specific language corpus.
# Please run this after image-key extraction and tokenization,
# i.e., step 1 and step2 in readme.md MODEL=$2 MODELPATH=snap/xmatching/$MODEL rm -rf $MODELPATH/analysis.log # Retrieval scores CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_retrieval.py \ --load $MODELPATH \ --image-sets coco_minival,cc_valid \ | tee -a $MODELPATH/analysis.log # Diversity # Test diversity of vision-and-language (captioning) datasets CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \ --load $MODELPATH \ --image-sets vg_nococo \ --corpus coco_minival,cc_valid \ | tee -a $MODELPATH/analysis.log # Test diversity of pure-language corpus CUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \ --load $MODELPATH \ --image-sets vg_nococo \ --corpus data/wiki103-cased/wiki.valid.raw \ --maxsents 95000 \ | tee -a $MODELPATH/analysis.log ================================================ FILE: snap/bert/.gitkeep ================================================ ================================================ FILE: snap/vlm/.gitkeep ================================================ ================================================ FILE: snap/xmatching/.gitkeep ================================================ /* ================================================ FILE: tokenization/to_hdf5.py ================================================ import h5py import numpy as np import tqdm from transformers import AutoTokenizer def validate_hdf5(fname, tokenizer_name): print("--------------------------------------------") print("Start to valid the hdf5 file", fname + '.' + tokenizer_name + '.hdf5') with open(fname) as f: lines = [] for line in f: if 'wiki' in fname: # Wiki103: remove document title if line.startswith(' = '): continue # Full Wiki: Remove the too short lines. if len(line.strip().split(' ')) < 5: continue if len(line.strip()) == 0: # Always drop empty line continue lines.append(line) # Use the slow tokenizer to validate the results of the fast tokenizer. tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'r') tokens = h5_file['tokens'] print("Start to check the first 10 lines:") ids = [] for line in lines[:10]: ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))) ids = np.array(ids) first_tokens = np.array(tokens[:len(ids)]) if np.array_equal(ids, first_tokens): print("PASS") else: print(' '.join(tokenizer.convert_ids_to_tokens(ids))) print() print(' '.join(tokenizer.convert_ids_to_tokens(first_tokens))) assert False, "FAIL" print("Start to check the last 10 lines:") ids = [] for line in lines[-10:]: ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))) ids = np.array(ids) last_tokens = np.array(tokens[-len(ids):]) if np.array_equal(ids, last_tokens): print("PASS") else: print(' '.join(tokenizer.convert_ids_to_tokens(ids))) print(' '.join(tokenizer.convert_ids_to_tokens(last_tokens))) assert False, "FAIL" print("--------------------------------------------") def to_hdf5(fname, tokenizer_name, validate=True): print("Process %s" % fname) h5_file = h5py.File(fname + '.' 
+ tokenizer_name + '.hdf5', 'w') dset = h5_file.create_dataset("tokens", (0,), maxshape=(None,), dtype='int32') dump_interval = 1000000 dump_iter = 0 with open('%s.%s' % (fname, tokenizer_name)) as f: lines = 0 tokens = [] for line in tqdm.tqdm(f): for token in map(int, line.split(' ')): tokens.append(token) if len(tokens) >= dump_interval: dset.resize((dump_iter + len(tokens),)) dset[dump_iter: dump_iter + len(tokens)] = tokens dump_iter += len(tokens) tokens = [] lines += 1 dset.resize((dump_iter + len(tokens),)) dset[dump_iter: dump_iter + len(tokens)] = tokens dump_iter += len(tokens) assert len(dset) == dump_iter h5_file.close() if validate: validate_hdf5(fname, tokenizer_name) print() ================================================ FILE: tokenization/tokenize_dataset.py ================================================ # coding=utf-8 # Copyleft 2020 project COL. import argparse from pathlib import Path from transformers import AutoTokenizer import time from to_hdf5 import to_hdf5 def tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=False): data_path = Path(data_dir) tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True) f = open(data_path / fname) g = open((data_path / ('%s.%s' % (fname, tokenizer_name))), 'w') # Statistics dcmt_cnt = 0 token_cnt = 0 line_cnt = 0 line_starts = [] # Logging and dumping hyper-parameters cache = '' log_interval = log_iter = 1000000 dump_interval = dump_iter = 100000 start_time = time.time() for i, line in enumerate(f): # Identify the start of documents, ignore it. if 'wiki103' in data_dir: if line.startswith(' = '): dcmt_cnt += 1 continue elif 'wiki' in data_dir: if len(line.strip().split(' ')) == 1: dcmt_cnt += 1 continue if 'wiki' in data_dir: # Remove too short lines. Book corpus does not need this. if len(line.strip().split(' ')) < 5: continue # Drop empty line (1) if len(line.strip()) == 0: continue tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)) # tokenized_line = tokenizer.encode(line, add_special_tokens=False) if len(tokenized_line) == 0: # Drop empty line (2) continue line_cnt += 1 line_starts.append(token_cnt) if i < 5: print() print('Line:', line) print('Tokens:', ' '.join(tokenizer.convert_ids_to_tokens(tokenized_line))) token_cnt += len(tokenized_line) cache += ' '.join(map(str, tokenized_line)) + '\n' if (token_cnt + 1) > dump_iter: g.write(cache) cache = '' dump_iter += dump_interval if (token_cnt + 1) > log_iter: used_time = time.time() - start_time print("Process %d tokens in %d seconds, %0.4f tokens per second." % ( token_cnt, used_time, token_cnt / used_time)) log_iter += log_interval # Deal with the last remaining tokens. line_starts.append(token_cnt) g.write(cache) # Dump Line starts identifier = 'sent' if lines_are_sents else 'line' with open(data_path / ('%s.%s.%s' % (fname, tokenizer_name, identifier)), 'w') as f: for line_start in line_starts: f.write(str(line_start) + "\n") f.close() g.close() print(f"Documents: {dcmt_cnt}, Lines: {line_cnt}, Words: {token_cnt} in dataset {fname}") to_hdf5(str(data_path / fname), tokenizer_name) if __name__ == "__main__": parser = argparse.ArgumentParser() # Required parameters parser.add_argument( "datadir", default=None, type=str, help="The input training data file (a text file)." ) parser.add_argument( "fname", default=None, type=str, help="The input training data file (a text file)." ) parser.add_argument( "tokenizer_name", default=None, type=str, help="The input training data file (a text file)." 
) parser.add_argument( "--lines-are-sents", action='store_true', help="Add this if the line are already segmented to sentences, instead of paragraphs." ) param = parser.parse_args() tokenize_dataset( param.datadir, param.fname, param.tokenizer_name, param.lines_are_sents, ) ================================================ FILE: tokenization/tokenize_wiki103_bert.bash ================================================ DATA_DIR=data/wiki103-cased TOKENIZER=bert-base-uncased python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER ================================================ FILE: tokenization/tokenize_wiki103_roberta.bash ================================================ DATA_DIR=data/wiki103-cased TOKENIZER=roberta-base python tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER ================================================ FILE: tokenization/tokenize_wiki_bert.bash ================================================ DATA_DIR=data/wiki-cased TOKENIZER=bert-base-uncased python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER ================================================ FILE: tokenization/tokenize_wiki_roberta.bash ================================================ DATA_DIR=data/wiki-cased-untokenized/ TOKENIZER=roberta-base python tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER python tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER ================================================ FILE: vlm/__init__.py ================================================ import data ================================================ FILE: vlm/configs/bert-12L-768H.json ================================================ { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "type_vocab_size": 2, "vocab_size": 30522 } ================================================ FILE: vlm/configs/bert-4L-768H.json ================================================ { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 4, "type_vocab_size": 2, "vocab_size": 30522 } ================================================ FILE: vlm/configs/bert-6L-512H.json ================================================ { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 512, "initializer_range": 0.02, "intermediate_size": 2048, "max_position_embeddings": 512, "num_attention_heads": 8, "num_hidden_layers": 6, "type_vocab_size": 2, "vocab_size": 30522 } ================================================ FILE: 
vlm/configs/bert_base.json ================================================ { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "type_vocab_size": 2, "vocab_size": 30522 } ================================================ FILE: vlm/data.py ================================================ import copy import os import random import h5py import torch from torch.utils.data import DataLoader, Dataset import tqdm class CoLDataset(Dataset): IGNORE_ID = -100 sent_strategy = 'first' def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512, split_sent=False, voken_dir=None, suffix=None, verbose=False, voken_ablation=None): # Open token's hdf5 token_path = file_path + '.' + tokenizer_name + '.hdf5' assert os.path.isfile(token_path) if verbose: print("-------- Load Data -------") print("Load tokens from", token_path) self.token_hdf5 = h5py.File(token_path, 'r') self.tokenizer = tokenizer self.tokens = self.token_hdf5['tokens'] self.verbose = verbose self.voken_ablation = voken_ablation self._iter_cnt = 0 # Open voken's hdf5 and load voken ids if voken_dir is not None: assert suffix is not None, 'Please provide suffix of the voken, e.g., vg_nococo.5000.' self.sent_level = 'sent' in voken_dir dset_fname = os.path.split(file_path)[-1] voken_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.hdf5") voken_ids_path = os.path.join(voken_dir, f"{dset_fname}.{suffix}.ids") if verbose: print("Load vokens from", voken_path) self.voken_hdf5 = h5py.File(voken_path, 'r') self.vokens = self.voken_hdf5['vokens'] assert len(self.vokens) == len(self.tokens) self._voken_ids = list( map(lambda x: x.strip(), open(voken_ids_path).readlines()) ) if verbose: print("\t with voken size", self.voken_size) print("\t top 5 voken ids are:", self._voken_ids[:5]) else: self.vokens = None # Split for every block_size tokens # The last block without full length will be dropped. 
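        # A worked example of the split below: with 1,200 tokens and block_size=512,
        # starts = [0, 512, 1024] and batches = [(0, 512), (512, 1024)];
        # the partial tail [1024, 1200) has no end marker and is dropped.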
        num_tokens = len(self.tokens)
        self.starts = list(range(0, num_tokens, block_size))
        self.batches = list(zip(self.starts[:-1], self.starts[1:]))

        manual_filtered = False
        if "en.train.raw" in file_path and tokenizer_name == "bert-base-uncased":
            self.batches = manual_filter(self.batches)
            if verbose:
                print("Data: Manually filter the range for counties.")
            manual_filtered = True

        # batch_info
        if verbose:
            print("Split sent with block size", block_size)
            print(f"Total batches: {len(self.batches)}")
            print(f"Total tokens: {len(self.tokens)}")
            if voken_dir is not None:
                print(f"Total vokens: {len(self.vokens)}")
            if voken_ablation is not None:
                print("The model will process voken ablation strategy:", voken_ablation)
            print()

        block_check(self.batches, block_size, fixed_size=True, manual_filtered=manual_filtered)

        if self.voken_ablation == 'token':
            self._voken_ids = list(range(30522))

    @property
    def voken_size(self):
        return len(self._voken_ids)

    @property
    def voken_ids(self):
        return copy.copy(self._voken_ids)

    def assert_equal_vokens(self, dataset):
        assert self.voken_size == dataset.voken_size
        for vid, vid1 in zip(self.voken_ids, dataset.voken_ids):
            assert vid == vid1

    def __len__(self):
        return len(self.batches) - 1

    def __getitem__(self, item):
        token_start, token_end = self.batches[item]
        if self._iter_cnt < 5 and self.verbose:
            print(f"Data Loader: data iteration {self._iter_cnt}, with range {token_start} to {token_end}.")
        self._iter_cnt += 1
        tokens = list(self.tokens[token_start: token_end])
        token_tensor = torch.tensor(
            self.tokenizer.build_inputs_with_special_tokens(tokens),
            dtype=torch.long)
        if self.vokens is not None:
            vokens = list(self.vokens[token_start: token_end])
            vokens = self.maybe_do_sent_level(vokens)
            vokens = self.maybe_do_ablation_study(vokens, tokens)
            voken_tensor = torch.tensor(
                [self.IGNORE_ID] + vokens + [self.IGNORE_ID],
                dtype=torch.long
            )
            return token_tensor, voken_tensor
        else:
            return token_tensor

    def maybe_do_sent_level(self, vokens):
        if not self.sent_level:
            return vokens
        else:
            if self.sent_strategy == 'all':
                vokens = [
                    (-voken - 1 if voken < 0 else voken)
                    for voken in vokens
                ]
            elif self.sent_strategy == 'first':
                vokens = [
                    (self.IGNORE_ID if voken < 0 else voken)
                    for voken in vokens
                ]
            return vokens

    def maybe_do_ablation_study(self, vokens, tokens):
        if self.voken_ablation is None:
            return vokens
        else:
            if self._iter_cnt < 5 and self.verbose:
                print("Before voken ablation: ", vokens)
            if self.voken_ablation == 'random':
                vokens = [random.randint(0, self.voken_size - 1)
                          for _ in range(len(vokens))]
            elif self.voken_ablation == 'shuffle':
                random.shuffle(vokens)
            elif self.voken_ablation == 'reverse':
                vokens = vokens[::-1]
            elif self.voken_ablation == 'token':
                vokens = tokens
            if self._iter_cnt < 5 and self.verbose:
                print("After voken ablation: ", vokens)
            return vokens

    def get_item_info(self, item):
        # Each batch is already a (start, end) pair; unpack it directly.
        token_start, token_end = self.batches[item]
        return token_start, token_end

    def __del__(self):
        self.token_hdf5.close()
        if self.vokens is not None:
            self.voken_hdf5.close()


FORBIDDEN_RANGE = (
    119314944,      # Start of iter 3700
    187053048       # End of iter 5800
)


def intersect(x, y):
    x1, x2 = x
    y1, y2 = y
    if x2 <= y1 or x1 >= y2:
        # Case 1: [ x )[ y )
        # Case 2: [ y )[ x )
        return False
    return True


def manual_filter(batches):
    batches = list(filter(
        lambda x: not intersect(x, FORBIDDEN_RANGE),
        batches
    ))
    return batches


def block_check(batches, block_size, fixed_size=False, manual_filtered=False):
    """
    Check whether the batches satisfy the following requirements.
    1. Monotonic
    2. Mutually exclusive
    3. Range <= block_size
    """
    last_end = 0
    for start_token, end_token in batches:
        assert last_end <= start_token
        if fixed_size:
            assert (end_token - start_token) == block_size, 'len([%d, %d)) != %d' % (start_token, end_token, block_size)
        else:
            assert (end_token - start_token) <= block_size, 'len([%d, %d)) > %d' % (start_token, end_token, block_size)
        if manual_filtered:
            assert not intersect((start_token, end_token), FORBIDDEN_RANGE)
        last_end = end_token


def get_voken_feats(dataset: CoLDataset, feat_dir: str):
    """
    Load the pre-extracted visual features for the img_ids of the vokens.
    """
    set2id2feat = {}
    voken_feats = []
    for voken_id in dataset.voken_ids:
        voken_img_set, voken_img_id = voken_id.split('/')
        if voken_img_set not in set2id2feat:
            img_ids = list(map(
                lambda x: x.rstrip(),
                open(os.path.join(feat_dir, f"{voken_img_set}.ids"))
            ))
            img_feats = h5py.File(
                os.path.join(feat_dir, f"{voken_img_set}.hdf5"), 'r'
            )['keys'][:]
            id2feat = {}
            assert len(img_ids) == len(img_feats)
            for img_id, img_feat in zip(img_ids, img_feats):
                id2feat[img_id] = img_feat
            set2id2feat[voken_img_set] = id2feat
        voken_feats.append(set2id2feat[voken_img_set][voken_img_id])
    return voken_feats


================================================
FILE: vlm/model.py
================================================
import math

import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss
from torch import nn
from transformers import (
    BertConfig,
    BertForMaskedLM,
)
from transformers.modeling_bert import BertOnlyMLMHead

BertLayerNorm = torch.nn.LayerNorm


# The gelu function is copied from huggingface transformers:
# https://github.com/huggingface/transformers/blob/c6acd246ec90857b70f449dcbcb1543f150821fc/src/transformers/activations.py
def _gelu_python(x):
    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
    For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


if torch.__version__ < "1.4.0":
    gelu = _gelu_python
else:
    gelu = F.gelu


class CoLBertConfig(BertConfig):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.voken_size = None
        self.voken_dim = None
        self.do_voken_cls = False
        self.do_voken_reg = False
        self.do_voken_ctr = False
        self.shared_head = False
        self.verbose = False


class BertSharedHead(BertOnlyMLMHead):
    """Shared Bert head that jointly predicts masked tokens and vokens."""

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_ctr = config.do_voken_ctr
        assert int(self.do_voken_cls) + int(self.do_voken_ctr) == 1
        if self.do_voken_cls:
            self.visn_decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
        if self.do_voken_ctr:
            self.visn_decoder = nn.Linear(config.voken_dim, config.hidden_size, bias=True)

    def forward(self, features, **kwargs):
        """
        :param features: [batch, length, dim]
        :return: lang_scores [batch, length, vocab_size], visn_scores [batch, length, voken_size]
        """
        x = self.predictions.transform(features)      # batch_size, length, dim
        lang_scores = self.predictions.decoder(x) + self.predictions.bias
        if self.do_voken_cls:
            visn_scores = self.visn_decoder(x)
        elif self.do_voken_ctr:
            voken_feats = kwargs['voken_feats']
            y = self.visn_decoder(voken_feats)        # voken_size, dim
            visn_scores = torch.einsum('bik,jk->bij', x, y)
        else:
            assert False
        return lang_scores, visn_scores


class BertVLMClassificationHead(nn.Module):
    """Head for voken classification."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)
        # self.decoder = nn.Sequential(
        #     nn.Linear(config.hidden_size, 256, bias=True),
        #     nn.Linear(256, config.voken_size, bias=True),
        # )
        if config.verbose:
            print(f"VLM Classification Head: Build model with voken_size {config.voken_size}")

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder(x)
        return x


class BertVLMContrastiveHeadNew(nn.Module):
    """Head for voken contrastive prediction, scoring tokens against vokens in a joint embedding space."""

    def __init__(self, config):
        super().__init__()
        self.joint_dim = 512
        print(f"Contrastive Head: Using joint dim {self.joint_dim}")
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, self.joint_dim)
        self.layer_norm_x = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)
        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)
        self.layer_norm_y = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm_x(x)

        # Process the pre-trained voken feats.
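        # Tokens and vokens are projected into the same joint space; the einsum
        # below yields a [batch, length, voken_size] score matrix, scaled by
        # sqrt(joint_dim) as in dot-product attention.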
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, joint_dim]
        y = self.layer_norm_y(y)

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size
        return score


class BertVLMContrastiveHead(nn.Module):
    """Head for voken contrastive prediction (older variant with a 64-dim joint space)."""

    def __init__(self, config):
        super().__init__()
        self.voken_size = config.voken_size
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.joint_dim = 64
        self.decoder_bert_output = nn.Linear(config.hidden_size, self.joint_dim, bias=False)
        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)

    def forward(self, bert_output, voken_feats, **kwargs):
        # Process the bert output
        x = self.dense(bert_output)
        x = gelu(x)
        x = self.layer_norm(x)
        x = self.decoder_bert_output(x)               # [b, l, f] --> [b, l, 64]

        # Process the pre-trained voken feats.
        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 64]

        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)
        assert score.dim() == 3 and score.shape[2] == self.voken_size
        return score


class BertVLMRegressionHead(nn.Module):
    """Head for voken feature regression."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.decoder = nn.Linear(config.hidden_size, config.voken_dim, bias=True)

    def forward(self, features, **kwargs):
        x = self.dense(features)
        x = gelu(x)
        x = self.layer_norm(x)
        # Project to the dimension of the voken features (with bias).
        x = self.decoder(x)
        return x


class CoLwithBert(BertForMaskedLM):
    config_class = CoLBertConfig

    def __init__(self, config):
        super().__init__(config)
        self.do_voken_cls = config.do_voken_cls
        self.do_voken_reg = config.do_voken_reg
        self.do_voken_ctr = config.do_voken_ctr
        self.shared_head = config.shared_head
        self.verbose = config.verbose
        if self.verbose:
            print(f"Model: do voken cls -- {self.do_voken_cls}, do_voken_reg -- {self.do_voken_reg},"
                  f" do voken ctr -- {self.do_voken_ctr}")

        self.token_cls_loss_fct = CrossEntropyLoss()

        if self.shared_head:
            if self.verbose:
                print("Model: Using shared head for Voken and Token predictions.")
            self.cls = BertSharedHead(config)
            # Re-init the weights of the new head.
            self.init_weights()
        else:
            # Voken Classification
            if config.do_voken_cls:
                self.visual_cls_head = BertVLMClassificationHead(config)

            # Voken Regression
            if config.do_voken_reg:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_reg_head = BertVLMRegressionHead(config)

            # Voken Contrastive
            if config.do_voken_ctr:
                assert config.voken_dim is not None, "you need to set voken dim in the config."
                self.visual_ctr_head = BertVLMContrastiveHeadNew(config)

        # Build the voken feature embeddings if needed.
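        # The nn.Embedding below acts as a frozen lookup table of pre-extracted
        # visual features (one voken_dim vector per voken): it is preloaded via
        # init_voken_feat_emb() and excluded from gradient updates.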
if self.do_voken_ctr or self.do_voken_reg: # The voken emb will be preloaded by func "init_voken_feat_emb" self.voken_feat_emb = nn.Embedding( config.voken_size, config.voken_dim ) # Freeze this embedding for p in self.voken_feat_emb.parameters(): p.requires_grad = False # Build Loss functions if config.do_voken_cls: # Voken Classification self.voken_cls_loss_fct = CrossEntropyLoss() if config.do_voken_reg: # Voken Regression self.voken_reg_loss_fct = SmoothL1Loss(reduction='none') # self.voken_reg_loss_fct = torch.nn.L1Loss(reduction='none') if config.do_voken_ctr: # Voken Constrastive self.voken_ctr_loss_fct = CrossEntropyLoss() def init_voken_feat_emb(self, feats): if self.verbose: print(f"Model: load the voken features with shape {feats.shape}") print("\tBefore Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean()) assert feats.shape == (self.config.voken_size, self.config.voken_dim) self.voken_feat_emb.weight.data[:] = torch.Tensor(feats) self.original_voken_feats = torch.Tensor(feats).clone() self.original_voken_feats = self.original_voken_feats.half() if self.verbose: print("\tAfter Loading, std and mean are: ", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean()) print("\tThe 1st, 2nd, and last voken feats are: ") print("\t", self.voken_feat_emb.weight[0]) print("\t", self.voken_feat_emb.weight[1]) print("\t", self.voken_feat_emb.weight[-1]) assert not self.voken_feat_emb.weight.requires_grad # print(self.voken_feat_emb.weight.dtype) # assert torch.all(torch.eq(self.voken_feat_emb.weight.cuda(), # self.original_voken_feats)), "The voken feats have been updated during training." def to(self, *args): if self.do_voken_ctr or self.do_voken_reg: self.original_voken_feats = self.original_voken_feats.to(*args) return super().to(*args) def forward( self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, masked_lm_labels=None, encoder_hidden_states=None, encoder_attention_mask=None, lm_labels=None, voken_labels=None, ): outputs = self.bert( input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask, inputs_embeds=inputs_embeds, encoder_hidden_states=encoder_hidden_states, encoder_attention_mask=encoder_attention_mask, ) sequence_output = outputs[0] if not self.shared_head: voken_loss = 0. if self.do_voken_cls: assert voken_labels is not None voken_scores = self.visual_cls_head(sequence_output) voken_cls_loss = self.voken_cls_loss_fct(voken_scores.view(-1, self.config.voken_size), voken_labels.view(-1)) voken_loss += voken_cls_loss if self.do_voken_reg: assert voken_labels is not None voken_prediction = self.visual_reg_head(sequence_output) # Get the mask and pre-trained features voken_label_mask = (voken_labels != -100) # Get a mask of [0, 1, 1, ...., 1, 0], [b, len] safe_voken_labels = voken_labels.clone() safe_voken_labels[~voken_label_mask] = 0 voken_feats = self.voken_feat_emb(safe_voken_labels) # [b, len] --> [b, len, f] # Loss voken_reg_loss = self.voken_reg_loss_fct(voken_prediction, voken_feats) # [b, len, f] # [b, l, f] * ([b,l] --> [b, l, 1]) = [b, l, f] voken_reg_loss = (voken_reg_loss * voken_label_mask.float().unsqueeze(-1)) # [b, l, f] --sum-> [b, l] --mean-> [1,] voken_reg_loss = voken_reg_loss.sum(-1).mean() voken_loss += voken_reg_loss if self.do_voken_ctr: assert torch.all(torch.eq(self.voken_feat_emb.weight, self.original_voken_feats)), "The voken feats have been updated during training." 
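                # Contrastive branch: score each token's hidden state against all
                # frozen voken features; CrossEntropyLoss then treats the index of
                # the ground-truth voken as the target class.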
                voken_scores = self.visual_ctr_head(
                    sequence_output,
                    self.voken_feat_emb.weight
                )
                voken_ctr_loss = self.voken_ctr_loss_fct(
                    voken_scores.view(-1, self.config.voken_size),
                    voken_labels.view(-1)
                )
                voken_loss += voken_ctr_loss

            if masked_lm_labels is not None:
                prediction_scores = self.cls(sequence_output)
                token_loss = self.token_cls_loss_fct(
                    prediction_scores.view(-1, self.config.vocab_size),
                    masked_lm_labels.view(-1))
            else:
                token_loss = torch.tensor(0.)
        else:
            voken_loss, token_loss = self.calculate_shared_loss(
                sequence_output,
                masked_lm_labels,
                voken_labels,
            )

        return voken_loss, token_loss

    def calculate_shared_loss(self, sequence_output, masked_lm_labels, voken_labels):
        if self.do_voken_cls:
            lang_scores, visn_scores = self.cls(sequence_output)
        else:
            lang_scores, visn_scores = self.cls(
                sequence_output,
                voken_feats=self.voken_feat_emb.weight
            )

        assert voken_labels is not None
        voken_loss_func = self.voken_cls_loss_fct if self.do_voken_cls else self.voken_ctr_loss_fct
        voken_loss = voken_loss_func(
            visn_scores.view(-1, self.config.voken_size),
            voken_labels.view(-1)
        )

        if masked_lm_labels is not None:
            token_loss = self.token_cls_loss_fct(
                lang_scores.view(-1, self.config.vocab_size),
                masked_lm_labels.view(-1)
            )
        else:
            token_loss = torch.tensor(0.)

        return voken_loss, token_loss


class SimpleBertForMaskedLM(BertForMaskedLM):
    def __init__(self, config):
        super().__init__(config)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        lm_labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        sequence_output = outputs[0]
        prediction_scores = self.cls(sequence_output)
        loss_fct = CrossEntropyLoss()
        token_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
        return token_loss,


================================================
FILE: vlm/param.py
================================================
import argparse


def process_args():
    parser = argparse.ArgumentParser()

    # Datasets
    parser.add_argument(
        "--train_data_file", default=None, type=str,
        help="The input training data file (a text file).")
    parser.add_argument(
        "--eval_data_file", default=None, type=str,
        help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")

    # Data loader
    parser.add_argument("--col_data", action="store_true",
                        help="Use the CoLDataset object in data.py")
    parser.add_argument("--split_sent", action="store_true",
                        help="Split the data into sentences (passed to CoLDataset in data.py)")
    parser.add_argument("--shuffle", action="store_true",
                        help="Shuffle the training dataset")
    parser.add_argument(
        "--block_size", default=-1, type=int,
        help="Optional input sequence length after tokenization. "
             "The training dataset will be truncated in blocks of this size for training."
"Default to the model max input length for single sentence inputs (take into account special tokens).", ) # Logging and Saving parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") parser.add_argument( "--output_dir", type=str, help="The output directory where the model predictions and checkpoints will be written.",) parser.add_argument( "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory") # Model types parser.add_argument( "--model_type", type=str, help="The model architecture to be trained or fine-tuned.",) parser.add_argument( "--should_continue", action="store_true", help="Whether to continue from latest checkpoint in output_dir") parser.add_argument( "--model_name_or_path", default=None, type=str, help="The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.",) parser.add_argument( "--config_name", default=None, type=str, help="Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.",) parser.add_argument( "--tokenizer_name", default=None, type=str, help="Optional pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.",) parser.add_argument( "--cache_dir", default=None, type=str, help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)",) parser.add_argument( "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets") # MLM tasks parser.add_argument( "--mlm", action="store_true", help="Train with masked-language modeling loss instead of language modeling.") parser.add_argument( "--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss") parser.add_argument( "--mlm_ratio", type=float, default=1., help="The ratio of mlm loss in the total loss.") # VLM related params parser.add_argument("--voken_dir", type=str, default='snap1/coco_hinge05_dim64_resxt101_robertal4/vokens', help='Where the vokens are saved') parser.add_argument("--voken_suffix", type=str, default='vg_nococo.10000', help='The suffix after the voken file, e.g., en.train.raw.{suffix} where suffix==vgcoco.1000') parser.add_argument("--voken_labels", type=str, default='all', help='all: Calculate voken loss for all tokens;' 'mask: Calculate voken loss for masked tokens.' 
'nonmask: Calculate voken loss for non-masked tokens.') parser.add_argument("--voken_feat_dir", type=str, default=None, help='Where the vokens are saved') parser.add_argument("--do_voken_cls", action='store_true', help='Will do voken classification task') parser.add_argument("--do_voken_reg", action='store_true', help='Will do voken regression task (not used in this paper)') parser.add_argument("--do_voken_ctr", action='store_true', help='Will do voken contrastive task (not used in this paper)') parser.add_argument("--shared_head", action='store_true', help='Share the head if more than one tasks (e.g., cls, reg, ctr) are used (not used in this paper)') # Batch Size and Training Steps parser.add_argument("--seed", type=int, default=95, help="random seed for initialization") parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, help="Batch size per GPU/CPU for evaluation.") parser.add_argument("--gradient_accumulation_steps", type=int, default=1, help="Number of updates steps to accumulate before performing a backward/update pass.",) parser.add_argument("--num_train_epochs", default=1.0, type=float, help="Total number of training epochs to perform.") parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) # Optimizer parser.add_argument("--lamb", action="store_true", help='Use the LAMB optimizer in apex') parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") parser.add_argument("--warmup_ratio", default=0., type=float, help="Linear warmup over warmup_steps.") parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") # Distributed Training parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank") parser.add_argument("--nodes", type=int, default=1) parser.add_argument("--nr", type=int, default=0) # Half Precision parser.add_argument( "--fp16", action="store_true", help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",) parser.add_argument( "--fp16_opt_level", type=str, default="O1", help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." "See details at https://nvidia.github.io/apex/amp.html",) # Ablation Study parser.add_argument("--voken_ablation", default=None, help="random, shuffle, reverse, token") args = parser.parse_args() return args ================================================ FILE: vlm/run_glue.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa).""" import argparse import glob import json import logging import os import random import numpy as np import torch from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset from torch.utils.data.distributed import DistributedSampler from tqdm import tqdm, trange from transformers import ( MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, WEIGHTS_NAME, AdamW, AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, glue_compute_metrics as compute_metrics, glue_convert_examples_to_features as convert_examples_to_features, glue_output_modes as output_modes, glue_processors as processors, ) # from transformers import glue_compute_metrics as compute_metrics # from transformers import glue_convert_examples_to_features as convert_examples_to_features # from transformers import glue_output_modes as output_modes # from transformers import glue_processors as processors try: from torch.utils.tensorboard import SummaryWriter except ImportError: from tensorboardX import SummaryWriter logger = logging.getLogger(__name__) #MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()) #MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) #ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),) def set_seed(args): random.seed(args.seed) np.random.seed(args.seed) torch.manual_seed(args.seed) if args.n_gpu > 0: torch.cuda.manual_seed_all(args.seed) def train(args, train_dataset, model, tokenizer): """ Train the model """ # if args.local_rank in [-1, 0]: # tb_writer = SummaryWriter() args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) if args.max_steps > 0: t_total = args.max_steps args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 else: t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs # Prepare optimizer and schedule (linear warmup and decay) no_decay = ["bias", "LayerNorm.weight"] optimizer_grouped_parameters = [ { "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": args.weight_decay, }, {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, ] optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) num_warmup_steps = int(t_total * args.warmup_steps) scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total ) # Check if saved optimizer or scheduler states exist #if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile( #os.path.join(args.model_name_or_path, "scheduler.pt") #): ## Load 
in optimizer and scheduler states #optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt"))) #scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt"))) if args.fp16: try: from apex import amp except ImportError: raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) # multi-gpu training (should be after apex fp16 initialization) if args.n_gpu > 1: model = torch.nn.DataParallel(model) # Distributed training (should be after apex fp16 initialization) if args.local_rank != -1: model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True, ) # Train! logger.info("***** Running training *****") logger.info(" Num examples = %d", len(train_dataset)) logger.info(" Num Epochs = %d", args.num_train_epochs) logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) logger.info( " Total train batch size (w. parallel, distributed & accumulation) = %d", args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1), ) logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) logger.info(" Total optimization steps = %d", t_total) global_step = 0 epochs_trained = 0 steps_trained_in_current_epoch = 0 # Check if continuing training from a checkpoint #if os.path.exists(args.model_name_or_path): # set global_step to global_step of last saved checkpoint from model path #try: #global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0]) #except ValueError: #global_step = 0 #epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) #steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps) #logger.info(" Continuing training from checkpoint, will skip to saved global_step") #logger.info(" Continuing training from epoch %d", epochs_trained) #logger.info(" Continuing training from global step %d", global_step) #logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch) tr_loss, logging_loss = 0.0, 0.0 model.zero_grad() train_iterator = trange( epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0], ) set_seed(args) # Added here for reproductibility for _ in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) for step, batch in enumerate(epoch_iterator): # Skip past any already trained steps if resuming training if steps_trained_in_current_epoch > 0: steps_trained_in_current_epoch -= 1 continue model.train() batch = tuple(t.to(args.device) for t in batch) inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} if args.model_type != "distilbert": inputs["token_type_ids"] = ( batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None ) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids outputs = model(**inputs) loss = outputs[0] # model outputs are always tuple in transformers (see doc) if args.n_gpu > 1: loss = loss.mean() # mean() to average on multi-gpu parallel training if args.gradient_accumulation_steps > 1: loss = loss / args.gradient_accumulation_steps if args.fp16: with amp.scale_loss(loss, optimizer) as scaled_loss: 
scaled_loss.backward() else: loss.backward() tr_loss += loss.item() if (step + 1) % args.gradient_accumulation_steps == 0 or ( # last step in epoch but step is always smaller than gradient_accumulation_steps len(epoch_iterator) <= args.gradient_accumulation_steps and (step + 1) == len(epoch_iterator) ): if args.fp16: torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) else: torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) optimizer.step() scheduler.step() # Update learning rate schedule model.zero_grad() global_step += 1 if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: logs = {} if ( args.local_rank == -1 and args.evaluate_during_training ): # Only evaluate when single GPU otherwise metrics may not average well results = evaluate(args, model, tokenizer) for key, value in results.items(): eval_key = "eval_{}".format(key) logs[eval_key] = value loss_scalar = (tr_loss - logging_loss) / args.logging_steps learning_rate_scalar = scheduler.get_lr()[0] logs["learning_rate"] = learning_rate_scalar logs["loss"] = loss_scalar logging_loss = tr_loss #for key, value in logs.items(): #tb_writer.add_scalar(key, value, global_step) print(json.dumps({**logs, **{"step": global_step}})) if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: # Save model checkpoint output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) if not os.path.exists(output_dir): os.makedirs(output_dir) model_to_save = ( model.module if hasattr(model, "module") else model ) # Take care of distributed/parallel training model_to_save.save_pretrained(output_dir) tokenizer.save_pretrained(output_dir) torch.save(args, os.path.join(output_dir, "training_args.bin")) logger.info("Saving model checkpoint to %s", output_dir) torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt")) torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt")) logger.info("Saving optimizer and scheduler states to %s", output_dir) if args.max_steps > 0 and global_step > args.max_steps: epoch_iterator.close() break if args.max_steps > 0 and global_step > args.max_steps: train_iterator.close() break #if args.local_rank in [-1, 0]: #tb_writer.close() return global_step, tr_loss / global_step def evaluate(args, model, tokenizer, prefix=""): # Loop to handle MNLI double evaluation (matched, mis-matched) eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,) eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,) results = {} for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs): eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True) if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: os.makedirs(eval_output_dir) args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) # Note that DistributedSampler samples randomly eval_sampler = SequentialSampler(eval_dataset) eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) # multi-gpu eval if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel): model = torch.nn.DataParallel(model) # Eval! 
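        # The loop below accumulates logits over the whole dev set, then applies
        # argmax (classification tasks) or squeeze (regression, i.e., STS-B)
        # before computing the GLUE metrics.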
logger.info("***** Running evaluation {} *****".format(prefix)) logger.info(" Num examples = %d", len(eval_dataset)) logger.info(" Batch size = %d", args.eval_batch_size) eval_loss = 0.0 nb_eval_steps = 0 preds = None out_label_ids = None for batch in tqdm(eval_dataloader, desc="Evaluating"): model.eval() batch = tuple(t.to(args.device) for t in batch) with torch.no_grad(): inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} if args.model_type != "distilbert": inputs["token_type_ids"] = ( batch[2] if args.model_type in ["bert", "xlnet", "albert"] else None ) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids outputs = model(**inputs) tmp_eval_loss, logits = outputs[:2] eval_loss += tmp_eval_loss.mean().item() nb_eval_steps += 1 if preds is None: preds = logits.detach().cpu().numpy() out_label_ids = inputs["labels"].detach().cpu().numpy() else: preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0) eval_loss = eval_loss / nb_eval_steps if args.output_mode == "classification": preds = np.argmax(preds, axis=1) elif args.output_mode == "regression": preds = np.squeeze(preds) result = compute_metrics(eval_task, preds, out_label_ids) results.update(result) print(eval_output_dir, prefix) output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt") with open(output_eval_file, "w") as writer: logger.info("***** Eval results {} *****".format(prefix)) for key in sorted(result.keys()): logger.info(" %s = %s", key, str(result[key])) writer.write("%s = %s\n" % (key, str(result[key]))) return results def load_and_cache_examples(args, task, tokenizer, evaluate=False): if args.local_rank not in [-1, 0] and not evaluate: torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache processor = processors[task]() output_mode = output_modes[task] # Load data features from cache or dataset file cached_features_file = os.path.join( args.data_dir, "cached_{}_{}_{}_{}".format( "dev" if evaluate else "train", #list(filter(None, args.model_name_or_path.split("/"))).pop(), args.tokenizer_name, str(args.max_seq_length), str(task), ), ) if os.path.exists(cached_features_file) and not args.overwrite_cache: logger.info("Loading features from cached file %s", cached_features_file) features = torch.load(cached_features_file) else: logger.info("Creating features from dataset file at %s", args.data_dir) label_list = processor.get_labels() if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]: # HACK(label indices are swapped in RoBERTa pretrained model) label_list[1], label_list[2] = label_list[2], label_list[1] examples = ( processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir) ) features = convert_examples_to_features( examples, tokenizer, label_list=label_list, max_length=args.max_seq_length, output_mode=output_mode, # pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet # pad_token=tokenizer.pad_token_id, # pad_token_segment_id=tokenizer.pad_token_type_id, ) if args.local_rank in [-1, 0]: logger.info("Saving features into cached file %s", cached_features_file) torch.save(features, cached_features_file) for i in range(3): print('ids:', features[i].input_ids) print('tokens:', tokenizer.convert_ids_to_tokens(features[i].input_ids)) print('att:', features[i].attention_mask) if args.local_rank 
== 0 and not evaluate: torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache # Convert to Tensors and build dataset all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long) all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long) if output_mode == "classification": all_labels = torch.tensor([f.label for f in features], dtype=torch.long) elif output_mode == "regression": all_labels = torch.tensor([f.label for f in features], dtype=torch.float) dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels) return dataset def main(): parser = argparse.ArgumentParser() # Required parameters parser.add_argument( "--data_dir", default=None, type=str, required=True, help="The input data dir. Should contain the .tsv files (or other data files) for the task.", ) parser.add_argument( "--model_type", default=None, type=str, required=True, #help="Model type selected in the list: " + ", ".join(MODEL_TYPES), ) parser.add_argument( "--model_name_or_path", default=None, type=str, required=True, #help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS), ) parser.add_argument( "--task_name", default=None, type=str, required=True, help="The name of the task to train selected in the list: " + ", ".join(processors.keys()), ) parser.add_argument( "--output_dir", default=None, type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.", ) # Other parameters parser.add_argument( "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name", ) parser.add_argument( "--tokenizer_name", default="", type=str, help="Pretrained tokenizer name or path if not the same as model_name", ) parser.add_argument( "--cache_dir", default="", type=str, help="Where do you want to store the pre-trained models downloaded from s3", ) parser.add_argument( "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer "
        "than this will be truncated, sequences shorter will be padded.",
    )
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument(
        "--evaluate_during_training",
        action="store_true",
        help="Run evaluation during training at each logging step.",
    )
    parser.add_argument(
        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
    )
    parser.add_argument(
        "--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
    )
    parser.add_argument(
        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of update steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps", default=0, type=float, help="Linear warmup over warmup_steps.")
    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X update steps.")
    parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X update steps.")
    parser.add_argument(
        "--eval_all_checkpoints",
        action="store_true",
        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number",
    )
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument("--from_scratch", action="store_true",
                        help="Train from scratch, i.e., re-initialize the model weights instead of loading them")
    parser.add_argument(
        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
    )
    parser.add_argument(
        "--nopooler", action="store_true", help="Do not load the pooler",
    )
    parser.add_argument("--seed", type=int, default=9595, help="random seed for initialization")
    parser.add_argument(
        "--fp16",
        action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
    )
    parser.add_argument(
        "--fp16_opt_level",
        type=str,
        default="O1",
        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html", ) parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank") parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.") parser.add_argument("--server_port", type=str, default="", help="For distant debugging.") args = parser.parse_args() if ( os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir ): raise ValueError( "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format( args.output_dir ) ) # Setup distant debugging if needed if args.server_ip and args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Setup CUDA, GPU & distributed training if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") args.n_gpu = 1 args.device = device # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN, ) logger.warning( "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16, ) # Set seed set_seed(args) # Prepare GLUE task args.task_name = args.task_name.lower() if args.task_name not in processors: raise ValueError("Task not found: %s" % (args.task_name)) processor = processors[args.task_name]() args.output_mode = output_modes[args.task_name] label_list = processor.get_labels() num_labels = len(label_list) # Load pretrained model and tokenizer if args.local_rank not in [-1, 0]: torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab args.model_type = args.model_type.lower() config = AutoConfig.from_pretrained( args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name, cache_dir=args.cache_dir if args.cache_dir else None, ) tokenizer = AutoTokenizer.from_pretrained( args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case, cache_dir=args.cache_dir if args.cache_dir else None, ) model = AutoModelForSequenceClassification.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, cache_dir=args.cache_dir if args.cache_dir else None, ) if args.nopooler: model.bert.pooler.apply(model._init_weights) if args.local_rank == 0: torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab model.to(args.device) logger.info("Training/evaluation parameters %s", args) # Training if args.do_train: train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False) global_step, tr_loss = train(args, train_dataset, model, tokenizer) logger.info(" global_step = %s, 
average loss = %s", global_step, tr_loss) # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): # Create output directory if needed if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: os.makedirs(args.output_dir) logger.info("Saving model checkpoint to %s", args.output_dir) # Save a trained model, configuration and tokenizer using `save_pretrained()`. # They can then be reloaded using `from_pretrained()` model_to_save = ( model.module if hasattr(model, "module") else model ) # Take care of distributed/parallel training model_to_save.save_pretrained(args.output_dir) tokenizer.save_pretrained(args.output_dir) # Good practice: save your training arguments together with the trained model torch.save(args, os.path.join(args.output_dir, "training_args.bin")) # Load a trained model and vocabulary that you have fine-tuned model = AutoModelForSequenceClassification.from_pretrained(args.output_dir) tokenizer = AutoTokenizer.from_pretrained(args.output_dir) model.to(args.device) # Evaluation results = {} if args.do_eval and args.local_rank in [-1, 0]: tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) checkpoints = [args.output_dir] if args.eval_all_checkpoints: checkpoints = list( os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) ) logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging logger.info("Evaluate the following checkpoints: %s", checkpoints) for checkpoint in checkpoints: global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else "" prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else "" prefix = prefix if 'checkpoint' in prefix else '' model = AutoModelForSequenceClassification.from_pretrained(checkpoint) model.to(args.device) result = evaluate(args, model, tokenizer, prefix=prefix) result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) results.update(result) return results if __name__ == "__main__": main() ================================================ FILE: vlm/run_glue_epochs.py ================================================ import argparse import math import os from pathlib import Path from pprint import pprint import subprocess import threading import time import torch parser = argparse.ArgumentParser() parser.add_argument( "--load", default=None, type=str, help="The model loaded, e.g., snap/vlm/wiki103_small" ) parser.add_argument( "--gpus", default=None, type=str, help="The list of GPU ids, separated by comma, e.g., '2,3'" ) parser.add_argument( "--snaps", default=1, type=int, help="The number of snaps evaluated with GLUE benchmark. " "-1 means all." ) parser.add_argument( "--start-from", default=0, type=int ) args = parser.parse_args() if args.gpus is None: # Get all gpus available in this server. num_gpus = torch.cuda.device_count() # The device id are labeled from 0 to num_gpus-1. 
available_gpus = list(range(num_gpus)) else: available_gpus = [int(gpu_id) for gpu_id in args.gpus.split(",")] num_gpus = len(available_gpus) resource = threading.Semaphore(num_gpus) def get_snap_paths(load): load_path = Path(load) paths = [] for dir_path in load_path.iterdir(): if dir_path.name.startswith("checkpoint-"): paths.append(dir_path) return paths def sorted_paths(paths): pathXkey = [] for path in paths: name = path.name identifier = name[len("checkpoint-"):] if identifier == 'last': continue if 'epoch' in identifier: key = identifier else: key = int(identifier) pathXkey.append((path, key)) pathXkey = sorted(pathXkey, key=lambda x: x[1]) paths = list(map(lambda x: x[0], pathXkey)) return paths def get_test_paths(paths, snaps): """ Return $snaps paths to be tested on GLUE """ if snaps == -1: return paths interval = len(paths) * 1. / snaps test_paths = [] for i in range(1, snaps+1): idx = int(math.ceil(interval * i)) - 1 test_paths.append(paths[idx]) return test_paths # Get all paths needs to be processed paths = get_snap_paths(args.load) paths = sorted_paths(paths) paths = paths[args.start_from:] paths = get_test_paths(paths, args.snaps) paths = paths[::-1] # Run the last epochs first. path_lock = threading.Lock() def run_glue(): while True: # Only have one atomic operation (list.pop) here, do not need lock. # A Semaphore is enough to control the resources. resource.acquire() gpu_id = available_gpus.pop(0) # Involve multiple atomic operations (list.__len__, list.pop), # thus introduce a lock here. path_lock.acquire() if len(paths) > 0: path = paths.pop(0) else: path_lock.release() break path_lock.release() model = path.parent ckpt = path.name print(gpu_id, model, ckpt) process = subprocess.Popen( ['bash', 'scripts/run_glue_at_epoch.bash', str(gpu_id), # Use GPU '3', # Number of epochs model, ckpt ], stdout=subprocess.PIPE, stderr=subprocess.PIPE) stdout, stderr = process.communicate() available_gpus.append(gpu_id) resource.release() # Sleep here allows the script (run_glue_at_epoch.bash) to finish # thus all memory in GPU will be cleared. time.sleep(5) return # Allocate #threads which equals to #GPUs threads = [] for _ in range(num_gpus): threads.append( threading.Thread(target=run_glue) ) for thread in threads: thread.start() # Join to the main thread, thus the main thread will wait for all the threads. for thread in threads: thread.join() ================================================ FILE: vlm/run_lm_distributed.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss. 
""" import argparse import glob import json import logging import os import pickle import random import re import shutil import sys from typing import Dict, List, Tuple from datetime import datetime import numpy as np import torch from torch.nn.utils.rnn import pad_sequence import torch.multiprocessing as mp from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler from torch.utils.data.distributed import DistributedSampler from tqdm import tqdm, trange from transformers import ( WEIGHTS_NAME, AdamW, BertConfig, BertForMaskedLM, BertTokenizer, CamembertConfig, CamembertForMaskedLM, CamembertTokenizer, DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, PreTrainedModel, PreTrainedTokenizer, RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, get_linear_schedule_with_warmup, ) sys.path.append( os.path.dirname(os.path.dirname(os.path.abspath(__file__))) ) from vlm.data import CoLDataset from vlm.param import process_args from vlm.model import SimpleBertForMaskedLM try: from torch.utils.tensorboard import SummaryWriter except ImportError: from tensorboardX import SummaryWriter logger = logging.getLogger(__name__) MODEL_CLASSES = { "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), "bert": (BertConfig, SimpleBertForMaskedLM, BertTokenizer), "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer), "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer), "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer), } class TextDataset(Dataset): def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512): assert os.path.isfile(file_path) block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence) directory, filename = os.path.split(file_path) cached_features_file = os.path.join( directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename ) if os.path.exists(cached_features_file) and not args.overwrite_cache: logger.info("Loading features from cached file %s", cached_features_file) with open(cached_features_file, "rb") as handle: self.examples = pickle.load(handle) else: logger.info("Creating features from dataset file at %s", directory) self.examples = [] with open(file_path, encoding="utf-8") as f: text = f.read() tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) for i in range(0, len(tokenized_text) - block_size + 1, block_size): # Truncate in block of block_size self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])) # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) # If your dataset is small, first you should loook for a bigger one :-) and second you # can change this behavior by adding (model specific) padding. 
logger.info("Saving features into cached file %s", cached_features_file) with open(cached_features_file, "wb") as handle: pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) def __len__(self): return len(self.examples) def __getitem__(self, item): return torch.tensor(self.examples[item], dtype=torch.long) class LineByLineTextDataset(Dataset): def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512): assert os.path.isfile(file_path) # Here, we do not cache the features, operating under the assumption # that we will soon use fast multithreaded tokenizers from the # `tokenizers` repo everywhere =) logger.info("Creating features from dataset file at %s", file_path) with open(file_path, encoding="utf-8") as f: lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())] self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"] def __len__(self): return len(self.examples) def __getitem__(self, i): return torch.tensor(self.examples[i], dtype=torch.long) def load_and_cache_examples(args, tokenizer, evaluate=False): file_path = args.eval_data_file if evaluate else args.train_data_file if args.col_data: return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size, split_sent=args.split_sent, verbose=(args.gpu == 0)) elif args.line_by_line: return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size) else: return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size) def set_seed(args): random.seed(args.seed) np.random.seed(args.seed) torch.manual_seed(args.seed) def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]: """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ if tokenizer.mask_token is None: raise ValueError( "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer." 
) labels = inputs.clone() # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) probability_matrix = torch.full(labels.shape, args.mlm_probability) special_tokens_mask = [ tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist() ] probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0) if tokenizer._pad_token is not None: padding_mask = labels.eq(tokenizer.pad_token_id) probability_matrix.masked_fill_(padding_mask, value=0.0) masked_indices = torch.bernoulli(probability_matrix).bool() labels[~masked_indices] = -100 # We only compute loss on masked tokens # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) # 10% of the time, we replace masked input tokens with random word indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) inputs[indices_random] = random_words[indices_random] # The rest of the time (10% of the time) we keep the masked input tokens unchanged return inputs, labels def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]: set_seed(args) # Added here for reproducibility """ Train the model """ if args.gpu == 0: current_time = datetime.now().strftime('%b%d_%H-%M-%S') tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time) args.train_batch_size = args.per_gpu_train_batch_size def collate(examples: List[torch.Tensor]): if tokenizer._pad_token is None: return pad_sequence(examples, batch_first=True) return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id) if args.shuffle: logger.info(f"Shuffle the dataset in training," f"GPU: {args.gpu}," f"Rank: {args.rank}," f"Total: {args.world_size}") train_sampler = DistributedSampler( train_dataset, num_replicas=args.world_size, rank=args.rank, shuffle=args.shuffle, ) train_dataloader = DataLoader( train_dataset, sampler=train_sampler, shuffle=False, num_workers=0, batch_size=args.train_batch_size, collate_fn=collate, pin_memory=True ) if args.max_steps > 0: t_total = args.max_steps args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 else: t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs # Prepare optimizer and schedule (linear warmup and decay) no_decay = ["bias", "LayerNorm.weight"] optimizer_grouped_parameters = [ { "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": args.weight_decay, }, {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, ] optimizer = AdamW(optimizer_grouped_parameters, # betas=(0.9, 0.98), lr=args.learning_rate, eps=args.adam_epsilon) if args.warmup_ratio > 0.: assert args.warmup_steps == 0 args.warmup_steps = int(t_total * args.warmup_ratio) if args.gpu == 0: print("Optimized with lr %f, steps %d, warmup steps %d, and use beta, epsilon %0.8f." 
% ( args.learning_rate, t_total, args.warmup_steps, optimizer.defaults['eps'] ), optimizer.defaults['betas']) scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total ) # Check if saved optimizer or scheduler states exist if ( args.model_name_or_path and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt")) ): # Load in optimizer and scheduler states optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt"))) scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt"))) if args.fp16: try: from apex import amp except ImportError: raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level, verbosity=0) from apex.parallel import DistributedDataParallel as DDP model = DDP(model) else: model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True ) # Train! logger.info("***** Running training *****") logger.info(" Num examples = %d", len(train_dataset)) logger.info(" Num Epochs = %d", args.num_train_epochs) logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) logger.info( " Total train batch size (w. distributed & accumulation) = %d", args.train_batch_size * args.gradient_accumulation_steps * args.world_size ) logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) logger.info(" Total optimization steps = %d", t_total) global_step = 0 epochs_trained = 0 # Check if continuing training from a checkpoint # if args.model_name_or_path and os.path.exists(args.model_name_or_path): # try: # # set global_step to gobal_step of last saved checkpoint from model path # checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0] # epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) # steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps) # logger.info(" Continuing training from checkpoint, will skip to saved global_step") # logger.info(" Continuing training from epoch %d", epochs_trained) # except ValueError: # logger.info(" Do not load model from %s, restart training" % args.model_name_or_path) # model_to_resize = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training # model_to_resize.resize_token_embeddings(len(tokenizer)) model.zero_grad() train_iterator = trange( epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0 ) for epoch in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0) tr_loss, logging_loss = 0.0, 0.0 model.zero_grad() # Support of accumulating gradients for step, batch in enumerate(epoch_iterator): inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch) inputs = inputs.to(args.device) labels = labels.to(args.device) # If some of the input is padded, then the attention mask is needed attention_mask = (inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0 if attention_mask.all(): attention_mask = None if epoch == 0 and step < 3 and args.gpu == 0: print(inputs.shape) print(inputs[0]) print(tokenizer.convert_ids_to_tokens(inputs[0].cpu().numpy())) print(labels[0]) print(attention_mask) 
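            # Forward/backward pass. With --mlm, label positions set to -100 are ignored by
            # the cross-entropy loss, so only the (~15%) masked tokens contribute to the loss.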
model.train() outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels) loss = outputs[0] # model outputs are always tuple in transformers (see doc) if args.gradient_accumulation_steps > 1: loss = loss / args.gradient_accumulation_steps if args.fp16: with amp.scale_loss(loss, optimizer) as scaled_loss: scaled_loss.backward() else: loss.backward() tr_loss += loss.item() if (step + 1) % args.gradient_accumulation_steps == 0: if args.max_grad_norm > 0.: if args.fp16: total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) else: total_norm =torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) optimizer.step() scheduler.step() # Update learning rate schedule model.zero_grad() global_step += 1 if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0: # Log metrics tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step) if args.fp16: try: from apex.amp import _amp_state tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step) tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step) except ImportError: logger.warning("Cannot import apex.amp._amp_state, " "would not state the loss_scale in the log") if args.max_grad_norm > 0.: # Only clip the grad when it is valid tb_writer.add_scalar("grad_norm", total_norm, global_step) tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step) logging_loss = tr_loss if args.max_steps > 0 and global_step >= args.max_steps: break # Save it each epoch if args.gpu == 0: # Save checkpoints checkpoint_name = "checkpoint-epoch%04d" % epoch save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler) last_path = os.path.join(args.output_dir, 'checkpoint-last') # if os.path.exists(last_path): # print(last_path) # os.remove(last_path) # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path) # Evaluate the model logger.info(" Training loss of Epoch %d: %0.4f" % (epoch, tr_loss / step)) logger.info(" Evaluation Results of Epoch %d: " % epoch) results = evaluate(args, model, tokenizer) for key, value in results.items(): tb_writer.add_scalar("eval_{}".format(key), value, global_step) logger.info("\t %s: %0.4f" % (key, value)) output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json") json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4) if args.max_steps > 0 and global_step >= args.max_steps: epoch_iterator.close() train_iterator.close() break if args.gpu == 0: tb_writer.close() def save_model(args, name, model, tokenizer, optimizer, scheduler): # Save model checkpoint output_dir = os.path.join(args.output_dir, name) os.makedirs(output_dir, exist_ok=True) model_to_save = ( model.module if hasattr(model, "module") else model ) # Take care of distributed/parallel training model_to_save.save_pretrained(output_dir) tokenizer.save_pretrained(output_dir) torch.save(args, os.path.join(output_dir, "training_args.bin")) logger.info("Saving model checkpoint to %s", output_dir) # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt")) # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt")) # logger.info("Saving optimizer and scheduler states to %s", output_dir) def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict: # Loop to handle MNLI double evaluation (matched, mis-matched) eval_dataset = 
load_and_cache_examples(args, tokenizer, evaluate=True) args.eval_batch_size = args.per_gpu_eval_batch_size # Note that DistributedSampler samples randomly def collate(examples: List[torch.Tensor]): if tokenizer._pad_token is None: return pad_sequence(examples, batch_first=True) return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id) eval_sampler = SequentialSampler(eval_dataset) eval_dataloader = DataLoader( eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate ) # Eval! logger.info("***** Running evaluation {} *****".format(prefix)) logger.info(" Num examples = %d", len(eval_dataset)) logger.info(" Batch size = %d", args.eval_batch_size) eval_loss = 0.0 nb_eval_steps = 0 model.eval() for batch in tqdm(eval_dataloader, desc="Evaluating"): inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch) inputs = inputs.to(args.device) labels = labels.to(args.device) # If some of the input is padded, then the attention mask is needed attention_mask = (inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0 if attention_mask.all(): attention_mask = None with torch.no_grad(): outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels) lm_loss = outputs[0] eval_loss += lm_loss.mean().item() nb_eval_steps += 1 eval_loss = eval_loss / nb_eval_steps perplexity = torch.exp(torch.tensor(eval_loss)).item() result = {"perplexity": perplexity} return result def is_port_in_use(port): import socket with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: return s.connect_ex(('localhost', port)) == 0 def main(): args = process_args() os.environ['MASTER_ADDR'] = '127.0.0.1' port = 9595 while is_port_in_use(port): port += 1 print("Use port", port) os.environ['MASTER_PORT'] = str(port) # Using all available gpus for multi-processing distributed args.gpus = torch.cuda.device_count() print("Use gpus ", list(range(args.gpus))) args.world_size = args.gpus * args.nodes mp.spawn(setup, nprocs=args.gpus, args=(args,)) def setup(gpu, args): if args.should_continue: args.model_name_or_path = 'checkpoint-last' # Setup CUDA, GPU & distributed training torch.cuda.set_device(gpu) device = torch.device("cuda", gpu) args.gpu = gpu # Local device id. args.device = device # Local device object. args.rank = args.nr * args.gpus + gpu # The gpu id in the world. torch.distributed.init_process_group( backend="nccl", init_method='env://', world_size=args.world_size, rank=args.rank ) # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO if args.gpu == 0 else logging.WARN, ) logger.warning( "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s", args.gpu, args.gpus, args.fp16, ) # Set seed set_seed(args) # Load pretrained model and token # Barrier to make sure only the first process in distributed training # download model & vocabizer if gpu != 0: torch.distributed.barrier() config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] # Get Config if args.config_name: config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir) elif args.model_name_or_path: config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir) else: raise ValueError( "Why do you want the default config?? 
Please use --config_name or --model_name_or_path" ) # Get Tokenizer if args.tokenizer_name: tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir) # BERT always needs lower cased tokens. assert tokenizer.init_kwargs.get("do_lower_case", False) elif args.model_name_or_path: tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir) else: raise ValueError( "You are instantiating a new {} tokenizer. This is not supported, " "but you can do it from another script, save it," "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__) ) assert args.block_size <= tokenizer.max_len if args.model_name_or_path: model = model_class.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, cache_dir=args.cache_dir, ) else: logger.info("Training new model from scratch") model = model_class(config=config) model.to(args.device) # End of barrier to make sure only the first process waiting other processes if gpu == 0: torch.distributed.barrier() logger.info("Training/evaluation parameters %s", args) # Training if args.do_train: # Barrier to make sure only the first process in distributed training process the dataset, # and the others will use the cache if gpu != 0: torch.distributed.barrier() train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False) if gpu == 0: torch.distributed.barrier() train(args, train_dataset, model, tokenizer) # Evaluation if args.do_eval and gpu == 0: result = evaluate(args, model, tokenizer) if __name__ == "__main__": main() ================================================ FILE: vlm/run_vlm_distributed.py ================================================ # coding=utf-8 # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss. 
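This variant additionally supervises each token with its voken (visual-token) label; the MLM term is weighted by --mlm_ratio.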
""" from datetime import datetime import json import logging import os import random import sys import time from typing import Dict, List, Tuple import numpy as np import torch from torch.nn.utils.rnn import pad_sequence import torch.multiprocessing as mp from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler from torch.utils.data.distributed import DistributedSampler from tqdm import tqdm, trange from transformers import ( MODEL_WITH_LM_HEAD_MAPPING, WEIGHTS_NAME, AdamW, AutoConfig, AutoModelWithLMHead, AutoTokenizer, BertConfig, BertForMaskedLM, BertTokenizer, CamembertConfig, CamembertForMaskedLM, CamembertTokenizer, DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, PreTrainedModel, PreTrainedTokenizer, RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, get_linear_schedule_with_warmup, ) sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) from vlm.data import CoLDataset, get_voken_feats from vlm.param import process_args from vlm.model import CoLBertConfig, CoLwithBert try: from torch.utils.tensorboard import SummaryWriter except ImportError: from tensorboardX import SummaryWriter logger = logging.getLogger(__name__) MODEL_CLASSES = { "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), "bert": (CoLBertConfig, CoLwithBert, BertTokenizer), "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer), "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer), "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer), } def load_and_cache_examples(args, tokenizer, evaluate=False): file_path = args.eval_data_file if evaluate else args.train_data_file return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size, split_sent=args.split_sent, voken_dir=args.voken_dir, suffix=args.voken_suffix, verbose=(args.gpu == 0), voken_ablation=args.voken_ablation) def set_seed(args): random.seed(args.seed) np.random.seed(args.seed) torch.manual_seed(args.seed) def mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: PreTrainedTokenizer, args) \ -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: """ Notice that this function would have a side affect of manipulating the Tensor tokens. Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ if tokenizer.mask_token is None: raise ValueError( "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer." 
) labels = tokens.clone() # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) probability_matrix = torch.full(labels.shape, args.mlm_probability) special_tokens_mask = [ tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist() ] probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0) if tokenizer._pad_token is not None: padding_mask = labels.eq(tokenizer.pad_token_id) probability_matrix.masked_fill_(padding_mask, value=0.0) masked_indices = torch.bernoulli(probability_matrix).bool() labels[~masked_indices] = -100 # We only compute loss on masked tokens if args.voken_labels == 'mask': vokens[~masked_indices] = -100 elif args.voken_labels == 'nonmask': vokens[masked_indices] = -100 elif args.voken_labels == 'all': pass else: assert "Do not support the voken loss of type %s" % args.voken_labels # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices tokens[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) # 10% of the time, we replace masked input tokens with random word indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) tokens[indices_random] = random_words[indices_random] # The rest of the time (10% of the time) we keep the masked input tokens unchanged return tokens, labels, vokens def train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]: set_seed(args) # Added here for reproducibility """ Train the model """ if args.gpu == 0: current_time = datetime.now().strftime('%b%d_%H-%M-%S') tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time) args.train_batch_size = args.per_gpu_train_batch_size def col_collate(examples): tokens, vokens = zip(*examples) if tokenizer._pad_token is None: tokens = pad_sequence(tokens, batch_first=True) else: tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id) vokens = pad_sequence(vokens, batch_first=True, padding_value=-100) return tokens, vokens if args.shuffle: logger.info(f"Shuffle the dataset in training," f"GPU: {args.gpu}," f"Rank: {args.rank}," f"Total: {args.world_size}") train_sampler = DistributedSampler( train_dataset, num_replicas=args.world_size, rank=args.rank, shuffle=args.shuffle, ) train_dataloader = DataLoader( train_dataset, sampler=train_sampler, shuffle=False, num_workers=0, batch_size=args.train_batch_size, collate_fn=col_collate, pin_memory=True ) if args.max_steps > 0: t_total = args.max_steps args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 # args.num_train_epochs = 9595 else: t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs # Prepare optimizer and schedule (linear warmup and decay) if args.lamb: no_decay = ['bias', 'gamma', 'beta', 'LayerNorm'] else: no_decay = ["bias", "LayerNorm.weight"] optimizer_grouped_parameters = [ { "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": args.weight_decay, }, { "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0, }, ] if 
args.lamb: logger.info(f"Using LAMB Optimizer with max grad norm {args.max_grad_norm}") import apex optimizer = apex.optimizers.FusedLAMB( optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon, max_grad_norm=args.max_grad_norm ) else: optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, #betas=(0.9, 0.98), eps=args.adam_epsilon) if args.gpu == 0: print(f"Optimized with lr: {optimizer.defaults['lr']}, total steps: {t_total}," f" warmup steps: {args.warmup_steps}, epsilon {optimizer.defaults['eps']}," f" beta: {optimizer.defaults['betas']}, weight decay {args.weight_decay}.") scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total ) # Check if saved optimizer or scheduler states exist if ( args.model_name_or_path and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt")) ): # Load in optimizer and scheduler states optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt"))) scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt"))) if args.fp16: try: from apex import amp except ImportError: raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) from apex.parallel import DistributedDataParallel as DDP model = DDP(model) else: model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True ) # Allow not calculating the lm heads. if args.mlm_ratio == 0.: model.lm_head = None # Train! logger.info("***** Running training *****") logger.info(" Num examples = %d", len(train_dataset)) logger.info(" Num Epochs = %d", args.num_train_epochs) logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) logger.info( " Total train batch size (w. 
distributed & accumulation) = %d", args.train_batch_size * args.gradient_accumulation_steps * args.world_size ) logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) logger.info(" Total optimization steps = %d", t_total) global_step = 0 epochs_trained = 0 # Check if continuing training from a checkpoint # if args.model_name_or_path and os.path.exists(args.model_name_or_path): # try: # # set global_step to gobal_step of last saved checkpoint from model path # checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0] # epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) # steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps) # logger.info(" Continuing training from checkpoint, will skip to saved global_step") # logger.info(" Continuing training from epoch %d", epochs_trained) # except ValueError: # logger.info(" Do not load model from %s, restart training" % args.model_name_or_path) model_to_resize = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training assert model_to_resize.config.vocab_size == len(tokenizer) # model_to_resize.resize_token_embeddings(len(tokenizer)) model.zero_grad() train_iterator = trange( epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.gpu != 0 ) set_seed(args) # Added here for reproducibility LOSS_NAMES = ['token_loss', 'voken_loss', 'total_loss'] for epoch in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.gpu != 0) tr_loss, logging_loss = np.zeros(len(LOSS_NAMES)), 0.0 model.zero_grad() for step, (tokens, vokens) in enumerate(epoch_iterator): token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args) token_inputs = token_inputs.to(args.device) token_labels = token_labels.to(args.device) if args.mlm_ratio != 0. 
else None voken_labels = voken_labels.to(args.device) # If some of the input is padded, then the attention mask is needed attention_mask = (token_inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0 if attention_mask.all(): attention_mask = None if epoch == 0 and step < 3 and args.gpu == 0: print() print("Token inputs:", token_inputs.shape, token_inputs[0]) print("Token inputs (in str): ", tokenizer.convert_ids_to_tokens(token_inputs[0].cpu().numpy())) print("Attention Mask:", attention_mask) print("Token Labels: ", token_labels[0] if token_labels is not None else token_labels) print("Token Labels (in str): ", tokenizer.convert_ids_to_tokens(token_labels[0].cpu().numpy()) if token_labels is not None else token_labels) print("Voken Labels: ", voken_labels[0]) print() model.train() outputs = model(token_inputs, attention_mask=attention_mask, masked_lm_labels=token_labels, voken_labels=voken_labels) voken_loss = outputs[0] token_loss = outputs[1] if args.mlm_ratio == 0.: loss = voken_loss else: loss = voken_loss + args.mlm_ratio * token_loss if args.gradient_accumulation_steps > 1: loss = loss / args.gradient_accumulation_steps if args.fp16: with amp.scale_loss(loss, optimizer) as scaled_loss: scaled_loss.backward() else: loss.backward() # print(f"GPU: {args.gpu}, Global Step: {global_step + 1}, " # f"Step: {step}, " # f"Range: {train_dataset.get_item_info(step * args.world_size + args.gpu)}, " # f"Loss: {loss.item()}, " # f"Scaled Loss: {scaled_loss.item()}") tr_loss += np.array((token_loss.item() / args.gradient_accumulation_steps, voken_loss.item() / args.gradient_accumulation_steps, loss.item())) if (step + 1) % args.gradient_accumulation_steps == 0: if args.max_grad_norm > 0. and not args.lamb: # Only clip the grad when it is valid and not using LAMB optimizer, # because the LAMB optimizer already apply grad clipping if args.fp16: total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) else: total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) elif args.max_grad_norm <= 0. and step <= args.gradient_accumulation_steps: logger.warning("Have not clipped the gradient because " "the max_grad_norm is set to %0.2f" % args.max_grad_norm) optimizer.step() scheduler.step() # Update learning rate schedule model.zero_grad() global_step += 1 if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0: # Log metrics tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step) if args.fp16: try: from apex.amp import _amp_state tb_writer.add_scalar("loss_scale", _amp_state.loss_scalers[0]._loss_scale, global_step) tb_writer.add_scalar("scaled_loss", scaled_loss.item(), global_step) except ImportError: logger.warning("Cannot import apex.amp._amp_state, " "would not state the loss_scale in the log") if args.max_grad_norm > 0. 
and not args.lamb: # Only clip the grad when it is valid tb_writer.add_scalar("grad_norm", total_norm, global_step) interval_loss = (tr_loss - logging_loss) / args.logging_steps for loss_idx, loss_name in enumerate(LOSS_NAMES): tb_writer.add_scalar(loss_name, interval_loss[loss_idx], global_step) logging_loss = tr_loss.copy() if args.max_steps > 0 and global_step >= args.max_steps: break # if step == 200: # break # # Save it each epoch if args.gpu == 0: # Save checkpoints checkpoint_name = "checkpoint-epoch%04d" % epoch save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler) # last_path = os.path.join(args.output_dir, 'checkpoint-last') # if os.path.exists(last_path): # os.remove(last_path) # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path) # Evaluate the model for loss_idx, loss_name in enumerate(LOSS_NAMES): logger.info(" Training %s of Epoch %d: %0.4f" % ( loss_name, epoch, tr_loss[loss_idx] / len(train_dataloader))) if args.do_eval: logger.info(" Evaluation Results of Epoch %d: " % epoch) old_eval_batch_size = args.per_gpu_eval_batch_size while args.per_gpu_eval_batch_size > 0: try: results = evaluate(args, valid_dataset, model, tokenizer) break except RuntimeError as e: args.per_gpu_eval_batch_size = int(args.per_gpu_eval_batch_size / 2) print("HALVE THE BATCH SIZE in EVAL.") if args.per_gpu_eval_batch_size == 0: raise e time.sleep(5) args.per_gpu_eval_batch_size = old_eval_batch_size for key, value in results.items(): tb_writer.add_scalar("eval_{}".format(key), value, global_step) logger.info("\t %s: %0.4f" % (key, value)) tb_writer.add_scalar("epoch", epoch, global_step) output_eval_file = os.path.join(args.output_dir, checkpoint_name, "eval_results.json") json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4) # Currently, only GPU 0 is responsible for the evaluation. 
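            # The commented-out barriers below would make the other ranks wait for this
            # evaluation; as written, the non-zero ranks simply continue to the next epoch.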
# torch.cuda.empty_cache() # torch.distributed.barrier() else: pass # torch.cuda.empty_cache() # torch.distributed.barrier() if args.max_steps > 0 and global_step >= args.max_steps: epoch_iterator.close() train_iterator.close() break if args.gpu == 0: tb_writer.close() def save_model(args, name, model, tokenizer, optimizer, scheduler): # Save model checkpoint output_dir = os.path.join(args.output_dir, name) os.makedirs(output_dir, exist_ok=True) model_to_save = ( model.module if hasattr(model, "module") else model ) # Take care of distributed/parallel training model_to_save.save_pretrained(output_dir) tokenizer.save_pretrained(output_dir) torch.save(args, os.path.join(output_dir, "training_args.bin")) logger.info("Saving model checkpoint to %s", output_dir) # torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt")) # torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt")) # logger.info("Saving optimizer and scheduler states to %s", output_dir) def evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict: torch.cuda.empty_cache() # # Loop to handle MNLI double evaluation (matched, mis-matched) # eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True) args.eval_batch_size = args.per_gpu_eval_batch_size # Note that DistributedSampler samples randomly def col_collate(examples): tokens, vokens = zip(*examples) if tokenizer._pad_token is None: tokens = pad_sequence(tokens, batch_first=True) else: tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id) vokens = pad_sequence(vokens, batch_first=True, padding_value=-100) return tokens, vokens eval_sampler = SequentialSampler(eval_dataset) eval_dataloader = DataLoader( eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=col_collate ) # Eval! 
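    # Reports the masked-token perplexity, exp(mean token loss), together with the
    # average voken loss over the validation set.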
logger.info("***** Running evaluation {} *****".format(prefix)) logger.info(" Num examples = %d", len(eval_dataset)) logger.info(" Batch size = %d", args.eval_batch_size) total_token_loss = 0.0 total_voken_loss = 0.0 nb_eval_steps = 0 model.eval() for tokens, vokens in tqdm(eval_dataloader, desc="Evaluating"): token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args) token_inputs = token_inputs.to(args.device) token_labels = token_labels.to(args.device) if args.mlm_ratio != 0 else None voken_labels = voken_labels.to(args.device) # If some of the input is padded, then the attention mask is needed attention_mask = (token_inputs != tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0 if attention_mask.all(): attention_mask = None with torch.no_grad(): outputs = model(token_inputs, attention_mask=attention_mask, masked_lm_labels=token_labels, voken_labels=voken_labels) voken_loss = outputs[0] token_loss = outputs[1] total_voken_loss += voken_loss.item() total_token_loss += token_loss.item() nb_eval_steps += 1 total_token_loss = total_token_loss / nb_eval_steps perplexity = torch.exp(torch.tensor(total_token_loss)).item() result = {"perplexity": perplexity, "voken_loss": total_voken_loss / nb_eval_steps} torch.cuda.empty_cache() return result def is_port_in_use(port): import socket with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: return s.connect_ex(('localhost', port)) == 0 def main(): args = process_args() os.environ['MASTER_ADDR'] = '127.0.0.1' port = 9595 while is_port_in_use(port): port += 1 print("Use port", port) os.environ['MASTER_PORT'] = str(port) # Using all available gpus for multi-processing distributed args.gpus = torch.cuda.device_count() print("Use gpus ", list(range(args.gpus))) args.world_size = args.gpus * args.nodes mp.spawn(setup, nprocs=args.gpus, args=(args,)) def setup(gpu, args): if args.should_continue: args.model_name_or_path = 'checkpoint-last' # Setup CUDA, GPU & distributed training torch.cuda.set_device(gpu) device = torch.device("cuda", gpu) args.gpu = gpu # Local device id. args.device = device # Local device object. args.rank = args.nr * args.gpus + gpu # The gpu id in the world. torch.distributed.init_process_group( backend="nccl", init_method='env://', world_size=args.world_size, rank=args.rank ) # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO if args.gpu == 0 else logging.WARN, ) logger.warning( "Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s", args.gpu, args.gpus, args.fp16, ) # Set seed set_seed(args) # Load pretrained model and token # Barrier to make sure only the first process in distributed training # download model & vocabizer if gpu != 0: torch.distributed.barrier() # Use self-defined models, thus avoiding Auto***. config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] # Next, we will initialize the training process in the following order: # 1. tokenizer --> 2. dataset --> 3. config --> 4. model. # because A) dataset relies on the tokenizer.special_tokens. # B) config relies on the dataset.voken_size. # Get Tokenizer if args.tokenizer_name: tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir) elif args.model_name_or_path: tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir) else: raise ValueError( "You are instantiating a new {} tokenizer. 
This is not supported, " "but you can do it from another script, save it," "and load it from here, using --tokenizer_name".format(tokenizer_class.__name__) ) assert args.block_size <= tokenizer.max_len # Barrier to make sure only the first process in distributed training process the dataset, # and the others will use the cache if gpu != 0: torch.distributed.barrier() train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False) valid_dataset = load_and_cache_examples(args, tokenizer, evaluate=True) if gpu == 0: torch.distributed.barrier() # Assert the vokens are equal in valid and eval. valid_dataset.assert_equal_vokens(train_dataset) config_kwargs = {} if args.do_voken_reg or args.do_voken_ctr: assert args.voken_feat_dir is not None voken_feats = get_voken_feats(train_dataset, args.voken_feat_dir) config_kwargs['voken_dim'] = len(voken_feats[0]) if gpu == 0: logger.info(f"Load voken feats from {args.voken_feat_dir}" f"with {len(voken_feats)} features and dimension {len(voken_feats[0])}") # Get Config if args.config_name: config = config_class.from_pretrained( args.config_name, cache_dir=args.cache_dir, voken_size=train_dataset.voken_size, do_voken_cls=args.do_voken_cls, do_voken_reg=args.do_voken_reg, do_voken_ctr=args.do_voken_ctr, shared_head=args.shared_head, verbose=(args.gpu == 0), **config_kwargs ) elif args.model_name_or_path: config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir) else: raise ValueError( "Why do you want the default config?? Please use --config_name or --model_name_or_path" ) if args.model_name_or_path: logger.info(f"Training model from the weight {args.model_name_or_path}.") model = model_class.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, cache_dir=args.cache_dir, ) else: logger.info("Training new model from scratch") model = model_class(config=config) if args.do_voken_reg or args.do_voken_ctr: voken_feats = torch.tensor(voken_feats) model.init_voken_feat_emb(voken_feats) model.to(args.device) # End of barrier to make sure only the first process waiting other processes if gpu == 0: torch.distributed.barrier() if args.model_name_or_path: if gpu == 0: logger.info("Evaluate the performance of the loaded model.") results = evaluate(args, valid_dataset, model, tokenizer) for key, value in results.items(): logger.info("\t %s: %0.4f" % (key, value)) torch.distributed.barrier() else: torch.distributed.barrier() logger.info("Training/evaluation parameters %s", args) # Training if args.do_train: train(args, train_dataset, valid_dataset, model, tokenizer) # Evaluation if args.do_eval and gpu == 0: results = evaluate(args, valid_dataset, model, tokenizer) for key, value in results.items(): logger.info("\t %s: %0.4f" % (key, value)) if __name__ == "__main__": main() ================================================ FILE: vlm/show_glue_results_epochs.py ================================================ import os from pathlib import Path root = Path( 'snap' ) task2major = { 'QQP': 'acc_and_f1', 'STS-B': 'corr', 'MRPC': 'acc_and_f1', } # The tasks sorted by the amount of data all_tasks = [ # 'WNLI', 'RTE', 'MRPC', 'STS-B', 'CoLA', 'SST-2', 'QNLI', 'QQP', 'MNLI', 'MNLI-MM', ] def print_result(glue_dir): print(glue_dir) results = {} for task in glue_dir.iterdir(): if task.is_dir(): eval_fpath = task / 'eval_results.txt' task_name = task.name if eval_fpath.exists(): with eval_fpath.open() as f: for line in f: metric, value = line.split('=') metric = metric.strip() value = 
float(value.strip()) if task_name in task2major: if metric == task2major[task_name]: results[task_name] = value else: results[task_name] = value if len(results) > 0: # sorted_keys = sorted(list(results.keys())) # for key in sorted_keys: # print("%8s" % key, end='') # print("%8s" % 'GLUE', end='') # print() # for key in sorted_keys: # print("%8.2f" % (results[key] * 100.), end='') # print("%8.2f" % (sum(results.values()) * 100. / len(results)), end='') # print() for task in all_tasks: print("%8s" % task, end='') print("%8s" % 'GLUE', end='') print() for task in all_tasks: if task in results: result = results[task] print("%8.2f" % (result * 100), end='') else: print(" " * 8, end='') mean = lambda x: sum(x) / max(len(x), 1) avg_result = mean([value for key, value in results.items() if key in all_tasks]) print("%8.2f" % (avg_result * 100.), end='') print() def search(path): def sorted_key(path): try: return path.stat().st_mtime except Exception: return 0. path_list = sorted( path.iterdir(), key=sorted_key # x.name ) for subdir in path_list: if subdir.is_dir(): if 'glueepoch_' in subdir.name: print_result(subdir) else: search(subdir) search(root) ================================================ FILE: vokenization/__init__.py ================================================ ================================================ FILE: vokenization/common.py ================================================ import os # Name of image sets IMAGE_SETS = [ 'coco_train', 'coco_nominival', 'coco_minival', 'vg_nococo', 'cc_train', 'cc_valid', ] # Root of each dataset # CC_ROOT, COCO_ROOT, VG_ROOT should contain the `images` folder # CC_ROOT -- images # |-- training # |-- training_00009486 # Jpeg files but does not have the extension. # |-- .... # |-- validation # |-- validation_00009486 # |-- ... # CC_ROOT = os.getenv('CC_ROOT', 'data/cc') # COCO_ROOT = os.getenv('COCO_ROOT', 'data/mscoco') # VG_ROOT = os.getenv('VG_ROOT', 'data/vg') # LXRT_ROOT = os.getenv('LXRT_ROOT', 'data/lxrt') CC_ROOT = 'data/cc' COCO_ROOT = 'data/mscoco' VG_ROOT = 'data/vg' LXRT_ROOT = 'data/lxmert' # THe local directory to save essential image infos # (e.g., image ids for the vokenizer, image paths in this server) # LOCAL_DIR # |- images # |- coco_train_ids.txt # |- coco_train_paths.txt # |- cc_train_ids.txt # |- cc_train_paths.txt # |- .............. # Running create_image_ids.py will build *_ids.txt # Running extract_vision_keys.py will build *_paths.txt LOCAL_DIR = 'data/vokenization' ================================================ FILE: vokenization/create_image_ids.py ================================================ import json import os from pathlib import Path import sys # sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) import common imgset2lxrtfname = { 'coco_train': 'mscoco_train.json', 'coco_nominival': 'mscoco_nominival.json', 'coco_minival': 'mscoco_minival.json', 'vg_nococo': 'vgnococo.json', } imgset2ccfname = { 'cc_train': 'training.tsv', 'cc_valid': 'validation.tsv' } def write_ids(img_set, img_ids): """ Write the indexed image ids 'img_ids' for image set 'img_set' to the local file. """ info_dir = os.path.join(common.LOCAL_DIR, 'images') os.makedirs(info_dir, exist_ok=True) print("Write %d image ids for image set %s to %s." % ( len(img_ids), img_set, os.path.join(info_dir, img_set + '.ids'))) ids_path = os.path.join(info_dir, img_set + '.ids') if os.path.exists(ids_path): # If there is an existing ids_path, make sure that they are the same. 
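        # The id order fixes the row order of the feature keys later extracted by
        # extract_vision_keys.py, so an existing ids file must match exactly.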
print(f"Already exist the image ids for image set {img_set} at path {ids_path}.") print("Now, we want to make sure that they are equal:") with open(ids_path, 'r') as f: exist_img_ids = list(map(lambda x: x.strip(), f.readlines())) success = True for i, (exist_img_id, img_id) in zip(exist_img_ids): if exist_img_id != img_id: print(f"The image id at line {i} is different:") print(f"\tIn the file: {exist_img_id}, In this script: {img_id}") success = False if success: print("PASS!") else: with open(ids_path, 'w') as f: for img_id in img_ids: f.write(img_id + '\n') for img_set in common.IMAGE_SETS: if img_set in imgset2lxrtfname: lxrt_path = Path(common.LXRT_ROOT) img_ids = [] fname = imgset2lxrtfname[img_set] for datum in json.load((lxrt_path / fname).open()): img_id = datum['img_id'] img_ids.append(img_id) write_ids(img_set, img_ids) if img_set in imgset2ccfname: cc_path = Path(common.CC_ROOT) img_ids = [] fname = imgset2ccfname[img_set] if not (cc_path / fname).exists(): print("No such file", cc_path / fname) continue for i, line in enumerate((cc_path / fname).open()): sent, img_id = line.split('\t') img_ids.append(img_id.strip()) write_ids(img_set, img_ids) ================================================ FILE: vokenization/evaluate_diversity.py ================================================ import argparse from collections import defaultdict import json import os import sys import numpy as np import tqdm from vokenization import Vokenizer, load_model_and_tokenizer import common imgset2fname = { 'coco_train': 'mscoco_train.json', 'coco_nominival': 'mscoco_nominival.json', 'coco_minival': 'mscoco_minival.json', 'vg_nococo': 'vgnococo.json', 'cc_train': 'training.tsv', 'cc_valid': 'validation.tsv', } tokenizer_name = 'bert-base-uncased' def load_lang_data(corpus_name, topk=10000): """ Load {topk} sentences from the corpus named by {corpus_name}. """ fpath = corpus_name + '.' + tokenizer_name tokens = [] with open(fpath) as f: for i, line in enumerate(f): tokens.append(list(map(int, line.split(' ')))) if (i + 1) == topk: break print("Read %d sentences from the corpus %s located at %s." 
% ( len(tokens), corpus_name, fpath )) return tokens def load_cc_data(img_set): fname = os.path.join(common.CC_ROOT, imgset2fname[img_set]) sents = [] with open(fname) as f: for line in f: sent, _ = line.split('\t') sents.append(sent) print("Load the %d sentences for image set %s from %s" % ( len(sents), img_set, fname)) return sents def load_lxrt_data(img_set): fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set]) sents = [] with open(fname) as f: data = json.load(f) for datum in data: sents.extend(datum['sentf']['mscoco']) print("Load the %d sentences for image set %s from %s" % ( len(sents), img_set, fname)) return sents def analyze(token2info): """ :param token2info: token2info: token --> (img_id --> cnt) :return: """ names = ['Num Images', 'Max Cnt', 'Avg Cnt', 'Std Cnt'] results = np.zeros(4) num_tokens = 0 for token in token2info: img2cnt = token2info[token] cnts = np.array(list(img2cnt.values())) num_imgs = len(cnts) max_cnt = cnts.max() avg_cnt = cnts.mean() std_cnt = cnts.std() results += (num_imgs, max_cnt, avg_cnt, std_cnt) num_tokens += 1 print("With %d tokens, " % num_tokens) results /= num_tokens for name, result in zip(names, results): print("Average of %s is %0.2f" % (name, result)) corpus_info = defaultdict(lambda: 0) for info in token2info.values(): for img, cnt in info.items(): corpus_info[img] += cnt print("Cover %d images" % len(corpus_info)) # load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4' parser = argparse.ArgumentParser() parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4', help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--image-sets', type=str, default='coco_minival', help='The splits of images to be extracted') parser.add_argument('--corpus', type=str, default='wiki103', help='Evaluated corpus') parser.add_argument('--maxsents', type=int, default=10000, help='The maximum sentences to be evaluated in the corpus') args = parser.parse_args() keys_path = os.path.join(args.load, 'keys') print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets)) model, tokenizer = load_model_and_tokenizer(args.load) img_sets = args.image_sets.split(',') vokenizer = Vokenizer(model, tokenizer, keys_path, img_sets) corpus_list = args.corpus.split(',') for corpus in corpus_list: corpus = corpus.strip() print("\nProcessing corpus %s for diversity test:" % corpus) # token2info: token --> (img_id --> cnt) token2info = defaultdict(lambda: defaultdict(lambda: 0)) if corpus in imgset2fname: if 'cc' in corpus: sents = load_cc_data(corpus) else: sents = load_lxrt_data(corpus) batch_size = 32 for start_id in tqdm.tqdm(range(0, len(sents), batch_size)): batch_sents = sents[start_id: start_id + batch_size] scores, ids, tokens, paths = vokenizer.vokenize_sents(batch_sents, topk=None) for i in range(len(paths)): for token, path in zip(tokens[i][1:-1], paths[i][1:-1]): token2info[token][path] += 1 else: tokens_list = load_lang_data(corpus, args.maxsents) batch_size = 16 for start_id in tqdm.tqdm(range(0, len(tokens_list), batch_size)): batch_tokens = tokens_list[start_id: start_id + batch_size] scores, ids, tokens, paths = vokenizer.vokenize_ids(batch_tokens, topk=None) for i in range(len(paths)): for token, path in zip(tokens[i][1:-1], paths[i][1:-1]): token2info[token][path] += 1 analyze(token2info) ================================================ FILE: vokenization/evaluate_retrieval.py 
================================================ import argparse from collections import defaultdict import json import os import tqdm from vokenization import Vokenizer, load_model_and_tokenizer import common imgset2fname = { 'coco_train': 'mscoco_train.json', 'coco_nominival': 'mscoco_nominival.json', 'coco_minival': 'mscoco_minival.json', 'vg_nococo': 'vg_nococo.json', 'cc_train': 'training.tsv', 'cc_valid': 'validation.tsv', } def load_cc_data(img_set): fname = os.path.join(common.CC_ROOT, imgset2fname[img_set]) sentXimgname = [] with open(fname) as f: for line in f: sent, gt_img_name = line.split('\t') gt_img_name = gt_img_name.strip() sentXimgname.append((sent, gt_img_name)) print("Load the %d (img, sent) pairs for image set %s from %s" % ( len(sentXimgname), img_set, fname)) return sentXimgname def load_lxrt_data(img_set): fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set]) sentXimgname = [] with open(fname) as f: data = json.load(f) for datum in data: gt_img_name = datum['img_id'] + '.jpg' sents = datum['sentf']['mscoco'] for sent in sents: sentXimgname.append((sent, gt_img_name)) print("Load the %d (img, sent) pairs for image set %s from %s" % ( len(sentXimgname), img_set, fname)) return sentXimgname # load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4' parser = argparse.ArgumentParser() parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4', help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--image-sets', type=str, default='coco_minival', help='The splits of images to be extracted') args = parser.parse_args() keys_path = os.path.join(args.load, 'keys') print("Evaluate for model %s on image sets %s" % (args.load, args.image_sets)) model, tokenizer = load_model_and_tokenizer(args.load) img_sets = args.image_sets.split(',') sent_level = 'sent' in args.load for img_set in img_sets: vokenizer = Vokenizer(model, tokenizer, keys_path, [img_set], sent_level=sent_level) if 'cc' in img_set: sentXimgname = load_cc_data(img_set) else: sentXimgname = load_lxrt_data(img_set) topks = [1, 5, 10] print("\nEvaluate image set", img_set, "for topk retrieval:", topks) total = 0 arg_topk = None if max(topks) == 1 else max(topks) results = defaultdict(lambda: 0) batch_size = 32 for start_id in tqdm.tqdm(range(0, len(sentXimgname), batch_size)): batch_sentXimg = sentXimgname[start_id: start_id + batch_size] sents, gt_img_names = zip(*batch_sentXimg) sents = list(sents) scores, ids, tokens, paths_list = vokenizer.vokenize_sents(sents, topk=arg_topk) if sent_level: paths_list = [x[:3] for x in paths_list] # Only eval the first vokens. if arg_topk is None: paths_list = [[[img_id] for img_id in sent] for sent in paths_list] for paths, gt_img_name in zip(paths_list, gt_img_names): # for each sent in batch for topk_paths in paths[1:-1]: # for each token in sent for k, kth_path in enumerate(topk_paths): # for each img_path in topk image paths of a token img_name = os.path.split(kth_path)[-1] if img_name == gt_img_name: results[k + 1] += 1 total += sum(map(lambda x: len(x) - 2, paths_list)) accumulate = 0 for i in range(1, max(topks)+1): accumulate += results[i] if i in topks: print("R%d: %0.2f%%, (Random: %0.4f%%)" % ( i, accumulate / total * 100., i / vokenizer.img_num * 100. 
)) del vokenizer ================================================ FILE: vokenization/extract_vision_keys.py ================================================ # In this file, we extract the vision features as the keys in retrieval. import argparse import os import pickle import shutil import sys import h5py import torch from torchvision import transforms from torchvision.datasets.folder import default_loader import tqdm from transformers import BertTokenizer from PIL import Image import common # Load all images Image.MAX_IMAGE_PIXELS = None def get_img_path(img_set, img_id): """ Get the paths regarding the img_set and img_id. THIS FUNCTION MIGHT NEED TO BE MODIFIED. """ source, tag = img_set.split('_') if source == 'cc': split_tag, _ = img_id.split('_') return "%s/images/%s/%s" % (common.CC_ROOT, split_tag, img_id) elif 'COCO' in img_id: _, split_tag, _ = img_id.split('_') return "%s/images/%s/%s" % (common.COCO_ROOT, split_tag, img_id + '.jpg') else: # VG images return "%s/images/%s.jpg" % (common.VG_ROOT, img_id) def get_img_paths_and_ids(img_set): """ Return a list of images paths and image ids in this 'img_set'. """ # Load the image ids from the common local dir, # thus make sure that the order of the images are the same. info_dir = os.path.join(common.LOCAL_DIR, 'images') img_paths = [] with open(os.path.join(info_dir, img_set + '.ids')) as f: img_ids = list(map(lambda x: x.strip(), f.readlines())) for img_id in img_ids: img_paths.append(get_img_path(img_set, img_id)) return img_paths, img_ids def save_img_paths_and_ids(img_set, img_paths, img_ids, output): info_dir = os.path.join(common.LOCAL_DIR, 'images') # Save Image Paths curr_paths_fname = os.path.join(output, img_set + '.path') print("\tSave img paths to ", curr_paths_fname) with open(curr_paths_fname, 'w') as f: for path in img_paths: f.write(path + "\n") # Save Image Ids curr_ids_fname = os.path.join(output, img_set + '.ids') print("\tSave img ids to ", curr_ids_fname) with open(curr_ids_fname, 'w') as f: for idx in img_ids: f.write(idx + "\n") common_paths_fname = os.path.join(info_dir, img_set + '.path') if os.path.exists(common_paths_fname): with open(common_paths_fname) as f: common_img_paths = f.readlines() common_img_paths = [img_path.strip() for img_path in common_img_paths] # All feature extractor should extract for the same image set. assert common_img_paths == img_paths else: shutil.copy(curr_paths_fname, common_paths_fname) def extract_vision_feature_keys(model, img_transform, img_sets, output, batch_size): """ :param model: The visn_model which takes an image [b, channel, H, W] as input, and output with [b, f] :param img_transform: The transformation of images, compatible with training. :param img_sets: The sets of images to be extracted. :param output: The directory to save the extracted keys. :return: """ last_dim = -1 for img_set in img_sets: print("Extracting feature keys for image set %s" % img_set) img_paths, img_ids = get_img_paths_and_ids(img_set) saved_img_paths = [] saved_img_ids = [] img_keys = [] tensor_imgs = [] for i, img_path in enumerate(tqdm.tqdm(img_paths)): try: pil_img = default_loader(img_path) except Exception as e: print(e) print("Skip image %s" % img_path) continue saved_img_paths.append(img_path) saved_img_ids.append(img_ids[i]) tensor_imgs.append(img_transform(pil_img)) if len(tensor_imgs) == batch_size: visn_input = torch.stack(tensor_imgs).cuda() with torch.no_grad(): visn_output = model(visn_input) # Check sizes of features are equal. 
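                # All batches must yield features of the same dimensionality, since every
                # key is stored in a single (num_images, last_dim) hdf5 dataset below.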
if last_dim == -1: last_dim = visn_output.shape[-1] assert last_dim == visn_output.shape[-1] last_dim = visn_output.shape[-1] # Saved the features in hdf5 img_keys.extend(visn_output.detach().cpu().numpy()) tensor_imgs = [] if len(tensor_imgs) > 0: visn_input = torch.stack(tensor_imgs).cuda() with torch.no_grad(): visn_output = model(visn_input) # Saved the features in hdf5 img_keys.extend(visn_output.detach().cpu().numpy()) assert len(img_keys) == len(saved_img_paths) h5_path = os.path.join(output, img_set + '.hdf5') print(f"\tSave features (keys) to {h5_path} with hdf5 dataset 'Keys'.") h5_file = h5py.File(h5_path, 'w') dset = h5_file.create_dataset("keys", (len(saved_img_paths), last_dim)) for i, img_key in enumerate(img_keys): dset[i] = img_key save_img_paths_and_ids(img_set, saved_img_paths, saved_img_ids, output) h5_file.close() # This default transformation is used by PyTorch ResNet on ImageNet. normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) default_transform = transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), normalize ]) import torch from torch import nn import torchvision.models as models def get_visn_arch(arch): try: return getattr(models, arch) except AttributeError as e: print(e) print("There is no arch %s in torchvision." % arch) # __all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', # 'resnet152', 'resnext50_32x4d', 'resnext101_32x8d', # 'wide_resnet50_2', 'wide_resnet101_2'] class VisnModel(nn.Module): def __init__(self, arch='resnet50', pretrained=True): """ :param dim: dimension of the output :param arch: backbone architecture, :param pretrained: load feature with pre-trained vector :param finetuning: finetune the model """ super().__init__() # Setup Backbone resnet = get_visn_arch(arch)(pretrained=pretrained) for param in resnet.parameters(): param.requires_grad = False resnet.fc = nn.Identity() self.backbone = resnet def forward(self, img): """ :param img: a tensor of shape [batch_size, H, W, C] :return: a tensor of [batch_size, d] """ x = self.backbone(img) x = x.detach() # x = x / x.norm(2, dim=-1, keepdim=True) return x if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--load-dir', type=str, default=None, help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--torchvision-model', type=str, default=None, help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--image-sets', type=str, default='coco_minival', help='The splits of images to be extracted') parser.add_argument('--output-dir', type=str, default=None, help='The directory to save the extracted feature keys') parser.add_argument('--batch-size', type=int, default=32) args = parser.parse_args() img_sets = [img_set.strip() for img_set in args.image_sets.split(',')] if args.torchvision_model is not None: assert args.load_dir is None, ("either load from torch model using option 'torchvision_model'" "or from pre-trained CoX model with option 'load_dir'") visn_model = VisnModel(arch=args.torchvision_model).eval().cuda() if args.batch_size > 1: # for multi-batch extraction, must use the same image size img_transform = transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), normalize ]) else: # For single-batch extraction, we want to extract high-quality features, with two processes: # 1. Use large image sizes (400 - 600) # 2. Keep the aspect ratio MIN_SIZE = 400. 
            MAX_SIZE = 600.

            def img_transform_func(img):
                img_w, img_h = img.size     # PIL Image's size order is (w, h)
                assert img_w > 0 and img_h > 0
                scale = min(
                    MIN_SIZE / min(img_w, img_h),
                    MAX_SIZE / max(img_w, img_h),
                )   # Keep the aspect ratio
                want_w, want_h = int(img_w * scale), int(img_h * scale)
                _img_transform = transforms.Compose([
                    transforms.Resize((want_h, want_w)),    # PyTorch uses size order (h, w)
                    transforms.ToTensor(),
                    normalize
                ])
                return _img_transform(img)
            img_transform = img_transform_func
    else:
        # Load the model
        if os.path.exists(args.load_dir + '/BEST.pth.model'):
            print("Load model from %s." % (args.load_dir + '/BEST.pth.model'))
            sys.path.append(args.load_dir + '/src')
            for dirc in os.listdir(args.load_dir + '/src'):
                sys.path.append(args.load_dir + '/src/' + dirc)
            # import model
            # The pickle has some issues... thus we must make the library importable first.
            joint_model = torch.load(args.load_dir + '/BEST.pth.model')
            joint_model.eval()      # DO NOT FORGET THIS!!!
            visn_model = joint_model.visn_model
        else:
            print(f"No snapshot {args.load_dir + '/BEST.pth.model'}. Exit.")
            exit()

        # Load the img-preprocessing transformation which was used when training the CoX model.
        if os.path.exists(args.load_dir + '/img_transform.pkl'):
            print("Load img transformation from %s." % (args.load_dir + '/img_transform.pkl'))
            with open(args.load_dir + '/img_transform.pkl', 'rb') as f:
                img_transform = pickle.load(f)
        else:
            print("Using default image transformation")
            img_transform = default_transform

    # Feature output directory
    output_dir = args.output_dir
    if args.output_dir is None:
        output_dir = args.load_dir + '/keys'    # Save the keys with the model dict
    os.makedirs(output_dir, exist_ok=True)

    extract_vision_feature_keys(
        visn_model,
        img_transform,
        img_sets,
        output_dir,
        args.batch_size
    )


================================================
FILE: vokenization/indexing.py
================================================
import numpy as np
import torch
import tqdm


class GPUIndexer(object):
    def __init__(self, keys, gpus=(0,), fp16=False):
        self.gpus = gpus
        self.gpu = gpus[0]
        self.keys = keys
        self.fp16 = fp16
        self.dim = len(self.keys[0])

    def topk(self, query, topk: int = 1):
        raise NotImplementedError

    def batch_topk(self, query, topk: int = 1):
        raise NotImplementedError

    def batch_top1(self, query):
        raise NotImplementedError


class TorchGPUIndexer(GPUIndexer):
    def __init__(self, keys, gpus=(0,), fp16=False):
        super().__init__(keys, gpus, fp16)
        self.gpu_keys = torch.tensor(keys).cuda(self.gpu)
        print(f"Build torch indexer on GPU {self.gpu}")
        if self.fp16:
            self.gpu_keys = self.gpu_keys.half()

    def topk(self, query, topk: int = 1):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys * query).sum(-1)
        topk_score, topk_idx = score.topk(topk)
        return topk_score, topk_idx

    def batch_topk(self, query, topk: int = 1):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)
        topk_score, topk_idx = score.topk(topk, dim=1)
        return topk_score, topk_idx

    def batch_top1(self, query):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)
        topk_score, topk_idx = score.max(dim=1)
        return topk_score, topk_idx

    def batch_top1_l2(self, query):
        if not type(query) is torch.Tensor:
            query = torch.tensor(query)
        query = query.cuda(self.gpu)
        if self.fp16:
            query = query.half()
        # print(query.norm(dim=-1) - 1.)
        # print(self.gpu_keys.norm(dim=-1) - 1.)
        score = ((self.gpu_keys.unsqueeze(0) - query.unsqueeze(1)) ** 2).sum(-1)
        topk_score, topk_idx = score.min(dim=1)
        return topk_score, topk_idx


class FaissGPUIndexer(GPUIndexer):
    def __init__(self, keys, gpus=(0,), fp16=False):
        try:
            import faiss
        except Exception as e:
            print("Faiss is not installed! Please see https://github.com/facebookresearch/faiss/blob/master/INSTALL.md.")
            raise e
        super().__init__(keys, gpus, fp16)
        res = faiss.StandardGpuResources()
        index_flat = faiss.IndexFlatL2(self.dim)
        # index_flat = faiss.IndexFlatIP(self.dim)
        print(f"Build faiss indexer on GPU {self.gpu}")
        print(keys.shape)
        self.gpu_index_flat = faiss.index_cpu_to_gpu(res, self.gpu, index_flat)
        self.gpu_index_flat.add(keys)

    def batch_topk(self, query, topk: int = 1):
        if type(query) is torch.Tensor:
            query = query.cpu().numpy()
        D, I = self.gpu_index_flat.search(query, topk)
        D = torch.from_numpy(D)
        I = torch.from_numpy(I)
        return D, I

    def batch_top1(self, query):
        """
        :param query: shape of [b, f]
        """
        if type(query) is torch.Tensor:
            query = query.cpu().numpy()
        D, I = self.gpu_index_flat.search(query, 1)
        D = D[:, 0]
        I = I[:, 0]
        D = torch.from_numpy(D)
        I = torch.from_numpy(I)
        return D, I


if __name__ == '__main__':
    # 1M keys and 1M queries
    keys = np.random.uniform(size=(1000000, 64)) * 0.01
    querys = np.random.uniform(size=(1000000, 64)) * 0.01
    # GPUIndexer is abstract (its retrieval methods raise NotImplementedError),
    # so benchmark with the concrete torch-based indexer.
    indexer = TorchGPUIndexer(keys, [0], fp16=True)
    batch_size = 64
    for start in tqdm.tqdm(range(0, len(querys), batch_size)):
        query = querys[start: start + batch_size]
        # indexer.batch_topk(query, 1)
        top_score, top_idx = indexer.batch_top1(query)


================================================
FILE: vokenization/revokenization.py
================================================
# Copyleft 2020 project COL.
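# Revokenization: the matching model expects one ("forward") tokenizer, while a
# target corpus may already be tokenized with another ("backward") tokenizer.
# ReVokenizer decodes the backward token ids back into sentences, vokenizes the
# sentences with the forward tokenizer, and then maps each backward token to the
# forward token whose character span overlaps it the most (the IoU() helper at
# the bottom of this file). A minimal sketch of that rule, never called by the
# pipeline and using made-up character offsets:
def _alignment_sketch():
    backward_span = (2, 6)                      # a backward token covers chars 2..6
    forward_spans = [(0, 2), (2, 5), (5, 9)]    # candidate forward token spans
    best = max(range(len(forward_spans)),
               key=lambda i: IoU(forward_spans[i], backward_span))
    return best                                 # -> 1, the most-overlapping forward token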
from transformers import AutoTokenizer class ReVokenizer: """ Convert a """ def __init__(self, forward_tokenizer_name, backward_tokenizer_name, vokenizer): """ :args forward_tokenizer: :args backward_tokenizer: :args vokenizer: """ self.forward_tokenizer = AutoTokenizer.from_pretrained(forward_tokenizer_name, use_fast=True) self.backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name, use_fast=True) self.slow_backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name) self.vokenizer = vokenizer self.prepare_for_unicode() def vokenize_sent(self, sents, topk=None): pass def vokenize_ids(self, input_ids, topk=None, verbose=False): """ backward_input <-- Backward Tokenizer <-- Sentence --> Forward Tokenizer --> forward_input --> Vokenizer --> forward_results """ sents, forward_input, backward_input = self.process(input_ids) alignments = self.batch_calculate_alignment( forward_input['offset_mapping'], backward_input['offset_mapping'], ) forward_results = self.vokenizer.vokenize_ids( forward_input['input_ids'], topk ) backward_results = self.batch_map_back(forward_results, alignments) if verbose: # if True: self.show_alignments( sents, forward_input, backward_input, alignments, input_ids, backward_results) return backward_results def show_alignments(self, sents, forward_inputs, backward_inputs, alignments, input_ids, backward_results): forward_ids = forward_inputs['input_ids'] forward_offsets = forward_inputs['offset_mapping'] backward_ids = backward_inputs['input_ids'] backward_offsets = backward_inputs['offset_mapping'] _, _, backward_result_tokens, _ = backward_results for sent, forward_id, backward_id, forward_offset, backward_offset, alignment, input_id, backward_result_token in zip( sents, forward_ids, backward_ids, forward_offsets, backward_offsets, alignments, input_ids, backward_result_tokens ): print(sent) for backward_idx, forward_idx in enumerate(alignment): def get_str(l, r): return sent[l: r] print("%2d %2d %7s %7s %7s | %7s %7s %7s" % ( backward_idx, forward_idx, self.backward_tokenizer._convert_id_to_token(input_id[backward_idx]), self.backward_tokenizer._convert_id_to_token(backward_id[backward_idx]), get_str(*backward_offset[backward_idx]), self.forward_tokenizer._convert_id_to_token(forward_id[forward_idx]), backward_result_token[backward_idx + 1], get_str(*forward_offset[forward_idx]), )) print() def show_input(self, sents, forward_inputs, backward_inputs, input_ids): forward_ids = forward_inputs['input_ids'] forward_offsets = forward_inputs['offset_mapping'] backward_ids = backward_inputs['input_ids'] backward_offsets = backward_inputs['offset_mapping'] for sent, forward_id, backward_id, forward_offset, backward_offset, input_id in zip( sents, forward_ids, backward_ids, forward_offsets, backward_offsets, input_ids ): print(sent) for i, (backward_i, bo, input_i) in enumerate(zip(backward_id, backward_offset, input_id)): print("%7s %7s" % ( self.backward_tokenizer._convert_id_to_token(backward_i), self.backward_tokenizer._convert_id_to_token(input_i), # self.forward_tokenizer._convert_id_to_token(forward_i), ), bo, sent[bo[0]: bo[1]] if bo is not None else '') print() def backward_decode(self, input_id): # return u''.join(self.backward_tokenizer.convert_ids_to_tokens(input_id)).replace('Ġ', ' ') # return self.backward_tokenizer.decode(input_id) tokens = self.slow_backward_tokenizer.convert_ids_to_tokens(input_id, skip_special_tokens=True) # print(tokens) return self.slow_backward_tokenizer.convert_tokens_to_string( tokens ) def process(self, 
input_ids): """ :return: two dicts (forward_input, backward_input) with keys "input_ids" "offset_mapping" """ sents = [self.backward_decode(input_id) for input_id in input_ids] tokenizer_kwargs = { 'return_token_type_ids': False, 'return_attention_mask': False, 'return_offsets_mapping': True, } # 'add_special_tokens': False, forward_input = self.forward_tokenizer.batch_encode_plus( sents, **tokenizer_kwargs ) backward_input = self.backward_tokenizer.batch_encode_plus( sents, **tokenizer_kwargs ) # Avoid batch-1 self._safe_guard(forward_input) self._safe_guard(backward_input) # Remove and self._remove_special_tokens(forward_input) self._remove_special_tokens(backward_input) # postprocessing of the backwards self._calibrate_backward_offset(backward_input) # self._fix_nouns(backward_input) self._fix_length(backward_input, input_ids) assert list(map(len, backward_input['input_ids'])) == \ list(map(len, input_ids)), (list(map(len, backward_input['input_ids'])), list(map(len, input_ids))) return sents, forward_input, backward_input @staticmethod def _safe_guard(inputs): ids = inputs['input_ids'] if type(ids[0]) is int: for key, value in inputs.items(): inputs[key] = [value] @staticmethod def _remove_special_tokens(inputs): if type(inputs) is dict: for key in inputs: inputs[key] = ReVokenizer._remove_special_tokens(inputs[key]) return inputs return [input[1:-1] for input in inputs] @staticmethod def _fix_nouns(backward_input): backward_offsets = backward_input['offset_mapping'] for backward_offset in backward_offsets: last_not_noun_idx = -1 while backward_offset[last_not_noun_idx] is None: last_not_noun_idx -= 1 for noun_idx in range(last_not_noun_idx + 1, 0): backward_offset[noun_idx] = backward_offset[last_not_noun_idx] @staticmethod def _fix_length(backward_input, input_ids): backward_ids = backward_input['input_ids'] backward_offsets = backward_input['offset_mapping'] for i in range(len(backward_ids)): desired_length = len(input_ids[i]) if len(backward_ids[i]) > desired_length: backward_ids[i] = backward_ids[i][:desired_length] backward_offsets[i] = backward_offsets[i][:desired_length] while len(backward_ids[i]) < desired_length: backward_ids[i].append(backward_ids[i][-1]) backward_offsets[i].append(backward_offsets[i][-1]) # print(desired_length) # print(len(backward_ids[i])) assert desired_length == len(backward_ids[i]) == len(backward_offsets[i]) def _calibrate_backward_offset(self, backward_input): batch_input_ids = backward_input['input_ids'] batch_new_offset = [] for input_ids in batch_input_ids: now = 0 byte_list = [] new_offset = [] for input_id in input_ids: token = self.backward_tokenizer._convert_id_to_token(input_id) start = now unicode_complete_flag = True for char in token: byte = self.c2b[char] byte_list.append(byte) try: unicode_char = bytes(byte_list).decode('utf-8') byte_list = [] now += 1 unicode_complete_flag = True except UnicodeDecodeError as e: unicode_complete_flag = False if unicode_complete_flag: left, right = start, now else: left, right = start, now + 1 new_offset.append((left, right)) # print(token, sent[left: right].replace(' ', 'Ġ')) batch_new_offset.append(new_offset) backward_input['offset_mapping'] = batch_new_offset def prepare_for_unicode(self): def bytes_to_unicode(): """ Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control characters the bpe code barfs on. The reversible bpe codes work on unicode strings. 
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a signficant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. """ bs = ( list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list( range(ord("®"), ord("ÿ") + 1)) ) cs = bs[:] n = 0 for b in range(2 ** 8): if b not in bs: bs.append(b) cs.append(2 ** 8 + n) n += 1 cs = [chr(n) for n in cs] return dict(zip(bs, cs)) self.b2c = bytes_to_unicode() self.c2b = {c: b for b, c in self.b2c.items()} def show(self, ids_list): print( [self.backward_tokenizer.convert_ids_to_tokens(ids) for ids in ids_list] ) @staticmethod def batch_map_back(results, alignments): if type(results) is tuple: # Handle multiple output by the vokenizer # i.e., input_ids, input_scores, ... return [ReVokenizer.batch_map_back(one_results, alignments) for one_results in results] new_results = [] for result, alignment in zip(results, alignments): # print(result) # print(max(alignment), len(result)) new_results.append( [result[0]] + [result[idx + 1] for idx in alignment] + [result[-1]]) assert max(alignment) < (len(result) - 2) return new_results @staticmethod def batch_calculate_alignment(batch_forward_offsets, batch_backward_offsets): """ for each backward_token indicated by backward offset, align a forward token to it. """ alignments = [] for forward_offsets, backward_offsets in zip(batch_forward_offsets, batch_backward_offsets): alignment = [] # Backward: I ha ve a lov ely c at. # Sent: I have a lovely cat # Forward: I hav e a lo ve ly cat. now_idx = 0 for backward_offset in backward_offsets: best_idx = now_idx best_iou = IoU(forward_offsets[best_idx], backward_offset) while (now_idx + 1 < len(forward_offsets)) and \ (forward_offsets[now_idx][1] < backward_offset[1]): now_idx += 1 now_iou = IoU(forward_offsets[now_idx], backward_offset) if now_iou > best_iou: best_idx = now_idx best_iou = now_iou alignment.append(best_idx) alignments.append(alignment) return alignments def IoU(a, b): x1, y1 = a x2, y2 = b len1 = y1 - x1 len2 = y2 - x2 I = max(min(y1, y2) - max(x1, x2), 0) U = len1 + len2 - I return I / max(U, 1) if __name__ == "__main__": revokenizer = ReVokenizer('bert-base-uncased', 'roberta-base', None) tokenizer = AutoTokenizer.from_pretrained('roberta-base') sents = ['Do not panic. ', ' iso have a dream .', ' This is a test???', 'Congratulations to the LiLT Founder and CEO, @stanfordnlp grad, Spence Green!', 'Ay congrats Ethan! An awesome crew, well deserved', ' By the fourth season, fewer than three million viewers tuned in each week despite what some fans and critics considered an increase in episode quality.', 'Filming of the final episode began on Friday, February 25, after the first half of the day was spent completing "Terra Prime". Principal photography took eight days to complete, one day longer than usual. ', 'sda asdo weij sdjf oweif bqosdj weorasd.?SdfasXX...', ] ids = [tokenizer.encode(sent, add_special_tokens=False) for sent in sents] print(sents) sents = [tokenizer.decode(idx) for idx in ids] print(sents) revokenizer.vokenize_ids(ids) ================================================ FILE: vokenization/revokenize_corpus_mp.py ================================================ # coding=utf-8 # Copyleft 2020 project COL. 
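# This script re-vokenizes a whole corpus with several GPU worker processes
# ("processers") feeding one "reducer" process. Batches may finish out of
# order, so each batch carries a page_id and the reducer holds early results
# in a priority queue until the next expected page arrives. A minimal sketch
# of that re-ordering idea (not called anywhere; toy data only):
import heapq

def _reorder_sketch():
    arrivals = [(2, 'c'), (0, 'a'), (1, 'b')]   # (page_id, result), out of order
    heap, committed, next_page = [], [], 0
    for page_id, result in arrivals:
        heapq.heappush(heap, (page_id, result))
        # Commit every result whose page_id matches the next expected page.
        while heap and heap[0][0] == next_page:
            committed.append(heapq.heappop(heap)[1])
            next_page += 1
    return committed                            # -> ['a', 'b', 'c']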
import argparse import copy from multiprocessing import Queue, Process import os import queue import sys import time import h5py import torch import tqdm from spacy.lang.en import English sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) from vokenization.vokenization import load_model_and_tokenizer, Vokenizer from vokenization.revokenization import ReVokenizer # Handle the GPU issue in multi-processing. from multiprocessing import set_start_method try: set_start_method('spawn') except RuntimeError: pass def processer(args, input_queue, output_queue): print(f"Setup workers on gpu {args.gpus}") img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')]) print("Build models and tokenizer") # We will assign the GPU to model latter, thus load to cpu first! model, tokenizer = load_model_and_tokenizer(args.load, cpu=True) keys_dir = args.load + '/keys' # Save the keys with the model dict print("Build Retriever from %s with image sets" % keys_dir, img_sets) vokenizer = Vokenizer(model, tokenizer, keys_dir, img_sets=img_sets, max_img_num=args.max_img_num, gpus=args.gpus, sent_level=('sent' in args.load)) print(f"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.") # Before vokenization, save the image ids dset_name = os.path.split(args.corpus)[-1] modifier = f".{vokenizer.img_num}" if vokenizer.img_num != 50000 else "" vokens_img_ids_path = os.path.join( args.output, f"{dset_name}.{'_'.join(img_sets)}{modifier}.ids" ) if args.gpus[0] == 0: if os.path.exists(vokens_img_ids_path): # If the img_ids file exists, assert that they are the same. saved_img_ids = open(vokens_img_ids_path).readlines() img_ids = vokenizer.img_ids assert len(saved_img_ids) == len(img_ids) for saved_img_id, img_id in zip(saved_img_ids, img_ids): assert saved_img_id.strip() == img_id else: vokenizer.dump_img_ids(vokens_img_ids_path) while True: page_id, sents = input_queue.get() # Print the first few sents for debugging if args.gpus[0] == 0: if page_id < 12 and sents is not None: print('page_id:', page_id) print('batch_size:', len(sents)) print('ids of sent[0]:', sents[0]) print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0])) print() # print(f"Processer {args.gpus}: Get Page Id {page_id}") if sents is not None: output_str = '' results = vokenizer.vokenize_ids(sents) idxs = results[1] for j, idx in enumerate(idxs): assert len(idx[1:-1]) == len(sents[j]) dump_idx = map(lambda x: str(x.item()), idx[1:-1]) output_str += ' '.join(dump_idx) + '\n' output_queue.put((page_id, output_str)) else: break def reducer(output_fname, output_queue, total_tokens): next_page_id = 0 heap = queue.PriorityQueue() output = open(output_fname, 'a') cache = "" start_time = None processed_tokens = 0 while True: page_id, result = output_queue.get() if start_time is None: # The clock starts to tick when receiving the first package. 
start_time = time.time() # print("Reducer: Get Page Id %d" % page_id) if result is not None: # Put it into the heap heap.put((page_id, result)) # Check the could-be-dumped data in the queue while heap.qsize() > 0: smallest_page_id, result = heap.get() if smallest_page_id == next_page_id: # which means that this page is the next page, thus dump it # print("Reducer: Commit Page Id %d" % next_page_id) processed_tokens += len(result.split(' ')) cache += result next_page_id += 1 else: heap.put((smallest_page_id, result)) break # print("Reducer: Length of Cache Now", len(cache)) if len(cache) > 1000000: # Dump for every 1000000 characters to reduce IO calls output.write(cache) output.flush() cache = '' used_time = int(time.time() - start_time) print("Process %d tokens, %d to go, with speed %0.2f tokens/second," "finished in %0.2f hours" % ( processed_tokens, total_tokens - processed_tokens, processed_tokens / used_time, (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600 )) else: if len(cache) > 0: output.write(cache) output.flush() cache = '' break output.close() def setup_mp(args, tokens, sent_ranges, vokens_path): QUEUE_SIZE = 10000 input_queue = Queue(maxsize=QUEUE_SIZE) output_queue = Queue(maxsize=QUEUE_SIZE) workers = [] num_gpu = torch.cuda.device_count() for worker_id in range(args.num_workers): gpu_id = worker_id % num_gpu curr_args = copy.copy(args) curr_args.gpus = (gpu_id,) worker = Process(target=processer, args=(curr_args, input_queue, output_queue)) worker.daemon = True worker.start() workers.append(worker) total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0 reduce = Process(target=reducer, args=(vokens_path, output_queue, total_tokens)) reduce.start() for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)): sents = [] for left, right in sent_ranges[start_id: start_id + args.batch_size]: sents.append(tokens[left: right]) input_queue.put((i, sents)) # Notifying workers the end of input for _ in workers: input_queue.put((-1, None)) # wait for workers to terminate for w in workers: w.join() # Notify the reducer the end of output output_queue.put((-1, None)) # wait for reducer to terminate reduce.join() def segment_sent( tokens, tokenizer, tokens_line_info_path, tokens_sent_info_path ): """ Single-processed segmentation of sentences. We might need to parallel this as well. 
""" with open(tokens_line_info_path) as f: line_starts = list(map(int, f.readlines())) nlp = English() sentencizer = nlp.create_pipe("sentencizer") nlp.add_pipe(sentencizer) sent_starts = [0] now = 0 for i in tqdm.tqdm(range(len(line_starts) - 1)): start_token_idx = line_starts[i] end_token_idx = line_starts[i + 1] line_tokens = tokens[start_token_idx: end_token_idx] line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens)) line = line.replace("[UNK]", "UNK") doc = nlp(line) sents_len = 0 sents = [] for sent in doc.sents: if i < 2: print(sent) sent = str(sent) sents.append(sent) words = sent.split(' ') sent_len = len(words) now += sent_len sent_starts.append(now) sents_len += sent_len if sents_len != len(line_tokens): print(sents_len) print(sents) print(len(line_tokens)) print(line) assert False assert sent_starts[-1] == end_token_idx with open(tokens_sent_info_path, 'w') as f: for sent_start in sent_starts: f.write(str(sent_start) + "\n") if __name__ == "__main__": parser = argparse.ArgumentParser() # Text parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw') # Models parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4', help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--output', type=str, default=None, help='The directory to save the extracted feature keys.' '"None" would save in the "load" dir') parser.add_argument('--backward-tokenizer-name', type=str, default='roberta-base') parser.add_argument('--forward-tokenizer-name', type=str, default='roberta-base') # Vision: Define the vokens set parser.add_argument('--image-sets', type=str, default='vg_nococo', help='The splits of images to be extracted') parser.add_argument('--max-img-num', type=int, default=50000, help='number of images used. -1 means all images.') # Speed Up Options: parser.add_argument('--num-workers', type=int, default=-1, help='-1 will use all GPUs.') parser.add_argument('--batch-size', type=int, default=16, help='The # of sentences in a batch.') args = parser.parse_args() if args.num_workers == -1: args.num_workers = torch.cuda.device_count() if args.output is None: args.output = os.path.join(args.load, 'vokens') os.makedirs(args.output, exist_ok=True) dset_name = os.path.split(args.corpus)[-1] img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')]) print() print("Main Th" "read: Build a virtual vokenizer to check the number of images.") keys_dir = args.load + '/keys' # Save the keys with the model dict virtual_vokenizer = Vokenizer( None, None, keys_dir, img_sets=img_sets, max_img_num=args.max_img_num, gpus=(-1,), sent_level=('sent' in args.load)) modifier = f".{virtual_vokenizer.img_num}" if virtual_vokenizer.img_num != 50000 else "" vokens_path = os.path.join( args.output, f"{dset_name}.{'_'.join(img_sets)}{modifier}" ) tokens_hdf5_path = f'{args.corpus}.{args.backward_tokenizer_name}.hdf5' tokens_sent_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.sent' # "Load" tokens from hdf5 tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r') tokens = tokens_hdf5['tokens'] # Calibrate the start line if the vokens have been proceeded. 
    if not os.path.exists(tokens_sent_info_path):
        tokens_line_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.line'
        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
        segment_sent(
            tokens,
            tokenizer,
            tokens_line_info_path,
            tokens_sent_info_path
        )

    # Load sent info and find the start sentence
    with open(tokens_sent_info_path) as f:
        sent_starts = list(map(int, f.readlines()))

    # Skip the sentences which have already been extracted.
    extracted_tokens = 0
    if os.path.isfile(vokens_path):
        with open(vokens_path, 'r') as g:
            for g_line in tqdm.tqdm(g):
                extracted_tokens += len(g_line.strip().split(' '))
    try:
        start_sent_idx = sent_starts.index(extracted_tokens)
    except ValueError as e:
        print("The number of extracted tokens does not match a sentence start.")
        print(e)
        exit(1)     # start_sent_idx would be undefined below; stop here.

    # Start to vokenize
    print("Main Thread: Dump visual tokens to %s" % vokens_path)
    print("Main Thread: Start vokenization from the %d'th token" % sent_starts[start_sent_idx])
    sent_ranges = []
    for i in range(start_sent_idx, len(sent_starts) - 1):
        left_token_idx = sent_starts[i]
        right_token_idx = sent_starts[i + 1]
        sent_ranges.append((left_token_idx, right_token_idx))
    setup_mp(args, tokens, sent_ranges, vokens_path)

    # Save into an hdf5 file
    if os.path.exists(vokens_path + '.hdf5'):
        print("The hdf5 file %s already exists, so it is not converted again." % (vokens_path + '.hdf5'))
    else:
        with open(args.corpus + '.' + args.backward_tokenizer_name + ".sent") as f:
            for i, line in enumerate(f):
                pass
            num_tokens = int(line)
            num_sents = i
        h5_file = h5py.File(vokens_path + '.hdf5', 'w')
        dset = h5_file.create_dataset("vokens", (num_tokens,), dtype='int32')
        dump_interval = 100000
        dump_iter = 0
        lines = 0
        with open(vokens_path) as f:
            tokens = []
            for line in tqdm.tqdm(f, total=num_sents):
                for token in map(int, line.split(' ')):
                    tokens.append(token)
                if len(tokens) >= dump_interval:
                    dset[dump_iter: dump_iter + len(tokens)] = tokens
                    dump_iter += len(tokens)
                    tokens = []
                lines += 1
            dset[dump_iter: dump_iter + len(tokens)] = tokens
            dump_iter += len(tokens)
        assert num_tokens == dump_iter
        print(lines, num_sents)
        assert lines == num_sents
        h5_file.close()


================================================
FILE: vokenization/vokenization.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.
from collections import defaultdict
import math
import pickle
import os
import sys

import h5py
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer

import common
from indexing import TorchGPUIndexer, FaissGPUIndexer

VERY_LARGE = 9595959595


class Vokenizer:
    def __init__(self, model, tokenizer, keys_dir, img_sets=('coco_minival',),
                 max_img_num=VERY_LARGE, gpus=(0,), backend='faiss',
                 upper_bound=128, sent_level=False):
        """
        :param model: Huggingface language model
        :param tokenizer: Huggingface tokenizer
        :param keys_dir: the directory which saves the keys.
        :param img_sets: the img_sets to be loaded, see common.IMAGE_SETS for all options.
        :param max_img_num: load up to #max_img_num images into the dictionary
        :param gpus: The GPUs used in calculating the BERT outputs and indexing.
            Note: Currently only one GPU is supported!!!
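        :param backend: nearest-neighbor backend for voken retrieval, either
            'torch' (exact dot-product scoring in PyTorch) or 'faiss' (L2 search).
        :param upper_bound: longest token segment fed to the language model;
            longer sentences are cut into nearly equal segments (see vokenize_ids).
        :param sent_level: if True, retrieve one voken per sentence (from a
            sentence-level embedding) instead of one voken per token.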
""" self.model = model.cuda(gpus[0]) if model is not None else model self.tokenizer = tokenizer self.img_sets = img_sets self.gpus = gpus # The GPUs used in the indexer self.gpu = self.gpus[0] self.backend = backend self.upper_bound = upper_bound self.sent_level = sent_level # Otherwise use word level max_img_num = VERY_LARGE if max_img_num == -1 else max_img_num # These two are important, which indicates the mapping from # vokens to their actual images. self.img_paths = [] self.img_ids = [] for img_set in self.img_sets: assert img_set in common.IMAGE_SETS, "%s not in image sets %s" % ( img_set, common.IMAGE_SETS) # Load image paths corresponding to the keys. # img_paths_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + "_paths.txt") # img_ids_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + "_ids.txt") img_paths_fname = os.path.join(keys_dir, f"{img_set}.path") img_ids_fname = os.path.join(keys_dir, f"{img_set}.ids") if not os.path.exists(img_paths_fname): # If the actual images are not saved on the server, we would use the img_ids. img_paths_fname = img_ids_fname with open(img_paths_fname) as f: all_img_paths = list(map(lambda x: x.strip(), f.readlines())) with open(img_ids_fname) as g: all_img_ids = list(map(lambda x: x.strip(), g.readlines())) assert len(all_img_paths) == len(all_img_ids) for img_path, img_id in zip(all_img_paths, all_img_ids): if len(self.img_paths) < max_img_num: self.img_paths.append(img_path) self.img_ids.append(f"{img_set}/{img_id}") else: break assert len(self.img_paths) == len(self.img_ids) # Lazy loading and indexing self.keys = None self.keys_dir = keys_dir self.indexed = False self.indexer = None @property def img_num(self): return len(self.img_paths) def dump_img_ids(self, fname): """ Dump the mapping from the voken_id to img_ids, to fname. Saved in the format of array. """ with open(fname, 'w') as f: for img_id in self.img_ids: f.write(img_id + "\n") def __len__(self): return self.img_num def indexing(self): self.model.eval() # Load pre-extracted image keys. self.keys = [] remain_img_num = self.img_num for img_set in self.img_sets: assert img_set in common.IMAGE_SETS, "%s not in image sets %s" % ( img_set, common.IMAGE_SETS) keys_fname = os.path.join(self.keys_dir, img_set + '.hdf5') if not os.path.exists(keys_fname): assert False, "keys of image set %s is not extracted, please save it at %s" % ( img_set, keys_fname ) # Load Keys h5_file = h5py.File(keys_fname, 'r') dset = h5_file["keys"] load_img_num = min(remain_img_num, len(dset)) load_keys = dset[:load_img_num] self.keys.append(load_keys) remain_img_num -= load_img_num h5_file.close() if load_img_num == 0: break # Lazy indexing self.keys = np.concatenate(self.keys, 0) if self.backend == 'torch': self.indexer = TorchGPUIndexer(self.keys, gpus=self.gpus, fp16=True) elif self.backend == 'faiss': self.indexer = FaissGPUIndexer(self.keys, gpus=self.gpus, fp16=True) else: raise NotImplementedError(f"Backend {self.backend} is not supported") self.indexed = True def vokenize_sents(self, sents, topk=None): input_ids = [] for sent in sents: input_ids.append(self.tokenizer.encode( sent, add_special_tokens=False, # return_tensors='pt' # Return PyTorch (pt) tensors )) return self.vokenize_ids(input_ids, attention_mask=None, topk=topk) def vokenize_ids(self, input_ids, attention_mask=None, topk=None): """ :param input_ids: A list of token_ids i.e., [[token_1_1, token_1_2, ...], [token_2_1, token_2_2, ...], ...] :param attention_mask: I did not use it for now. 
:param topk: Retrieve the topk vokens for each token. :return: top_scores, top_idxs, input_tokens, top_paths Note: 1. The results would consider the additional special tokens while the input_tokens do **not**. 2. If topk=None, it will be a 2-d results with: [ [s11_top1, s12_top1, ...], [s21_top1, s22_top1, ...], ..... ] If topk!=None (e.g., 1, 5, 10), it will be a 3-d results with: [ [ [s11_top1, s11_top2, ...], [s12_top1, s12_top2, ...], ...... ], [ [s21_top1, s21_top2, ...], [s22_top1, s22_top2, ...], ...... ], ..... ], where s11_top1 means s1(the 1st sentence)1(the 1st token of the 1st sentence)_top1(the top-1 index) """ if not self.indexed: # Index the keys at the first retrieval call. self.indexing() # The original tokens input_tokens = [ ([self.tokenizer.cls_token] + [self.tokenizer._convert_id_to_token(idx) for idx in input_id] + [self.tokenizer.sep_token]) for input_id in input_ids] # Deal with over-length tokens (because the BERT-style encoder has length limit due to the positional embedding) # Here is a process to avoid very short sequence when cutting the long sentence: # Suppose the sentence length is 18 and UPPER_BOUND is 8, # we draw it as <----------------->, where "<" is bos, and ">" is the last token # instead of cut it as <------->------->->, which has very short sequence <-> in the end. # we cut it with almost equal length: <----->----->-----> input_ids = input_ids.copy() sent2segs = defaultdict(list) for i in range(len(input_ids)): if len(input_ids[i]) > self.upper_bound: num_segments = math.ceil(len(input_ids[i]) / self.upper_bound) tokens_per_seg = int(len(input_ids[i]) / num_segments) remaining = input_ids[i][tokens_per_seg:] input_ids[i] = input_ids[i][:tokens_per_seg] while len(remaining) > 0: # print(len(remaining)) sent2segs[i].append(len(input_ids)) input_ids.append(remaining[:tokens_per_seg]) remaining = remaining[tokens_per_seg:] # Convert to torch tensors. 
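        # (The splitting above with concrete numbers: 18 tokens and
        # upper_bound=8 give num_segments = ceil(18 / 8) = 3 and
        # tokens_per_seg = 18 // 3 = 6, i.e. segments of 6 + 6 + 6
        # instead of 8 + 8 + 2.)
        # Below, build_inputs_with_special_tokens adds the special tokens,
        # pad_sequence right-pads each sentence to the batch maximum with
        # pad_token_id, and the attention mask marks non-pad positions;
        # if nothing was padded, the mask is dropped (None).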
if not type(input_ids) is torch.Tensor: input_ids = [ torch.tensor(self.tokenizer.build_inputs_with_special_tokens(list(input_id))) for input_id in input_ids ] input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id) attention_mask = (input_ids != self.tokenizer.pad_token_id) # word_tokens --> 1, pad_token --> 0 if attention_mask.all(): attention_mask = None # Get lengths if attention_mask is not None: lengths = list(attention_mask.sum(1).numpy()) else: lengths = [len(input_ids[0])] * len(input_ids) if attention_mask is not None and type(input_ids) is not torch.Tensor: attention_mask = torch.tensor(attention_mask) # Lang model inference input_ids = input_ids.cuda(self.gpu) if attention_mask is not None: attention_mask = attention_mask.cuda(self.gpu) def apply_model(input_ids, attention_mask, lengths): with torch.no_grad(): lang_output = self.model(input_ids, attention_mask) # b, l, f if type(lang_output) is list: lang_output = lang_output[0] # Gather language output if self.sent_level: # lang_output of shape [batch_size, dim] gathered_output = lang_output else: # lang_output of shape [batch_size, max_len, dim] # --> gathered_output [ \sum_i len(i), dim] gathered_output = torch.cat([output[:length] for output, length in zip(lang_output, lengths)]) # Visn retrieval if topk is None: # It will call the function `max()` and return a 2-d tensor top_score, top_idx = self.indexer.batch_top1(gathered_output) else: # It will call the function `topk(k)` and return a 3-d tensor top_score, top_idx = self.indexer.batch_topk(gathered_output, topk=topk) return top_score, top_idx top_score, top_idx = memory_safe_apply(apply_model, input_ids, attention_mask, lengths) # Split top_score, top_idx = top_score.detach().cpu(), top_idx.detach().cpu() if not self.sent_level: # If word level, split it top_scores = list(top_score.split(lengths)) # [ float_tensor(len1), float_tensor(len2), ...] top_idxs = list(top_idx.split(lengths)) # [ int_tensor(len1), int_tensor(len2), ...] else: # If sent level, repeat the voken. # Use clone() here top_scores = [ts.expand(length, *ts.shape).clone() for ts, length in zip(top_score, lengths)] top_idxs = [tid.expand(length, *tid.shape).clone() for tid, length in zip(top_idx, lengths)] if top_idxs[0].dim() == 1: # Return the top1 paths top_paths = [[self.img_paths[idx.item()] for idx in top_idx] for top_idx in top_idxs] else: # Return the topk paths related to the sentences top_paths = [[[self.img_paths[k_idx.item()] for k_idx in topk_idx] for topk_idx in top_idx] for top_idx in top_idxs] if self.sent_level: for i, tid in enumerate(top_idxs): # Keep the first positive and others negative, to mark the header of the sentence. # [3] --> [3, 3, 3, 3] --> [-4, -4, -4, -4] --> [3, -4, -4, -4] # "-x-1" is used to handle zero, [0] --> [1, 1, 1, 1] --> [-1, -1, -1, -1] --> [0, -1, -1, -1] # print('Before conversion', tid) tid[:] = tid * (-1) - 1 tid[1] = tid[1] * (-1) - 1 # The tid[0] is corresponding to # print('After conversion', top_idxs[i]) # Put back the segments of over-length sentences if len(sent2segs) > 0: for sent_id, segment_ids in sent2segs.items(): for segment_id in segment_ids: # Append the results with the segments: # ---------Now---------------- + ----Appended Segment----- # [ I have a ][:-1] + [ cat . ][1:] # = [ I have a cat . 
] top_scores[sent_id] = torch.cat([top_scores[sent_id][:-1], top_scores[segment_id][1:]]) top_idxs[sent_id] = torch.cat([top_idxs[sent_id][:-1], top_idxs[segment_id][1:]]) top_paths[sent_id] = top_paths[sent_id][:-1] + top_paths[segment_id][1:] num_sents = len(input_tokens) top_scores = top_scores[:num_sents] top_idxs = top_idxs[:num_sents] top_paths = top_paths[:num_sents] return top_scores, top_idxs, input_tokens, top_paths def memory_safe_apply(func, *args): """ If batch-wise applying exceeds the GPU memory, it would process each sample separately and sequentially :param func: function with some constraints, see code for details. :param args: args of this function :return: """ try: return func(*args) except RuntimeError as e: print(e) batch_size = len(args[0]) outputs = [] for i in range(batch_size): one_batch_args = tuple(a[i: i+1] for a in args) output = func(*one_batch_args) # **output of the func should be of the format**: # (o1, o2, ...) where each o_i is a tensor of shape [1, ...] assert type(output) is tuple or type(output) is list outputs.append(output) # outputs = ( (o1_1, o1_2, ...), (o2_1, o2_2, ...), ...) # zip(*outputs) = ( (o1_1, o2_1, ...), (o1_2, o2_2, ...), ...) outputs = tuple(torch.cat(output) for output in zip(*outputs)) return outputs default_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') def load_model_and_tokenizer(load, cpu=False): if os.path.exists(load + '/BEST.pth.model'): sys.path.append(load + '/src') for dirc in os.listdir(load + '/src'): sys.path.append(load + '/src/' + dirc) # import model # The pickle has some issues... thus must load the library if cpu: device = torch.device('cpu') joint_model = torch.load(load + '/BEST.pth.model', map_location=device) else: joint_model = torch.load(load + '/BEST.pth.model') joint_model.eval() # DO NOT FORGET THIS!!! else: print("No snapshots there, exit.") exit() if os.path.exists(load + '/tokenizer.pkl'): with open(load + '/tokenizer.pkl', 'rb') as f: tokenizer = pickle.load(f) else: tokenizer = default_tokenizer return joint_model.lang_model, tokenizer ================================================ FILE: vokenization/vokenize_corpus_mp.py ================================================ # coding=utf-8 # Copyleft 2020 project COL. import argparse import copy from multiprocessing import Queue, Process import os import queue import sys import time import h5py import torch import tqdm from spacy.lang.en import English # sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) from vokenization import load_model_and_tokenizer, Vokenizer # Handle the GPU issue in multi-processing. from multiprocessing import set_start_method try: set_start_method('spawn') except RuntimeError: pass def processer(args, input_queue, output_queue): print(f"Setup workers on gpu {args.gpus}") img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')]) print("Build models and tokenizer") # We will assign the GPU to model latter, thus load to cpu first! 
model, tokenizer = load_model_and_tokenizer(args.load, cpu=True) keys_dir = args.load + '/keys' # Save the keys with the model dict print("Build Retriever from %s with image sets" % keys_dir, img_sets) vokenizer = Vokenizer(model, tokenizer, keys_dir, img_sets=img_sets, max_img_num=args.max_img_num, gpus=args.gpus, sent_level=('sent' in args.load)) print(f"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.") # Before vokenization, save the image ids dset_name = os.path.split(args.corpus)[-1] modifier = f".{vokenizer.img_num}" if vokenizer.img_num != 50000 else "" vokens_img_ids_path = os.path.join( args.output, f"{dset_name}.{'_'.join(img_sets)}{modifier}.ids" ) if args.gpus[0] == 0: if os.path.exists(vokens_img_ids_path): # If the img_ids file exists, assert that they are the same. saved_img_ids = open(vokens_img_ids_path).readlines() img_ids = vokenizer.img_ids assert len(saved_img_ids) == len(img_ids) for saved_img_id, img_id in zip(saved_img_ids, img_ids): assert saved_img_id.strip() == img_id else: vokenizer.dump_img_ids(vokens_img_ids_path) while True: page_id, sents = input_queue.get() # Print the first few sents for debugging if args.gpus[0] == 0: if page_id < 12 and sents is not None: print('page_id:', page_id) print('batch_size:', len(sents)) print('ids of sent[0]:', sents[0]) print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0])) print() # print(f"Processer {args.gpus}: Get Page Id {page_id}") if sents is not None: output_str = '' results = vokenizer.vokenize_ids(sents) idxs = results[1] for j, idx in enumerate(idxs): assert len(idx[1:-1]) == len(sents[j]) dump_idx = map(lambda x: str(x.item()), idx[1:-1]) output_str += ' '.join(dump_idx) + '\n' output_queue.put((page_id, output_str)) else: break def reducer(output_fname, output_queue, total_tokens): next_page_id = 0 heap = queue.PriorityQueue() output = open(output_fname, 'a') cache = "" start_time = None processed_tokens = 0 while True: page_id, result = output_queue.get() if start_time is None: # The clock starts to tick when receiving the first package. 
start_time = time.time() # print("Reducer: Get Page Id %d" % page_id) if result is not None: # Put it into the heap heap.put((page_id, result)) # Check the could-be-dumped data in the queue while heap.qsize() > 0: smallest_page_id, result = heap.get() if smallest_page_id == next_page_id: # which means that this page is the next page, thus dump it # print("Reducer: Commit Page Id %d" % next_page_id) processed_tokens += len(result.split(' ')) cache += result next_page_id += 1 else: heap.put((smallest_page_id, result)) break # print("Reducer: Length of Cache Now", len(cache)) if len(cache) > 1000000: # Dump for every 1000000 characters to reduce IO calls output.write(cache) output.flush() cache = '' used_time = int(time.time() - start_time) print("Process %d tokens, %d to go, with speed %0.2f tokens/second," "finished in %0.2f hours" % ( processed_tokens, total_tokens - processed_tokens, processed_tokens / used_time, (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600 )) else: if len(cache) > 0: output.write(cache) output.flush() cache = '' break output.close() def setup_mp(args, tokens, sent_ranges, vokens_path): QUEUE_SIZE = 10000 input_queue = Queue(maxsize=QUEUE_SIZE) output_queue = Queue(maxsize=QUEUE_SIZE) workers = [] num_gpu = torch.cuda.device_count() for worker_id in range(args.num_workers): gpu_id = worker_id % num_gpu curr_args = copy.copy(args) curr_args.gpus = (gpu_id,) worker = Process(target=processer, args=(curr_args, input_queue, output_queue)) worker.daemon = True worker.start() workers.append(worker) total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0 reduce = Process(target=reducer, args=(vokens_path, output_queue, total_tokens)) reduce.start() for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)): sents = [] for left, right in sent_ranges[start_id: start_id + args.batch_size]: sents.append(tokens[left: right]) input_queue.put((i, sents)) # Notifying workers the end of input for _ in workers: input_queue.put((-1, None)) # wait for workers to terminate for w in workers: w.join() # Notify the reducer the end of output output_queue.put((-1, None)) # wait for reducer to terminate reduce.join() def segment_sent( tokens, tokenizer, tokens_line_info_path, tokens_sent_info_path ): """ Single-processed segmentation of sentences. We might need to parallel this as well. 
""" with open(tokens_line_info_path) as f: line_starts = list(map(int, f.readlines())) nlp = English() sentencizer = nlp.create_pipe("sentencizer") nlp.add_pipe(sentencizer) sent_starts = [0] now = 0 print("Now, split lines into sentences with Spacy:") for i in tqdm.tqdm(range(len(line_starts) - 1)): start_token_idx = line_starts[i] end_token_idx = line_starts[i + 1] line_tokens = tokens[start_token_idx: end_token_idx] line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens)) line = line.replace("[UNK]", "UNK") doc = nlp(line) sents_len = 0 sents = [] for sent in doc.sents: if i < 2: print(sent) sent = str(sent) sents.append(sent) words = sent.split(' ') sent_len = len(words) now += sent_len sent_starts.append(now) sents_len += sent_len if sents_len != len(line_tokens): print(sents_len) print(sents) print(len(line_tokens)) print(line) assert False assert sent_starts[-1] == end_token_idx with open(tokens_sent_info_path, 'w') as f: for sent_start in sent_starts: f.write(str(sent_start) + "\n") if __name__ == "__main__": parser = argparse.ArgumentParser() # Text parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw') # Models parser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4', help='The directory saved the model (containing' 'BEST.pth.model).') parser.add_argument('--output', type=str, default=None, help='The directory to save the extracted feature keys.' '"None" would save in the "load" dir') parser.add_argument('--tokenizer-name', type=str, default='roberta-base') # Vision: Define the vokens set parser.add_argument('--image-sets', type=str, default='vg_nococo', help='The splits of images to be extracted') parser.add_argument('--max-img-num', type=int, default=50000, help='number of images used. -1 means all images.') # Speed Up Options: parser.add_argument('--num-workers', type=int, default=-1, help='-1 will use all GPUs.') parser.add_argument('--batch-size', type=int, default=16, help='The # of sentences in a batch.') args = parser.parse_args() if args.num_workers == -1: args.num_workers = torch.cuda.device_count() if args.output is None: args.output = os.path.join(args.load, 'vokens') os.makedirs(args.output, exist_ok=True) dset_name = os.path.split(args.corpus)[-1] img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')]) print() print("Main Th" "read: Build a virtual vokenizer to check the number of images.") keys_dir = args.load + '/keys' # Save the keys with the model dict virtual_vokenizer = Vokenizer( None, None, keys_dir, img_sets=img_sets, max_img_num=args.max_img_num, gpus=(-1,), sent_level=('sent' in args.load)) modifier = f".{virtual_vokenizer.img_num}" if virtual_vokenizer.img_num != 50000 else "" vokens_path = os.path.join( args.output, f"{dset_name}.{'_'.join(img_sets)}{modifier}" ) tokens_hdf5_path = f'{args.corpus}.{args.tokenizer_name}.hdf5' tokens_sent_info_path = f'{args.corpus}.{args.tokenizer_name}.sent' # "Load" tokens from hdf5 tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r') tokens = tokens_hdf5['tokens'] # Calibrate the start line if the vokens have been proceeded. 
    if not os.path.exists(tokens_sent_info_path):
        tokens_line_info_path = f'{args.corpus}.{args.tokenizer_name}.line'
        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)
        segment_sent(
            tokens,
            tokenizer,
            tokens_line_info_path,
            tokens_sent_info_path
        )

    # Load sent info and find the start sentence
    with open(tokens_sent_info_path) as f:
        sent_starts = list(map(int, f.readlines()))

    # Skip the sentences which have already been extracted.
    extracted_tokens = 0
    if os.path.isfile(vokens_path):
        with open(vokens_path, 'r') as g:
            for g_line in tqdm.tqdm(g):
                extracted_tokens += len(g_line.strip().split(' '))
    try:
        start_sent_idx = sent_starts.index(extracted_tokens)
    except ValueError as e:
        print("The number of extracted tokens does not match a sentence start.")
        print(e)
        exit(1)     # start_sent_idx would be undefined below; stop here.

    # Start to vokenize
    print("Main Thread: Dump visual tokens to %s" % vokens_path)
    print("Main Thread: Start vokenization from the %d'th token" % sent_starts[start_sent_idx])
    sent_ranges = []
    for i in range(start_sent_idx, len(sent_starts) - 1):
        left_token_idx = sent_starts[i]
        right_token_idx = sent_starts[i + 1]
        sent_ranges.append((left_token_idx, right_token_idx))
    setup_mp(args, tokens, sent_ranges, vokens_path)

    # Save into an hdf5 file
    if os.path.exists(vokens_path + '.hdf5'):
        print("The hdf5 file %s already exists, so it is not converted again." % (vokens_path + '.hdf5'))
    else:
        with open(args.corpus + '.' + args.tokenizer_name + ".sent") as f:
            for i, line in enumerate(f):
                pass
            num_tokens = int(line)
            num_sents = i
        h5_file = h5py.File(vokens_path + '.hdf5', 'w')
        dset = h5_file.create_dataset("vokens", (num_tokens,), dtype='int32')
        dump_interval = 100000
        dump_iter = 0
        lines = 0
        with open(vokens_path) as f:
            tokens = []
            for line in tqdm.tqdm(f, total=num_sents):
                for token in map(int, line.split(' ')):
                    tokens.append(token)
                if len(tokens) >= dump_interval:
                    dset[dump_iter: dump_iter + len(tokens)] = tokens
                    dump_iter += len(tokens)
                    tokens = []
                lines += 1
            dset[dump_iter: dump_iter + len(tokens)] = tokens
            dump_iter += len(tokens)
        assert num_tokens == dump_iter
        print(lines, num_sents)
        assert lines == num_sents
        h5_file.close()


================================================
FILE: xmatching/__init__.py
================================================


================================================
FILE: xmatching/data.py
================================================
# coding=utf-8
import json
from pathlib import Path
import random

from torch.utils.data import Dataset
from torchvision.datasets.folder import default_loader

from PIL import Image
Image.MAX_IMAGE_PIXELS = None

TINY_IMG_NUM = 1000
FAST_IMG_NUM = 10000

lxrt_imgsplits = {
    'mscoco_train',
    'mscoco_nominival',
    'vgnococo',
    'mscoco_minival',
}
lxrt_langsplits = {
    'mscoco', 'vg', 'vqa', 'gqa', 'visual7w'
}
cc_imgsplits = {
    'cc_train': 'training.tsv',
    'cc_valid': 'validation.tsv',
}
cc_langsplits = {
    'cc',
}

CC_ROOT = 'data/cc'
COCO_ROOT = 'data/mscoco'
VG_ROOT = '/ssd-playpen/data/vg'
LXRT_ROOT = 'data/lxmert'


def make_uid(img_id, source, sent_id):
    """ see the descriptions in function 'make_datum' """
    return "%s:%s:%s" % (img_id, source, sent_id)


def get_img_path(source, img_id):
    if source == 'cc':
        split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (CC_ROOT, split_tag, img_id)
    elif 'COCO' in img_id:
        _, split_tag, _ = img_id.split('_')
        return "%s/images/%s/%s" % (COCO_ROOT, split_tag, img_id + '.jpg')
    else:   # VG images
        return "%s/images/%s.jpg" % (VG_ROOT, img_id)


def make_datum(source: str, img_id: str, sent_id: int, sent: str):
    """
    Create a datum from the provided infos.
:param source: the dataset of the particular sentence. :param img_id: id of the image :param sent_id: id of the sentence (of the image) :param sent: the sentence :return: a dict of datum """ uid = make_uid(img_id, source, sent_id) img_path = get_img_path(source, img_id) return { 'uid': uid, 'img_id': img_id, 'img_path': img_path, 'sent': sent, } class ImgSentDataset: def __init__(self, img_splits: str, lang_splits: str, tiny=False, fast=False): """ :param split: train, valid, test :param sources: The data sources to be loaded, separated by comma. from: mscoco, cc, vg, vqa, gqa, visual7w 'vg' stands for visual genome captions 'cc' stands for conceptual captions. example: 'mscoco, vg' """ self.img_splits = [img_split.lower().strip() for img_split in img_splits.split(',')] self.lang_splits = [lang_split.lower().strip() for lang_split in lang_splits.split(',')] self.data = [] debug_imgs = -1 if tiny: debug_imgs = TINY_IMG_NUM elif fast: debug_imgs = FAST_IMG_NUM # Loading LXRT data (i.e., COCO Cap, VQA, GQA, VG Cap, VG QA (visual7w)) lxrt_data = [] lxrt_path = Path(LXRT_ROOT) for img_split in self.img_splits: if img_split in lxrt_imgsplits: fname = img_split + ".json" if debug_imgs > 0 and fname != 'mscoco_nominival.json' \ and fname != 'mscoco_minival.json': # Only load nominival when debugging continue lxrt_data.extend(json.load((lxrt_path / fname).open())) for i, lxrt_datum in enumerate(lxrt_data): img_id = lxrt_datum['img_id'] for lang_split in self.lang_splits: if lang_split in lxrt_datum['sentf']: sents = lxrt_datum['sentf'][lang_split] for j, sent in enumerate(sents): self.data.append(make_datum(lang_split, img_id, j, sent)) if debug_imgs > 0: # Only load one sentence if debugging break if i+1 == debug_imgs: # Load top #debug_imgs images break # Loading Conceptual Caption (CC) data for img_split in self.img_splits: if img_split in cc_imgsplits: cc_path = Path(CC_ROOT) for fname in cc_imgsplits[img_split]: for i, line in enumerate((cc_path / fname).open()): sent, img_id = line.split('\t') self.data.append(make_datum('cc', img_id.strip(), 0, sent)) if i+1 == debug_imgs: break def __len__(self): return len(self.data) def __getitem__(self, item): return self.data[item] def shuffle(self): random.seed(9595) random.shuffle(self.data) class ImgSentTorchDataset(Dataset): def __init__(self, dataset: ImgSentDataset, img_transform, tokenizer, sent_len: int): super().__init__() self.raw_dataset = dataset self.img_transform = img_transform self.tokenizer = tokenizer self.sent_len = sent_len def __len__(self): return len(self.raw_dataset) def __getitem__(self, item: int): datum = self.raw_dataset[item] uid = datum['uid'] img_id = datum['img_id'] img_path = datum['img_path'] sent = datum['sent'] # Step 1: Load and pre-process the image try: pil_img = default_loader(img_path) except Exception as e: print(e) print(img_path) return self.__getitem__((item + 95) % self.__len__()) tensor_img = self.img_transform(pil_img) # Step 2: Tokenization (to integers) and Padding encoded_sent = self.tokenizer.encode_plus( sent, add_special_tokens=True, max_length=self.sent_len, truncation=True, # pad_to_max_length=True, padding='max_length', return_tensors='pt' # Return PyTorch (pt) tensors ) input_ids = encoded_sent['input_ids'].squeeze() attention_mask = encoded_sent['attention_mask'].squeeze() # print('sent', sent) # print('input_ids', input_ids) # print('attention_mask', attention_mask) return uid, (input_ids, attention_mask, ), (tensor_img, ) ================================================ FILE: 
FILE: xmatching/frozen_batch_norm.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
# Note: This file is copied from
# https://github.com/facebookresearch/detectron2/blob/master/detectron2/layers/batch_norm.py
# to avoid any future change from that project.
import torch
from torch import nn
from torch.nn import functional as F


class FrozenBatchNorm2d(nn.Module):
    """
    BatchNorm2d where the batch statistics and the affine parameters are fixed.

    It contains non-trainable buffers called
    "weight", "bias", "running_mean", and "running_var",
    initialized to perform the identity transformation.

    The pre-trained backbone models from Caffe2 only contain "weight" and "bias",
    which are computed from the original four parameters of BN.
    The affine transform `x * weight + bias` will perform the equivalent
    computation of `(x - running_mean) / sqrt(running_var) * weight + bias`.
    When loading a backbone model from Caffe2, "running_mean" and "running_var"
    will be left unchanged as the identity transformation.

    Other pre-trained backbone models may contain all 4 parameters.

    The forward is implemented by `F.batch_norm(..., training=False)`.
    """

    _version = 3

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features) - eps)

    def forward(self, x):
        if x.requires_grad:
            # When gradients are needed, F.batch_norm will use extra memory
            # because its backward op computes gradients for weight/bias as well.
            scale = self.weight * (self.running_var + self.eps).rsqrt()
            bias = self.bias - self.running_mean * scale
            scale = scale.reshape(1, -1, 1, 1)
            bias = bias.reshape(1, -1, 1, 1)
            return x * scale + bias
        else:
            # When gradients are not needed, F.batch_norm is a single fused op
            # and provides more optimization opportunities.
            return F.batch_norm(
                x,
                self.running_mean,
                self.running_var,
                self.weight,
                self.bias,
                training=False,
                eps=self.eps,
            )

    def __repr__(self):
        return "FrozenBatchNorm2d(num_features={}, eps={})".format(self.num_features, self.eps)

    @classmethod
    def convert_frozen_batchnorm(cls, module):
        """
        Convert BatchNorm/SyncBatchNorm in module into FrozenBatchNorm.

        Args:
            module (torch.nn.Module):

        Returns:
            If module is BatchNorm/SyncBatchNorm, returns a new module.
            Otherwise, in-place convert module and return it.

        Similar to convert_sync_batchnorm in
        https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py
        """
        bn_module = nn.modules.batchnorm
        bn_module = (bn_module.BatchNorm2d, bn_module.SyncBatchNorm)
        res = module
        if isinstance(module, bn_module):
            res = cls(module.num_features)
            if module.affine:
                res.weight.data = module.weight.data.clone().detach()
                res.bias.data = module.bias.data.clone().detach()
            res.running_mean.data = module.running_mean.data
            res.running_var.data = module.running_var.data
            res.eps = module.eps
        else:
            for name, child in module.named_children():
                new_child = cls.convert_frozen_batchnorm(child)
                if new_child is not child:
                    res.add_module(name, new_child)
        return res


================================================
FILE: xmatching/loss.py
================================================
import torch


def hinge(x):
    return torch.clamp(x, min=0.)
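

# Illustrative note (added; not in the original file): `hinge` is the
# elementwise ramp max(0, x), e.g.
#   hinge(torch.tensor([-0.3, 0.7]))  ->  tensor([0.0000, 0.7000])
# Both ranking losses below use it so that a matched (token, image) score
# is only free of penalty once it exceeds the mismatched score by `margin`.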


def paired_hinge_rank_loss(
        lang_output: torch.Tensor,
        visn_output: torch.Tensor,
        lang_mask: torch.Tensor,
        margin: float,
):
    """
    Consider the first half as positive and the second half as negative.

    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param margin: margin in the ranking loss
    :return: a scalar loss
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]
    assert margin > 0.

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)      # [b, 1, hid_dim]

    # Split to positive and negative sets.
    half_batch_size = batch_size // 2
    pos_lang, neg_lang = torch.split(lang_output, half_batch_size, dim=0)
    pos_visn, neg_visn = torch.split(visn_output, half_batch_size, dim=0)

    # Calculate positive and negative scores.
    true_pos_score = (pos_lang * pos_visn).sum(-1)      # [batch_size / 2, max_len]
    true_neg_score = (neg_lang * neg_visn).sum(-1)      # [batch_size / 2, max_len]
    false_pos_score = (pos_lang * neg_visn).sum(-1)     # [batch_size / 2, max_len]
    false_neg_score = (neg_lang * pos_visn).sum(-1)     # [batch_size / 2, max_len]

    # Hinge Loss
    float_lang_mask = lang_mask.type(lang_output.dtype)     # Either fp16 or fp32
    pos_lang_mask, neg_lang_mask = torch.split(float_lang_mask, half_batch_size, dim=0)
    pos_loss = hinge(margin - true_pos_score + false_pos_score) * pos_lang_mask
    neg_loss = hinge(margin - true_neg_score + false_neg_score) * neg_lang_mask

    # Averaging
    cnt = float_lang_mask.sum()     # Number of words.
    loss = (pos_loss.sum() + neg_loss.sum()) / cnt

    return loss


def batchwise_hinge_rank_loss(
        lang_output: torch.Tensor,
        visn_output: torch.Tensor,
        lang_mask: torch.Tensor,
        margin: float,
):
    """
    Consider all un-matched pairs in the batch as negative samples.

    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param margin: margin in the ranking loss
    :return: a scalar loss
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]
    assert margin > 0.

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]

    # The score of positive pairs.
    # (Bug fix: `visn_output` is already [b, 1, dim] after the unsqueeze above;
    #  the original code unsqueezed it a second time, broadcasting to the wrong shape.)
    positive_score = (lang_output * visn_output).sum(-1)    # [b, max_len]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zeroed out in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)   # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    # Calculation of the hinge rank loss; why it works:
    # The diagonal scores are for positive pairs, so we create a positive_mask to neglect them:
    #       max(0., margin - x^T x + (x^T x - 2 * margin))
    #     = max(0., -margin)
    #     = 0.            , since we have made sure that margin > 0.
    # During backward, the operator max(0., -margin) gives a gradient of 0 to the
    # operand "-margin", which is just what we want.
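    # Concrete check (added for clarity): with margin = 0.5 and diagonal score s,
    # the shifted diagonal entry yields
    #   hinge(0.5 - s + (s - 2 * 0.5)) = hinge(-0.5) = 0.,
    # so a matched pair never contributes to its own "negative" term.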
    float_lang_mask = lang_mask.type(lang_output.dtype)     # Either fp16 or fp32
    # (Bug fix: create the mask with the scores' device/dtype; a bare
    #  torch.eye would live on the CPU and crash when training on GPU.)
    positive_mask = torch.eye(batch_size, dtype=lang_output.dtype, device=lang_output.device)
    negative_scores = negative_scores - positive_mask.unsqueeze(-1) * margin * 2
    lang_loss = hinge(margin - positive_score.unsqueeze(1) + negative_scores) * float_lang_mask.unsqueeze(1)
    visn_loss = hinge(margin - positive_score.unsqueeze(0) + negative_scores) * float_lang_mask.unsqueeze(1)

    # Averaging
    # Each sentence is duplicated batch_size times, so the total length is also multiplied by this term.
    cnt = max(float_lang_mask.sum() * batch_size, 1.)       # Number of words.
    lang_loss = lang_loss.sum() / cnt
    visn_loss = visn_loss.sum() / cnt

    return lang_loss + visn_loss


================================================
FILE: xmatching/main.py
================================================
import collections
import os
import pickle
import sys

import torch
import torch.multiprocessing as mp
import torchvision.transforms as transforms
import torch.nn as nn
import torch.distributed as dist
import tqdm
from transformers import BertTokenizer

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from xmatching.data import ImgSentDataset, ImgSentTorchDataset
from xmatching.loss import paired_hinge_rank_loss
from xmatching.metric import batchwise_accuracy, batchwise_recall
from xmatching.model import LangModel, VisnModel, JointModel, LANG_MODELS
from xmatching.param import parse_args


def is_port_in_use(port):
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0


def main():
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    port = 9595
    while is_port_in_use(port):
        port += 1
    print("Use port", port)
    os.environ['MASTER_PORT'] = str(port)

    # Using all available gpus for multi-processing distributed training.
    args = parse_args()
    args.gpus = torch.cuda.device_count()
    print("Use gpus ", list(range(args.gpus)))
    args.world_size = args.gpus * args.nodes
    # mp.spawn(setup, nprocs=args.gpus, args=(args,))
    mp.spawn(train, nprocs=args.gpus, args=(args,))


def train(gpu, args):
    device = torch.device('cuda', gpu)
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )

    # Models
    lang_layers = list(map(lambda x: -int(x), args.lang_layers.split(',')))     # The layers concatenated as the output.
    lang_model = LangModel(args.dim, arch=args.lang, layers=lang_layers,
                           pretrained=args.lang_pretrained, finetuning=args.lang_finetune)
    visn_model = VisnModel(args.dim, arch=args.visn,
                           pretrained=args.visn_pretrained, finetuning=args.visn_finetune)

    # The use of a joint model helps synchronization in distributed learning.
    model = JointModel(lang_model, visn_model)

    # Since we disallow the broadcast of buffers in DDP, we want to make sure
    # that there are no buffers besides batch normalization and position ids.
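    # (Added note: the FrozenBatchNorm statistics ('bn*' / 'downsample*') and
    #  BERT's "position_ids" are constant and identical on every process, so
    #  skipping the per-iteration buffer broadcast is safe and saves bandwidth.)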
    for name, buffer in model.named_buffers():
        assert 'bn' in name or 'downsample' in name or "position_ids" in name

    if args.load is not None:
        state_dict = torch.load(args.load, map_location=device)
        new_state_dict = {}
        for key, value in state_dict.items():
            # If the ddp state_dict is saved
            if 'num_batches_tracked' not in key:
                if key.startswith("module."):
                    new_state_dict[key[len("module."):]] = state_dict[key]
                else:
                    new_state_dict[key] = state_dict[key]
        model_keys = set(model.state_dict().keys())
        load_keys = set(new_state_dict.keys())
        print("Keys in model but not in load:")
        for key in sorted(model_keys - load_keys):
            print(key)
        print("Keys in load but not in model:")
        for key in sorted(load_keys - model_keys):
            print(key)
        model.load_state_dict(new_state_dict)

    # Pre-processing Hyper-Params
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize
    ])
    valid_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize
    ])
    Model, Tokenizer, weight = LANG_MODELS[args.lang]
    tokenizer = Tokenizer.from_pretrained(weight)
    # tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    max_len = args.max_len

    # Dump the pre-processing objs for future feature extractions.
    if gpu == 0:
        pickle.dump(tokenizer, open(
            os.path.join(args.output, 'tokenizer.pkl'), 'wb'))
        pickle.dump(valid_transform, open(
            os.path.join(args.output, 'img_transform.pkl'), 'wb'))

    # Data Sets
    train_set = ImgSentDataset(args.train_imgs, args.train_langs,
                               tiny=args.tiny, fast=args.fast)
    train_tset = ImgSentTorchDataset(
        train_set, train_transform, tokenizer, max_len
    )
    print("GPU %d: load %d data in training." % (gpu, len(train_set)))
    valid_set = ImgSentDataset(args.valid_imgs, args.valid_langs,
                               tiny=args.tiny, fast=args.fast)
    valid_set.shuffle()     # Valid set only gets shuffled once!!!
    print("GPU %d: load %d data in validation." % (gpu, len(valid_set)))
    valid_tset = ImgSentTorchDataset(
        valid_set, valid_transform, tokenizer, max_len
    )
    print()

    # Data Loader
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_tset,
        num_replicas=args.world_size,
        rank=rank,
        shuffle=True,
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=train_tset,
        batch_size=(args.batch_size // args.world_size),
        shuffle=False,      # Will be shuffled in the sampler.
        num_workers=max(args.num_workers // args.world_size, 1),
        pin_memory=True,
        sampler=train_sampler,
        drop_last=True
    )
    valid_loader = torch.utils.data.DataLoader(
        dataset=valid_tset,
        batch_size=256,     # Fix batch_size to have stable batchwise evaluations.
        shuffle=False,
        num_workers=args.num_workers,
        pin_memory=True,
        drop_last=True
    )

    if args.optim == 'bert':
        from transformers import AdamW, get_linear_schedule_with_warmup
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": 0.01,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr, eps=1e-8)
        t_total = len(train_loader) * args.epochs
        warmup_steps = int(t_total * args.warmup_ratio)
        print("Train for %d steps and warm up for %d steps" % (t_total, warmup_steps))
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
        )
    else:
        if args.optim == 'sgd':
            optimizer = args.optimizer(
                [param for param in model.parameters() if param.requires_grad],
                args.lr,
                momentum=0.9
            )
        else:
            optimizer = args.optimizer(
                [param for param in model.parameters() if param.requires_grad],
                args.lr,
                # momentum=0.9
            )

    # Loss and optimizer
    criterion = paired_hinge_rank_loss
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    if args.fp16:
        try:
            from apex import amp
            from apex.parallel import DistributedDataParallel as DDP
            model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
            # By default, the current apex DDP does not broadcast the buffers.
            model = DDP(model)
        except Exception as e:
            print(e)
            print("Please install the apex library")
            return
    else:
        # Note that we disallow broadcasting buffers here to reduce communication cost.
        model = nn.parallel.DistributedDataParallel(
            model, device_ids=[gpu],
            find_unused_parameters=True,
            broadcast_buffers=False,
        )

    if args.test_only or args.load:
        # Test the loading performance
        if gpu == 0:
            print("Test: GPU %d will test %d data in %d iterations." % (
                gpu, len(valid_loader) * 256, len(valid_loader)))
            results = valid(args, model, criterion, valid_loader)
            print("Initial test results:")
            for key, value in results.items():
                print('\t%s: %0.4f' % (key, value))
        if args.test_only:
            exit()

    best_valid_loss = 9595.
    for epoch in range(args.epochs):
        if gpu == 0:
            print("Training of Epoch %d: GPU %d will process %d data in %d iterations." % (
                epoch, gpu, len(train_loader) * args.batch_size // args.world_size, len(train_loader)))
        prev_loss = total_loss = 0.
        for i, (uid, lang_input, visn_input) in enumerate(tqdm.tqdm(train_loader, disable=(gpu != 0))):
            # Currently, lang_input is (input_ids, attention_mask)
            #            and visn_input is (tensor_img,)
            lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)
            visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)

            # Forward pass
            model.zero_grad()
            lang_output, visn_output = model(lang_input, visn_input)
            loss = criterion(lang_output, visn_output, lang_input[1], args.margin)
            total_loss += loss.item()

            # Backward
            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            # Step
            if args.fp16:
                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 5.)
            else:
                torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
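            # Update the parameters; for the 'bert' optimizer, also advance the
            # linear warmup/decay schedule by one step.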
            optimizer.step()
            if args.optim == 'bert':
                scheduler.step()

            # # Logging
            # interval = 100
            # if (i+1) % interval == 0:
            #     print("GPU %d Epoch %d Iter %d: Training Loss %0.4f" %
            #           (gpu, epoch, i+1, (total_loss - prev_loss) / interval))
            #     prev_loss = total_loss

        if gpu == 0:
            print("GPU %d Epoch %d: Total Training Loss %0.4f" % (gpu, epoch, total_loss / len(train_loader)))
            print()
            print("Validation: GPU %d will process %d data in %d iterations." % (
                gpu, len(valid_loader) * 256, len(valid_loader)))
            results = valid(args, model, criterion, valid_loader, use_tqdm=True)
            for key, value in results.items():
                print('\t%s: %0.4f' % (key, value))
            if results['loss'] < best_valid_loss:
                best_valid_loss = results['loss']
                snap_path = os.path.join(args.output, 'BEST.pth')
                print("GPU 0: Save snapshot to ", snap_path)
                torch.save(model.module.state_dict(), snap_path)
                torch.save(model.module, snap_path + '.model')
            print("BEST valid loss %0.4f" % best_valid_loss)
            print()


def valid(args, model, criterion, valid_loader, use_tqdm=True):
    model.eval()
    results = collections.defaultdict(lambda: 0)
    iterator = tqdm.tqdm(valid_loader) if use_tqdm else valid_loader
    for i, (uid, lang_input, visn_input) in enumerate(iterator):
        # Currently, lang_input is (input_ids, attention_mask)
        #            and visn_input is (tensor_img,)
        lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)
        visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)

        with torch.no_grad():
            # Forward pass
            lang_output, visn_output = model(lang_input, visn_input)

            # Evaluation
            results['loss'] += criterion(lang_output, visn_output, lang_input[1], args.margin).item()
            recall_results = batchwise_recall(lang_output, visn_output, lang_input[1], recalls=(1, 5, 10))
            for key, value in recall_results.items():
                results['R%d' % key] += value
    for key in results:
        results[key] = results[key] / len(valid_loader)
    model.train()
    return results


if __name__ == "__main__":
    main()


================================================
FILE: xmatching/metric.py
================================================
import torch


def batchwise_accuracy(lang_output, visn_output, lang_mask):
    """
    Calculate the accuracy of contextual word retrieval, averaged over the batch.

    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :return:
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)      # [b, 1, dim]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zeroed out in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)   # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    max_neg_score, max_neg_idx = negative_scores.max(1)     # [batch, max_len], the batch_idx of the max-aligned img
    pos_idx = torch.arange(0, batch_size, dtype=torch.int64).to(lang_output.device)
    correct = (pos_idx.unsqueeze(1) == max_neg_idx)

    bool_lang_mask = lang_mask.type(correct.dtype)
    correct = correct * bool_lang_mask
    correct_num = correct.sum()

    accuracy = correct_num * 1. / bool_lang_mask.sum()
    return accuracy


def batchwise_recall(lang_output, visn_output, lang_mask, recalls=(1,)):
    """
    Calculate the recall of contextual word retrieval, averaged over the batch.
    :param lang_output: [batch_size, max_len, hid_dim]
    :param visn_output: [batch_size, hid_dim]
    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.
    :param recalls: a list of the recall cutoffs to be evaluated.
    :return:
    """
    batch_size, lang_len, dim = lang_output.shape
    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]

    # Expand the visn_output to match each word
    visn_output = visn_output.unsqueeze(1)      # [b, 1, dim]

    # The score of positive pairs
    positive_score = (lang_output * visn_output).sum(-1)    # [b, max_len]

    # The score of negative pairs. Note that the diagonal is actually the positive score,
    # but it would be zeroed out in calculating the loss below.
    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *
                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)   # [b(lang), b(visn), max_len]
    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)

    result = {}
    for recall in recalls:
        kthscore, kthidx = torch.kthvalue(negative_scores, batch_size - recall, dim=1)   # [b, max_len]
        # print(kthscore.shape)
        # print(positive_score.shape)
        correct = (positive_score >= kthscore)      # [b, max_len]

        bool_lang_mask = lang_mask.type(correct.dtype)
        correct = correct * bool_lang_mask
        correct_num = correct.sum()
        # print(correct_num)
        # print(bool_lang_mask.sum())

        result[recall] = (correct_num * 1. / bool_lang_mask.sum()).item()
    return result


================================================
FILE: xmatching/model.py
================================================
import torch
from torch import nn
import torchvision.models as models
from transformers import *

from .frozen_batch_norm import FrozenBatchNorm2d

LANG_MODELS = {
    'bert': (BertModel, BertTokenizer, 'bert-base-uncased'),
    'bert-large': (BertModel, BertTokenizer, 'bert-large-uncased'),
    'gpt': (OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
    'gpt2': (GPT2Model, GPT2Tokenizer, 'gpt2'),
    'ctrl': (CTRLModel, CTRLTokenizer, 'ctrl'),
    'xl': (TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
    'xlnet': (XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
    'xlm': (XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),
    'distil': (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
    'roberta': (RobertaModel, RobertaTokenizer, 'roberta-base'),
    'xlm-roberta': (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
}


def get_visn_arch(arch):
    try:
        return getattr(models, arch)
    except AttributeError as e:
        print(e)
        print("There is no arch %s in torchvision." % arch)
        raise   # Bug fix: returning None here would crash confusingly at the call site.


class VisnModel(nn.Module):
    def __init__(self, dim, arch='resnet50', pretrained=True, finetuning=False):
        """
        :param dim: dimension of the output
        :param arch: backbone architecture
        :param pretrained: load feature with pre-trained vector
        :param finetuning: finetune the model
        """
        super().__init__()
        self.finetuning = finetuning

        # Setup Backbone
        resnet = get_visn_arch(arch)(pretrained=pretrained)
        backbone_dim = resnet.fc.in_features
        if not self.finetuning:
            for param in resnet.parameters():
                param.requires_grad = False
        resnet.fc = nn.Identity()
        self.backbone = resnet

        # Surgery on the Networks
        # 1. Frozen Batch Norm
        #    Note that BatchNorm modules have been in-place replaced!
        #    This piece of code is copied from Detectron2, which possibly took it from mask-rcnn.
        self.backbone = FrozenBatchNorm2d.convert_frozen_batchnorm(self.backbone)
        # print(self.backbone)
        # 2. Freeze the first two (blocks of) layers
        for module in [self.backbone.conv1, self.backbone.layer1]:
            for param in module.parameters():
                param.requires_grad = False

        print(f"Visn Model: {arch}, Finetune: {finetuning}, Pre-trained: {pretrained}")
        print(f"Visn Model: backbone dim {backbone_dim} --> output dim {dim}")

        # Setup follow-up layers
        self.mlp = nn.Sequential(
            nn.Linear(backbone_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, dim),
        )

    def forward(self, img):
        """
        :param img: a tensor of shape [batch_size, C, H, W]
        :return: a tensor of [batch_size, d]
        """
        if not self.finetuning:
            with torch.no_grad():
                x = self.backbone(img)
                x = x.detach()
        else:
            x = self.backbone(img)
        x = self.mlp(x)     # [b, dim]
        x = x / x.norm(2, dim=-1, keepdim=True)
        return x


class LangModel(nn.Module):
    def __init__(self, dim, arch='BERT', layers=(-1,), pretrained=True, finetuning=False):
        """
        :param dim: dimension of the output
        :param arch: backbone architecture
        :param layers: the hidden layers whose outputs are concatenated, e.g., (-1,) for the last layer
        :param pretrained: load feature with pre-trained vector
        :param finetuning: finetune the model
        """
        super().__init__()
        self.finetuning = finetuning

        # Setup Backbone
        Model, Tokenizer, weight = LANG_MODELS[arch]
        bert = Model.from_pretrained(
            weight,
            output_hidden_states=True
        )
        if not pretrained:
            bert.init_weights()
        if not self.finetuning:
            for param in bert.parameters():
                param.requires_grad = False
        backbone_dim = bert.config.hidden_size
        self.backbone = bert
        self.layers = sorted(layers)

        print(f"Language Model: {arch} with weight {weight}; Fine-tuning: {finetuning}, Pre-trained: {pretrained}.")
        print(f"Language Model: using layers {self.layers}, resulting in backbone dim {backbone_dim * len(self.layers)} "
              f"--> output dim {dim}.")

        # Setup follow-up layers
        self.mlp = nn.Sequential(
            nn.Linear(backbone_dim * len(self.layers), 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, dim),
        )

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        """
        :param input_ids: [batch_size, max_len]
        :param attention_mask: [batch_size, max_len]
        :param token_type_ids: [batch_size, max_len]
        :return: [batch_size, max_len, dim]
        """
        if not self.finetuning:
            with torch.no_grad():
                x = self.backbone(
                    input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                )
        else:
            x = self.backbone(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )

        # sequence_output, pooled_output, (hidden_states), (attentions) --> seq_output
        if type(self.backbone) is XLNetModel:
            output, hidden_states = x[:2]
        else:
            output, pooled_output, hidden_states = x[:3]

        # Gather the layers
        if type(self.backbone) is XLNetModel:
            x = torch.cat(list(hidden_states[layer].permute(1, 0, 2) for layer in self.layers), -1)
        else:
            x = torch.cat(list(hidden_states[layer] for layer in self.layers), -1)
        if not self.finetuning:
            x = x.detach()

        # [batch_size, max_len, backbone_dim] --> [batch_size, max_len, output_dim]
        x = self.mlp(x)
        x = x / x.norm(2, dim=-1, keepdim=True)
        return x


class JointModel(nn.Module):
    def __init__(self, lang_model, visn_model):
        super().__init__()
        self.lang_model = lang_model
        self.visn_model = visn_model

    def forward(self, lang_input, visn_input):
        lang_output = self.lang_model(*lang_input)
        visn_output = self.visn_model(*visn_input)
        return lang_output, visn_output


================================================
FILE: xmatching/param.py
================================================
# coding=utf-8
# Copyleft 2020 project COL.
# Copyleft 2019 project LXRT.
import argparse
import random

import numpy as np
import torch


def get_optimizer(optim):
    # Bind the optimizer
    if optim == 'rms':
        # print("Optimizer: Using RMSProp")
        optimizer = torch.optim.RMSprop
    elif optim == 'adam':
        # print("Optimizer: Using Adam")
        optimizer = torch.optim.Adam
    elif optim == 'adamax':
        # print("Optimizer: Using Adamax")
        optimizer = torch.optim.Adamax
    elif optim == 'sgd':
        # print("Optimizer: sgd")
        optimizer = torch.optim.SGD
    elif 'bert' in optim:
        optimizer = 'bert'      # The bert optimizer will be bound later.
    else:
        assert False, "Please add your optimizer %s in the list." % optim
    return optimizer


def parse_args():
    parser = argparse.ArgumentParser()

    # Data Splits
    parser.add_argument("--sources", default='mscoco', help="mscoco, cc, vg, vqa, gqa, visual7w")
    # ('vgnococo' matches lxrt_imgsplits in data.py; the former default 'vg_nococo'
    #  would be silently skipped when loading.)
    parser.add_argument("--train-imgs", default='mscoco_train,mscoco_nominival,vgnococo')
    parser.add_argument("--valid-imgs", default='mscoco_minival')
    parser.add_argument("--train-langs", default='mscoco',
                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w, '
                             'split by comma.')
    parser.add_argument("--valid-langs", default='mscoco',
                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w, '
                             'split by comma.')
    parser.add_argument("--test", default=None)
    parser.add_argument("--test-only", action='store_true')

    # Datasets Configuration
    parser.add_argument("--fast", action='store_true')
    parser.add_argument("--tiny", action='store_true')
    parser.add_argument("--max-len", default=20, type=int)

    # Training Hyper-parameters
    parser.add_argument('--batchSize', dest='batch_size', type=int, default=256)
    parser.add_argument('--optim', default='bert')
    parser.add_argument('--lr', type=float, default=1e-4)
    parser.add_argument('--warmup-ratio', type=float, default=0.05)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--dropout', type=float, default=0.1)
    parser.add_argument('--seed', type=int, default=9595, help='random seed')
    parser.add_argument("--fp16", action='store_true')

    # Model Hyper-parameters
    parser.add_argument('--visn', type=str, default='resnext101_32x8d',
                        help='The vision backbone model.')
    parser.add_argument('--lang', type=str, default='bert',
                        help='The language backbone model.')
    parser.add_argument('--lang-layers', type=str, default='-1',
                        help='The language-backbone layers concatenated as the output, split by comma.')
    parser.add_argument('--dim', type=int, default=64,
                        help='The output dim of the joint emb.')

    # Model Loading
    parser.add_argument('--load', type=str, default=None,
                        help='Load the model (usually the fine-tuned model).')
    parser.add_argument('--lang-finetune', action='store_true',
                        help='Finetune the language encoder.')
    parser.add_argument('--visn-finetune', action='store_true',
                        help='Finetune the visual encoder.')
    parser.add_argument('--lang-pretrained', action='store_true',
                        help='Use the pre-trained language encoder.')
    parser.add_argument('--visn-pretrained', action='store_true',
                        help='Use the pre-trained visual encoder.')

    # Optimization
    parser.add_argument("--margin", default=0.5, type=float,
                        help='The margin in the hinge losses.')
    parser.add_argument("--loss", dest='loss', default='paired_hinge', type=str)

    # Training configuration
    parser.add_argument("--num-workers", default=0, type=int)
    parser.add_argument('--output', type=str, default='snap/test')

    # Distributed Training Configuration
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')

    # Parse the arguments.
    args = parser.parse_args()

    # Bind the optimizer class.
    args.optimizer = get_optimizer(args.optim)

    # Set seeds
    torch.manual_seed(args.seed)
    random.seed(args.seed)
    np.random.seed(args.seed)

    return args


# args = parse_args()
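
# Example invocation (hypothetical sketch; the output path is a placeholder,
# and scripts/run_xmatching.bash holds the canonical commands):
#   python xmatching/main.py \
#       --visn resnext101_32x8d --lang bert --lang-layers -1 --dim 64 \
#       --optim bert --lr 1e-4 --epochs 10 \
#       --visn-pretrained --lang-pretrained \
#       --output snap/xmatching/bert_resnext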