[
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2020 Hao Tan\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Vokenization\n\nPyTorch code for the EMNLP 2020 paper \"[Vokenization: Improving Language Understanding with Contextualized, \nVisual-Grounded Supervision](https://arxiv.org/pdf/2010.06775.pdf)\" (Hao Tan and Mohit Bansal).\n\n**Outline**\n* [Contextualized Cross-Modal Matching](#contextualized-cross-modal-matching-xmatching)\n    * [Downloading Image and Captioning Data](#download-image-and-captioning-data)\n    * [Model Training](#training-the-cross-modal-matching-model)\n    * [Benchmark (Optional)](#benchmarking-cross-modal-matching-models-optional)\n* [Vokenization](#vokenization-vokenization)\n    * [Downloading Pure-Language Data](#downloading-and-pre-processing-pure-language-data)\n    * [Extracting Visual Feature](#extracting-image-features)\n    * [Vokenization Process](#the-vokenization-process)\n* [Visually-Supervised Language Model](#visually-supervised-language-model-vlm)\n    * [VLM Pre-training](#pre-training-with-vlm)\n    * [GLUE Evaluation](#glue-evaluation)\n    * [MLM Pre-training (as baselines)](#bert-as-baselines)\n    \n> Note: I recommend to focus on \"Wiki103\" first and \n> ingore the code blocks related to \"English Wikipedia\".\n> \"Eng Wiki\" might take too long to complete.\n\n## Installation\n```shell script\npip install -r requirements.txt\n```\n\nRequire python 3.6 + (to support huggingface [transformers](https://github.com/huggingface/transformers)).\n\n## Contextualized Cross-Modal Matching (xmatching)\nIn this [module](xmatching) (corresponding to Sec 3.2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf)), \nwe want to learn a token-image matching model from sentence-image aligned data (i.e., image captioning data).\nThe model \"contextually\" measures the relevance between tokens (i.e., words) and images.\nThe terminology \"contextual\" emphasize the nature that \nthe sentences (the context) are considered\nwhen measuring the token-image relevance score.\n\n\n### Download Image and Captioning Data\n1. Download MS COCO images:\n    ```shell script\n    # MS COCO (Train 13G, Valid 6G)\n    mkdir -p data/mscoco\n    wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco\n    wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco\n    unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip\n    unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip\n    ```\n   If you already have COCO image on disk. Save them as \n    ```\n    data\n      |-- mscoco\n            |-- images\n                 |-- train2014\n                         |-- COCO_train2014_000000000009.jpg\n                         |-- COCO_train2014_000000000025.jpg\n                         |-- ......\n                 |-- val2014\n                         |-- COCO_val2014_000000000042.jpg\n                         |-- ......\n    ```\n\n2. Download captions (split following the LXMERT project):\n    ```shell script\n    mkdir -p data/lxmert\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/\n    ```\n\n### Training the Cross-Modal Matching Model\nThe model is trained on MS COCO with pairwise hinge loss (details in Sec. 
\nRunning Commands:\n```bash\n# Run the cross-modal matching model with single-machine multi-processing distributed training.\n# \"0,1\" indicates using the GPUs 0 and 1.\n# \"bert_resnext\" is the name of this snapshot, which will be saved at snap/xmatching/bert_resnext.\n# \"--visn resnext101_32x8d\" is the vision backbone.\n# \"--lang bert\" is the language backbone.\n# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.\nbash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert\n```\nThe options `--visn` and `--lang` specify the architectures of the encoders.\nTested options:\n```\n--visn $VISN_MODEL\nVISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152, \n            wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...} \n--lang $LANG_MODEL\nLANG_MODEL={bert, roberta, xlnet, bert-large, ...}\n```\nFor visual backbones, the models in [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) are mostly supported.\nYou might need to handle the last FC layer, because it is defined differently in different backbones\n(a sketch of one way to do this is given after the note below).\nThe language backbones are initialized from huggingface [transformers](https://github.com/huggingface/transformers).\n\n> We found that the results with XLNet are pretty low, but we have not identified\n> the reason. Results of the other backbones are similar.\n
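\nFor example, here is a sketch of replacing the final FC layer with a projection into the joint embedding space for the ResNet/ResNeXt family, where the attribute is named `fc` (other families, e.g., VGG or DenseNet, name their head `classifier` instead). The 64-dim output simply mirrors the `--dim 64` option in `scripts/run_xmatching.bash`:\n```python\nimport torch.nn as nn\nimport torchvision.models as models\n\nbackbone = models.resnext101_32x8d(pretrained=True)\n# Replace the 1000-way ImageNet classifier with a projection head.\nbackbone.fc = nn.Linear(backbone.fc.in_features, 64)\n```\n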
\n## Vokenization (vokenization)\nVokenization is a bridge between the cross-modal (word-and-image) matching models (xmatching) and\nthe visually-supervised language models (vlm).\nThe final goal is to convert the language tokens to related images\n(we call them **vokens**).\nThese **vokens** enable the visual supervision of the language model.\nWe mainly provide pre-processing tools (i.e., feature extraction, tokenization, and vokenization) and\nevaluation tools for the previously trained cross-modal matching models here.\nHere is a diagram of these processes, and we next discuss them one-by-one:\n```\nExtracting Image Features ----> Benchmarking the Matching Models (Optional) --> Vokenization\nDownloading Language Data --> Tokenization -->-->--/\n```\n\n### Downloading and Pre-Processing Pure-Language Data \nWe provide scripts to get the datasets \"wiki103\" and \"wiki\".\nWe denote them as \"XX-cased\" or \"XX-uncased\", where the suffix \"cased\" / \"uncased\" only indicates\nthe property of the raw text.\n1. **Wiki103**. The [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset\nis a selected subset of English Wikipedia, containing around 100M tokens.\n    ```shell script\n    bash data/wiki103/get_data_cased.sh\n    ```\n2. **English Wikipedia**. \nThe scripts to download and process the wiki data are modified from [XLM](https://github.com/facebookresearch/XLM).\nThey will download a 17G file. \nThe speed depends on the network, and it usually takes several hours to filter the data.\nThe process ends with around 2.8B tokens.\n    ```shell script\n    bash data/wiki/get_data_cased.bash en\n    ```\n    Note: *RoBERTa* requires an untokenized version of wiki (otherwise the results would be much lower), \n    so please use the following command:\n    ```shell script\n    bash data/wiki/get_data_cased_untokenized.bash en\n    ```\n   \n> Note: I recommend focusing on \"Wiki103\" first and\n> ignoring the code blocks related to \"English Wikipedia\".\n> \"Eng Wiki\" might take too long to complete.\n   \n### Tokenization of Language Data\nWe next tokenize the language corpus.\nIt locally saves three files: \n\"$dataset_name.$tokenizer_name\", \n\"$dataset_name.$tokenizer_name.hdf5\",\nand \"$dataset_name.$tokenizer_name.line\".\nTaking the wiki103 dataset and the BERT tokenizer as an example, \nwe convert the training file into\n```\ndata \n |-- wiki103-cased \n        |-- wiki.train.raw.bert-base-uncased\n        |-- wiki.train.raw.bert-base-uncased.hdf5\n        |-- wiki.train.raw.bert-base-uncased.line\n```\nThe txt file `wiki.train.raw.bert-base-uncased` saves the tokens; each line in this file is the tokenization of the corresponding line \nin the original file.\nThe hdf5 file `wiki.train.raw.bert-base-uncased.hdf5` stores all the tokens continuously and uses\n`wiki.train.raw.bert-base-uncased.line` to record the starting token index of each line.\nThe \".line\" file has `L+1` lines, where `L` is the number of lines in the original file;\nline `i` of the corpus thus corresponds to the token range `line[i]` to `line[i+1]` in the hdf5 file.\n
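\nAs a quick illustration of this layout, here is a small sketch (not part of the repo) that reads back the tokens of one corpus line; the hdf5 dataset is named \"tokens\", as created in [tokenization/to_hdf5.py](tokenization/to_hdf5.py):\n```python\nimport h5py\n\ndef read_line_tokens(prefix, i):\n    # prefix, e.g., 'data/wiki103-cased/wiki.train.raw.bert-base-uncased'\n    with open(prefix + '.line') as f:\n        starts = [int(s) for s in f]\n    with h5py.File(prefix + '.hdf5', 'r') as f:\n        # Tokens of corpus line i live in the half-open range [starts[i], starts[i+1]).\n        return f['tokens'][starts[i]:starts[i + 1]]\n```\n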
\nCommands:\n1. Wiki103 (around 10 min)\n    ```shell script\n    bash tokenization/tokenize_wiki103_bert.bash \n    ```\n2. English Wikipedia (around 3 hours)\n    ```shell script\n    bash tokenization/tokenize_wiki_bert.bash \n    ```\n\n### Extracting Image Features\nThe image pre-processing extracts the image features to build the keys in the vokenization retrieval process.\n\n#### Download the Visual Genome (VG) images\nSince the MS COCO images are used in training the cross-modal matching model\n(see [xmatching](#contextualized-cross-modal-matching-xmatching)),\nwe use the [Visual Genome](https://visualgenome.org/) images as \ncandidate vokens for retrieval.\nWe here download the images first.\n```shell script\nwget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/\nwget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/\nunzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip\nunzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip\ncd data/vg/images\nmv VG_100K/* .\nmv VG_100K_2/* .\nrm -rf VG_100K VG_100K_2\ncd ../../../\n```\nIf you already have the Visual Genome images on disk, save them as\n```\ndata\n|-- vg\n    |-- images\n         |-- 1000.jpg\n         |-- 1001.jpg\n         |-- ......\n```\n    \n#### Build Universal Image Ids\nWe first build a list of universal image indexes with \n[vokenization/create_image_ids.py](vokenization/create_image_ids.py).\nIt unifies the image ids across different experiments, \nso that the feature arrays stored in hdf5 can be universally indexed.\nThe image ids are saved under a shared path `LOCAL_DIR` (default `data/vokenization`),\nwhich is defined in [vokenization/common.py](vokenization/common.py).\nThe image ids are saved under `data/vokenization/images` with the format `{IMAGE_SET}_ids.txt`.\nWe make sure that all the experiments agree with this meta info,\nso that we do not get different indexing in different retrieval experiments.\n\n> Note: The ids created by [create_image_ids.py](vokenization/create_image_ids.py) only fix the order of the images.\n> The actual images in the dictionary are provided by `extract_keys.bash` and thus correspond to the \n> `_paths.txt` files, because `extract_keys` filters out all broken and non-existing images.\n\nCommands:\n```bash\n# Step 1, Build image orders.\npython vokenization/create_image_ids.py  \n```\n\n#### Extracting Image Features\n\nExtract the image features for the list built above, using the code in \n[vokenization/extract_vision_keys.py](vokenization/extract_vision_keys.py). \nThe code first reads the image ids saved in `data/vokenization/images/{IMAGE_SET}_ids.txt` and locates the images.\nThe features will be saved under `snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5`.\nIt finishes within 1 hour.\n\nCommands:\n```bash\n# Step 2, Extract features. \n# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME \nbash scripts/extract_keys.bash 0 bert_resnext \n```\n\n\n### Benchmarking Cross-Modal Matching Models (Optional)\n> Before evaluating, please make sure that `extracting_image_features` and `tokenization` are completed.\n\nWe benchmark the performance of the cross-modal matching models at a large scale.\nThe evaluation includes two different metrics: diversity and retrieval performance.\n\nDiversity \n(in [vokenization/evaluate_diversity.py](vokenization/evaluate_diversity.py))\nensures that the same [token type](https://arxiv.org/pdf/1902.06006.pdf)\nis mapped to diverse images depending on its context (i.e., the sentence).\nRetrieval \n(in [vokenization/evaluate_retrieval.py](vokenization/evaluate_retrieval.py)) \nmeasures the correspondence between the tokens and the retrieved images.\n\nWe gather these two utilities into one script; the command is:\n```bash\nbash scripts/xmatching_benchmark.bash 0 bert_resnext\n```\n\n### The Vokenization Process\nAfter all these steps, we can start to vokenize the language corpus.\nThe process loads the tokens saved in `dataset_name.tokenizer_name.hdf5` \nand uses the line-split information in `dataset_name.tokenizer_name.line`.\n\nThe code is optimized and can be resumed by simply rerunning it.\nThe vokens will be saved in `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5` by default.\nThe file `snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids` contains the universal image ids \nfor each voken, \ne.g., the image id `vg_nococo/8` corresponds to the 8-th feature\nsaved in `snap/xmatching/bert_resnext/keys/vg_nococo.hdf5`\n(a lookup sketch is given after the note below).\n\n\n> Note: `--tokenizer-name` must be provided in the script.\n
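\nFor illustration, a hypothetical helper that resolves such a voken image id to its stored feature vector might look as follows (the dataset name 'keys' inside the keys hdf5 file is an assumption for this sketch, not checked against the repo code):\n```python\nimport h5py\n\ndef voken_feature(snap_dir, voken_id):\n    # voken_id, e.g., 'vg_nococo/8' -> the 8-th row of keys/vg_nococo.hdf5\n    image_set, index = voken_id.rsplit('/', 1)\n    with h5py.File(f'{snap_dir}/keys/{image_set}.hdf5', 'r') as f:\n        return f['keys'][int(index)]  # assumed dataset name\n```\n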
\nCommands:\n1. Wiki103 (around 1 hour on 4 Titan V)\n    ```shell script\n    # Note: mp is the abbreviation for \"multi-processing\"\n    # bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME\n    bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext\n    ```\n2. English Wikipedia (around 1 day on 4 Titan V)\n    ```shell script\n    # bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME\n    bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext\n    ```\n\n> The script will call\n> [vokenization/vokenize_corpus_mp.py](vokenization/vokenize_corpus_mp.py)\n> to vokenize a corpus. \n> The vokenization happens in [vokenization/vokenization.py](vokenization/vokenization.py), and\n> it uses [vokenization/indexing.py](vokenization/indexing.py) to do the nearest-neighbor search\n> (based on [faiss](https://github.com/facebookresearch/faiss)).\n
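\nConceptually, the retrieval step assigns each contextual token embedding its nearest image key. A torch-only sketch of this step is below (the repo can use faiss for speed; requirements.txt notes a torch-implemented GPU indexing fallback). Treating the relevance score as an inner product of L2-normalized embeddings is an assumption of this sketch:\n```python\nimport torch\nimport torch.nn.functional as F\n\ndef nearest_voken_ids(token_embs, image_keys):\n    # token_embs: (num_tokens, dim) contextual token embeddings\n    # image_keys: (num_images, dim) pre-extracted image features\n    q = F.normalize(token_embs, dim=-1)\n    k = F.normalize(image_keys, dim=-1)\n    return (q @ k.t()).argmax(dim=-1)  # index of the nearest image per token\n```\n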
\n\n## Visually-Supervised Language Model (vlm)\n\n### Pre-Training with VLM\nAs discussed in Sec. 2 of the [paper](https://arxiv.org/pdf/2010.06775.pdf),\nwe use the previously generated vokens to pre-train the model \nwith visual supervision; a sketch of the joint objective is given below.\n
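\nThe following sketch only shows the shape of the joint objective: the usual masked-LM cross-entropy plus a voken-classification cross-entropy, where a voken id is predicted at every token position. The exact weighting and masking in [vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py) are not reproduced here; treating `--mlm_ratio` as a simple weight on the MLM term is an assumption, and `ignore_index=-100` follows the huggingface convention for unmasked positions:\n```python\nimport torch.nn.functional as F\n\ndef vlm_loss(mlm_logits, mlm_labels, voken_logits, voken_labels, mlm_ratio=1.0):\n    # mlm_logits: (batch, seq_len, vocab_size); mlm_labels: (batch, seq_len)\n    mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels, ignore_index=-100)\n    # voken_logits: (batch, seq_len, num_vokens); voken_labels: (batch, seq_len)\n    voken_cls = F.cross_entropy(voken_logits.transpose(1, 2), voken_labels)\n    return mlm_ratio * mlm + voken_cls\n```\n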
\n#### Wiki103 \nAfter the [vokenization process](#the-vokenization-process) of wiki103,\nwe can train the model with the command:\n```shell script\n# bash scripts/small_vlm_wiki103.bash $GPUs $SNAP_NAME\nbash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small\n```\nIt will call \n[vlm/run_vlm_distributed.py](vlm/run_vlm_distributed.py)\nand run a BERT-6Layers-512Hiddens model on the [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)\ndataset with the support of voken supervision.\nThe snapshot will be saved to `snap/vlm/wiki103_bert_small`.\nWe recommend running this Wiki103 experiment first since it will finish \nin a reasonable time (20 hours).\nThe pure BERT pre-training option is also available [later](#bert-as-baselines)\nfor comparison.\n\nNote: by default, mixed-precision training is not used.\nTo support mixed-precision pre-training, \nplease install the [nvidia/apex](https://github.com/NVIDIA/apex) library with the commands:\n```shell script\ngit clone https://github.com/NVIDIA/apex\ncd apex\npip install -v --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n```\nAfter that, you can bring back the options `--fp16` and `--fp16_opt_level O2` in \nthe script `scripts/small_vlm_wiki103.bash`.\nI recommend using `--fp16_opt_level O2`.\nAlthough the O2 option might be [unstable](https://github.com/NVIDIA/apex/issues/818#issuecomment-639012282),\nit saves a lot of memory:\nthe max per-gpu batch size is 32 with O1 but 64 with O2.\n\n#### English Wikipedia\nAfter the [vokenization process](#the-vokenization-process) of English Wikipedia,\nwe can train the model with the command:\n```shell script\n# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME\nbash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base\n```\nIt will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia\ndataset with the support of voken supervision.\nThe snapshot will be saved to `snap/vlm/wiki_bert_base`.\n\nIt takes around 3-5 days on 4 Titan V / GTX 2080 cards\nand around 5-7 days on 4 Titan Pascal / T4 cards.\n(This estimate is accurate since I have inevitably run experiments on all these servers...)\nTitan V / 2080 / T4 GPUs natively support mixed-precision training (triggered by the `--fp16` option, which requires\ninstalling [apex](https://github.com/NVIDIA/apex)),\nand the speed is much faster with it.\nTitan Pascal would also save some memory with the `--fp16` option.\n\n\n### GLUE Evaluation\nWe use the [GLUE](https://gluebenchmark.com/) benchmark\n(e.g., [SST](https://nlp.stanford.edu/sentiment/index.html),\n[MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398),\n[QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs),\n[MNLI](https://cims.nyu.edu/~sbowman/multinli/),\n[QNLI](https://rajpurkar.github.io/SQuAD-explorer/))\nas the default downstream tasks.\nOther tasks can be evaluated following the setup [here](https://github.com/huggingface/transformers/tree/28d183c90cbf91e94651cf4a655df91a52ea1033/examples)\nby changing the option `--model_name_or_path` to the correct snapshot path, e.g., `snap/bert/wiki103`.\n\n#### Download GLUE dataset\nThis downloading script is copied from the [huggingface transformers](https://github.com/huggingface/transformers/tree/master/examples/text-classification)\nproject.\nSince [transformers](https://github.com/huggingface/transformers) is still under active\ndevelopment, API changes might affect the code. \nI have upgraded the code for compatibility with transformers==3.3.\n```shell script\nwget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py\npython download_glue_data.py --data_dir data/glue --tasks all\n```\n\n#### Finetuning on GLUE Tasks\nThe pre-trained snapshots are evaluated by fine-tuning them on the [GLUE](https://gluebenchmark.com/) \nbenchmark.\nThe code is modified from the huggingface [transformers](https://github.com/huggingface/transformers).\n\nRunning GLUE evaluation for snapshots from different epochs:\n```bash\n# bash scripts/run_glue_epochs.bash $GPUS $SNAP_PATH --snaps $NUM_OF_SNAPS                            \nbash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7                            \n```\nIt will assess 7 snapshots using GPUs 0,1,2,3.\nSetting `--snaps -1` will assess all checkpoints.\nIf you just want to evaluate the last (usually the best) snapshot, please use:\n```\nbash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1\n```\n\n#### Showing the results\nFor all results saved under `snap/` (whatever the directory names),\nrunning the following command will print out all the results.\n```bash\npython vlm/show_glue_results_epochs.py \n```\n\nIt will print results like\n```\nsnap/vlm/test_finetune/glueepoch_checkpoint-epoch0019\n     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE\n   54.51   84.72   87.18   52.32   90.02   88.36   87.16   81.92   82.57   78.75\nsnap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029\n     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE\n   58.12   82.76   84.45   26.74   89.56   84.40   86.52   77.56   77.99   74.23\n```\n\n### BERT (As baselines)\nWe also provide pure language-model pre-training as baselines.\n\n#### Wiki103\n```shell script\n# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME\nbash scripts/small_wiki103.bash 0,1,2,3 bert_small\n```\nIt will call \n[vlm/run_lm_distributed.py](vlm/run_lm_distributed.py)\nand run a BERT-6Layers-512Hiddens model on the [wiki103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)\ndataset with the masked language model only.\nThe snapshot will be saved to `snap/bert/bert_small`.\n\nOr you can directly use the script `small_wiki103_glue.bash` to \nenable GLUE evaluation after pre-training finishes:\n```shell script\nbash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small\n```\n\n#### English Wikipedia\nCommand:\n```shell script\n# bash scripts/base_wiki.bash $GPUs $SNAP_NAME\nbash scripts/base_wiki.bash 0,1,2,3 bert_wiki\n```\n\nWith GLUE evaluation:\n```shell script\nbash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki\n```\n\n## Pre-processed Data and Pre-trained Models\n### Data\n\nWiki103 (100M tokens)\n```\nmkdir -p data/wiki103-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased\n```\n\nWiki (2.8B tokens)\n```\nmkdir -p data/wiki-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased\nwget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased\n```\n\n### Models\n- Cross-Modal Matching model: [https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip](https://nlp.cs.unc.edu/data/vokenization/coco_hinge05_dim64_resxt101_bertl4.zip)\n- BERT (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/bert_12L_768H_wiki.zip)\n- BERT + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_12L_768H_wiki.zip)\n- RoBERTa + VLM (on Wiki): [https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip](https://nlp.cs.unc.edu/data/vokenization/vlm_roberta_12L_768H_wiki.zip)\n\n## Reference\nIf you find our project 
useful, please cite this paper:\n```\n@inproceedings{tan2020vokenization,\n  title={Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision},\n  author={Tan, Hao and Bansal, Mohit},\n  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},\n  year={2020}\n}\n```\n\n## Acknowledgement\nI thank the support from the [Bloomberg Data Science Ph.D. Fellowship](https://www.techatbloomberg.com/bloomberg-data-science-ph-d-fellowship/).\nWe thank the reviewers and [Yixin Nie](https://easonnie.github.io/) \nand [Jie Lei](https://www.cs.unc.edu/~jielei/)\nfor their helpful discussions.\nParts of the code are built on huggingface [transformers](https://github.com/huggingface/transformers),\nfacebook [XLM](https://github.com/facebookresearch/XLM), and [faiss](https://github.com/facebookresearch/faiss).\n"
  },
  {
    "path": "data/lxmert/.gitignore",
    "content": "/mscoco_minival.json\n/mscoco_nominival.json\n/mscoco_train.json\n/vgnococo.json\n"
  },
  {
    "path": "data/mscoco/.gitignore",
    "content": "/images\n"
  },
  {
    "path": "data/vg/.gitignore",
    "content": "/images\n"
  },
  {
    "path": "data/wiki/get_data_cased.bash",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# Copy frrom https://github.com/facebookresearch/XLM\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\n#\n# Usage: ./get-data-wiki.sh $lg (en)\n#\n\nset -e\n\nlg=$1  # input language\n\n# data path\nWIKI_PATH=data/wiki-cased\nMAIN_PATH=$WIKI_PATH\n\n# tools paths\nTOOLS_PATH=$MAIN_PATH/tools\nTOKENIZE=$TOOLS_PATH/tokenize.sh\nREMOVE_ACCENT=$TOOLS_PATH/remove_accent.py\n\n# Wiki data\nWIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2\nWIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME\n\n# install tools\ndata/wiki/install-tools.sh $TOOLS_PATH\n\n# create Wiki paths\nmkdir -p $WIKI_PATH/bz2\nmkdir -p $WIKI_PATH/txt\n\n# download Wikipedia dump\necho \"Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ...\"\nwget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/\necho \"Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME\"\n\n# extract and tokenize Wiki data\necho \"*** Cleaning and tokenizing $lg Wikipedia dump ... ***\"\n#python -m $TOOLS_PATH/wikiextractor/wikiextractor/WikiExtractor $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \\\nif [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then\n  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \\\n  | sed \"/^\\s*\\$/d\" \\\n  | grep -v \"^<doc id=\" \\\n  | grep -v \"</doc>\\$\" \\\n  | $TOKENIZE $lg $TOOLS_PATH \\\n  | python $REMOVE_ACCENT \\\n  > $WIKI_PATH/txt/$lg.all.raw\nfi\necho \"*** Tokenized ( + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***\"\n\n# split into train / valid / test\necho \"*** Split into train / valid / test ***\"\nsplit_data() {\n    NLINES=`wc -l $1  | awk -F \" \" '{print $1}'`;\n    NTRAIN=$((NLINES - 10000));\n    NVAL=$((NTRAIN + 5000));\n    cat $1 | head -$NTRAIN             > $2;\n    cat $1 | head -$NVAL | tail -5000  > $3;\n    cat $1 | tail -5000                > $4;\n}\nsplit_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw\n\n# File structure\nmv $WIKI_PATH/txt/* $WIKI_PATH/\nrm -rf $WIKI_PATH/bz2\nrm -rf $WIKI_PATH/txt\n"
  },
  {
    "path": "data/wiki/get_data_cased_untokenized.bash",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# Copy frrom https://github.com/facebookresearch/XLM\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\n#\n# Usage: ./get-data-wiki.sh $lg (en)\n#\n\nset -e\n\nlg=$1  # input language\n\n# data path\nWIKI_PATH=data/wiki-cased-untokenized\nMAIN_PATH=$WIKI_PATH\n\n# tools paths\nTOOLS_PATH=$MAIN_PATH/tools\nTOKENIZE=$TOOLS_PATH/tokenize.sh\nREMOVE_ACCENT=$TOOLS_PATH/remove_accent.py\n\n# Wiki data\nWIKI_DUMP_NAME=${lg}wiki-latest-pages-articles.xml.bz2\nWIKI_DUMP_LINK=https://dumps.wikimedia.org/${lg}wiki/latest/$WIKI_DUMP_NAME\n\n# install tools\ndata/wiki/install-tools.sh $TOOLS_PATH\n\n# create Wiki paths\nmkdir -p $WIKI_PATH/bz2\nmkdir -p $WIKI_PATH/txt\n\n# download Wikipedia dump\nif [ ! -f $WIKI_PATH/bz2/enwiki-latest-pages-articles.xml.bz2 ]; then\n    echo \"Downloading $lg Wikipedia dump from $WIKI_DUMP_LINK ...\"\n    wget -c $WIKI_DUMP_LINK -P $WIKI_PATH/bz2/\n    echo \"Downloaded $WIKI_DUMP_NAME in $WIKI_PATH/bz2/$WIKI_DUMP_NAME\"\nfi\n\n# extract and tokenize Wiki data\n#cd $MAIN_PATH\necho \"*** Cleaning and tokenizing $lg Wikipedia dump ... ***\"\nif [ ! -f $WIKI_PATH/txt/$lg.all.raw ]; then\n  python $TOOLS_PATH/wikiextractor/WikiExtractor.py $WIKI_PATH/bz2/$WIKI_DUMP_NAME --processes 24 -q -o - \\\n  | sed \"/^\\s*\\$/d\" \\\n  | grep -v \"^<doc id=\" \\\n  | grep -v \"</doc>\\$\" \\\n  | python $REMOVE_ACCENT \\\n  > $WIKI_PATH/txt/$lg.all.raw\nfi\necho \"*** Not Tokenized ( but + accent-removal) $lg Wikipedia dump to $WIKI_PATH/txt/train.${lg} ***\"\n\n# split into train / valid / test\necho \"*** Split into train / valid / test ***\"\nsplit_data() {\n    NLINES=`wc -l $1  | awk -F \" \" '{print $1}'`;\n    NTRAIN=$((NLINES - 10000));\n    NVAL=$((NTRAIN + 5000));\n    cat $1 | head -$NTRAIN             > $2;\n    cat $1 | head -$NVAL | tail -5000  > $3;\n    cat $1 | tail -5000                > $4;\n}\nsplit_data $WIKI_PATH/txt/$lg.all.raw $WIKI_PATH/txt/$lg.train.raw $WIKI_PATH/txt/$lg.valid.raw $WIKI_PATH/txt/$lg.test.raw\n\n# File structure\nmv $WIKI_PATH/txt/* $WIKI_PATH/\nrm -rf $WIKI_PATH/bz2\nrm -rf $WIKI_PATH/txt\n"
  },
  {
    "path": "data/wiki/install-tools.sh",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\nset -e\n\n# data path\nTOOLS_PATH=$1\n\n# tools\nMOSES_DIR=mosesdecoder\nFASTBPE_DIR=fastBPE\nFASTBPE=fast\nWMT16_SCRIPTS=wmt16-scripts\n\n# tools path\nmkdir -p $TOOLS_PATH\n\n# Copy the scripts to TOOLS_PATH\ncp -r data/wiki/tools/* $TOOLS_PATH\n\n\n#\n# Download and install tools\n#\n\nold=$(pwd)\ncd $TOOLS_PATH\n\n\n# Download Moses\nif [ ! -d \"$MOSES_DIR\" ]; then\n  echo \"Cloning Moses from GitHub repository...\"\n  git clone https://github.com/moses-smt/mosesdecoder.git\nfi\n\n# Download fastBPE\nif [ ! -d \"$FASTBPE_DIR\" ]; then\n  echo \"Cloning fastBPE from GitHub repository...\"\n  git clone https://github.com/glample/fastBPE\nfi\n\n# Compile fastBPE\nif [ ! -f \"$FASTBPE_DIR/$FASTBPE\" ]; then\n  echo \"Compiling fastBPE...\"\n  cd fastBPE\n  g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast\n  cd ..\nfi\n\n# Download Sennrich's tools\nif [ ! -d \"$WMT16_SCRIPTS\" ]; then\n  echo \"Cloning WMT16 preprocessing scripts...\"\n  git clone https://github.com/rsennrich/wmt16-scripts.git\nfi\n\n# Download WikiExtractor\nif [ ! -d wikiextractor ]; then\n    echo \"Cloning WikiExtractor from GitHub repository...\"\n    git clone https://github.com/attardi/wikiextractor.git\n    cd wikiextractor\n    git checkout e4abb4cbd019b0257824ee47c23dd163919b731b\n    cd ..\nfi\n\ncd $old\n\n# # Chinese segmenter\n# if ! ls $TOOLS_PATH/stanford-segmenter-* 1> /dev/null 2>&1; then\n#   echo \"Stanford segmenter not found at $TOOLS_PATH/stanford-segmenter-*\"\n#   echo \"Please install Stanford segmenter in $TOOLS_PATH\"\n#   exit 1\n# fi\n# \n# # Thai tokenizer\n# if ! python -c 'import pkgutil; exit(not pkgutil.find_loader(\"pythainlp\"))'; then\n#   echo \"pythainlp package not found in python\"\n#   echo \"Please install pythainlp (pip install pythainlp)\"\n#   exit 1\n# fi\n# \n"
  },
  {
    "path": "data/wiki/tools/remove_accent.py",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\nimport sys\nimport unicodedata\nimport six\n\n\ndef convert_to_unicode(text):\n    \"\"\"\n    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.\n    \"\"\"\n    # six_ensure_text is copied from https://github.com/benjaminp/six\n    def six_ensure_text(s, encoding='utf-8', errors='strict'):\n        if isinstance(s, six.binary_type):\n            return s.decode(encoding, errors)\n        elif isinstance(s, six.text_type):\n            return s\n        else:\n            raise TypeError(\"not expecting type '%s'\" % type(s))\n\n    return six_ensure_text(text, encoding=\"utf-8\", errors=\"ignore\")\n\n\ndef run_strip_accents(text):\n    \"\"\"\n    Strips accents from a piece of text.\n    \"\"\"\n    text = unicodedata.normalize(\"NFD\", text)\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat == \"Mn\":\n            continue\n        output.append(char)\n    return \"\".join(output)\n\n\nfor line in sys.stdin:\n    line = convert_to_unicode(line.rstrip())\n    line = run_strip_accents(line)\n    print(u'%s' % line)\n"
  },
  {
    "path": "data/wiki/tools/segment_th.py",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\nimport sys\nfrom pythainlp.tokenize import word_tokenize\n\nfor line in sys.stdin.readlines():\n    line = line.rstrip('\\n')\n    print(' '.join(word_tokenize(line)))\n"
  },
  {
    "path": "data/wiki/tools/tokenize.sh",
    "content": "# Copyright (c) 2019-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the root directory of this source tree.\n#\n\n# Tokenize text data in various languages\n# Usage: e.g.   cat wiki.ar | tokenize.sh ar\n\nset -e\n\nN_THREADS=8\n\nlg=$1\nTOOLS_PATH=$2\n\n# moses\nMOSES=$TOOLS_PATH/mosesdecoder\nREPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl\nNORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl\nREM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl\nTOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl\n\n# Chinese\nif [ \"$lg\" = \"zh\" ]; then\n  $TOOLS_PATH/stanford-segmenter-*/segment.sh pku /dev/stdin UTF-8 0 | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR\n# Thai\nelif [ \"$lg\" = \"th\" ]; then\n  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $TOOLS_PATH/segment_th.py\n# Japanese\nelif [ \"$lg\" = \"ja\" ]; then\n  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | kytea -notags\n# other languages\nelse\n  cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg\nfi\n"
  },
  {
    "path": "data/wiki103/get_data_cased.sh",
    "content": "OUTPUT=data/wiki103-cased\nwget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -P $OUTPUT/\nunzip $OUTPUT/wikitext-103-raw-v1.zip -d $OUTPUT\nmv $OUTPUT/wikitext-103-raw/* $OUTPUT\nrm -rf $OUTPUT/wikitext-103-raw-v1.zip $OUTPUT/wikitext-103-raw\n"
  },
  {
    "path": "data/wiki103/get_data_uncased.sh",
    "content": "OUTPUT=data/wiki103\n\nwget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip -P $OUTPUT/\nunzip $OUTPUT/wikitext-103-v1.zip -d $OUTPUT\nmv $OUTPUT/wikitext-103/* $OUTPUT\nrm -rf $OUTPUT/wikitext-103-v1.zip $OUTPUT/wikitext-103\n"
  },
  {
    "path": "requirements.txt",
    "content": "torch\n#==1.4.0\ntorchvision\n#==0.5.0\ntransformers==3.3.0\ntensorboardX\n\n# For GLUE evaluation\nsklearn\n\n# Fiass supports fast indexing.\n# The code has a torch-implemented GPU indexing, so do not worry if you could not install faiss.\nfaiss-gpu>=1.6.3\n\n# Spacy is used in sentence segmentation where the sentences are the input the cross-modality matching model.\nspacy\n\n# A higher h5py version to support h5py.VirtualLayout\nh5py>=2.10.0\n"
  },
  {
    "path": "scripts/base_vlm_wiki.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm $output/src/\ncp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash\ncp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki-cased/en.train.raw\nexport TEST_FILE=data/wiki-cased/en.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-12L-768H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=32 \\\n    --per_gpu_eval_batch_size=32 \\\n\t--gradient_accumulation_steps=2 \\\n    --max_steps=200000 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=5000 \\\n    --mlm_probability 0.15 \\\n    --mlm_ratio 1.0 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --do_voken_cls \\\n    --voken_labels all \\\n    --voken_dir snap/xmatching/bert_resnext/vokens \\\n    --voken_suffix vg_nococo \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n"
  },
  {
    "path": "scripts/base_vlm_wiki_glue.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm $output/src/\ncp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash\ncp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki-cased/en.train.raw\nexport TEST_FILE=data/wiki-cased/en.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-12L-768H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=32 \\\n    --per_gpu_eval_batch_size=32 \\\n\t--gradient_accumulation_steps=2 \\\n    --max_steps=200000 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=5000 \\\n    --mlm_probability 0.15 \\\n    --mlm_ratio 1.0 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --do_voken_cls \\\n    --voken_labels all \\\n    --voken_dir snap/xmatching/bert_resnext/vokens \\\n    --voken_suffix vg_nococo \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n# Wait for clearing the GPU cache\nsleep 30\nbash scripts/run_glue_epochs.bash $GPUS $output --snaps 4\n\n"
  },
  {
    "path": "scripts/base_wiki.bash",
    "content": "GPUS=$1\n# The name of experiment\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm/*.py $output/src/\ncp $0 $output/run.bash\ncp run_glue_epochs.bash $output/run_glue_epochs.bash\ncp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \n\nexport TRAIN_FILE=data/wiki-cased/en.train.raw\nexport TEST_FILE=data/wiki-cased/en.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-12L-768H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=64 \\\n    --per_gpu_eval_batch_size=64 \\\n\t--gradient_accumulation_steps=1 \\\n    --max_steps 220000 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=5000 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n"
  },
  {
    "path": "scripts/base_wiki_glue.bash",
    "content": "GPUS=$1\n# The name of experiment\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm/*.py $output/src/\ncp $0 $output/run.bash\ncp run_glue_epochs.bash $output/run_glue_epochs.bash\ncp run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \n\nexport TRAIN_FILE=data/wiki-cased/en.train.raw\nexport TEST_FILE=data/wiki-cased/en.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-12L-768H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=64 \\\n    --per_gpu_eval_batch_size=64 \\\n\t--gradient_accumulation_steps=1 \\\n    --max_steps 220000 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=5000 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n#--shuffle \\\n# Wait for clearing the GPU cache\nsleep 30\nbash scripts/run_glue_epochs.bash $GPUS $output --snaps -1\n"
  },
  {
    "path": "scripts/extract_keys.bash",
    "content": "CUDA_VISIBLE_DEVICES=$1 python vokenization/extract_vision_keys.py \\\n    --image-sets vg_nococo,coco_minival,coco_nominival,coco_train,cc_valid \\\n    --load-dir snap/xmatching/$2\n"
  },
  {
    "path": "scripts/mpvokenize_wiki.bash",
    "content": "GPU=$1\n\nLOAD=snap/xmatching/$2\nDATA_DIR=data/wiki-cased\nTOKENIZER=bert-base-uncased\n\nfor DATA_NAME in en.valid.raw en.test.raw en.train.raw\ndo \n    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \\\n        --load $LOAD \\\n        --corpus=$DATA_DIR/$DATA_NAME \\\n        --tokenizer-name $TOKENIZER \\\n        --image-sets vg_nococo \\\n        --max-img-num 50000 \ndone\n\n"
  },
  {
    "path": "scripts/mpvokenize_wiki103.bash",
    "content": "GPU=$1\n\nLOAD=snap/xmatching/$2\nWIKI_DIR=data/wiki103-cased\nTOKENIZER=bert-base-uncased\n\nfor DATA_NAME in wiki.valid.raw wiki.test.raw wiki.train.raw\ndo \n    CUDA_VISIBLE_DEVICES=$GPU python vokenization/vokenize_corpus_mp.py \\\n        --load $LOAD \\\n        --corpus=$WIKI_DIR/$DATA_NAME \\\n        --tokenizer-name $TOKENIZER \\\n        --image-sets vg_nococo \\\n        --max-img-num 50000\ndone\n\n"
  },
  {
    "path": "scripts/run_glue_at_epoch.bash",
    "content": "export GLUE_DIR=data/glue/\nEPOCHS=$2\nMODEL=$3\nCKPT=$4\n\nfor TASK_NAME in WNLI RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI\ndo\n    CUDA_VISIBLE_DEVICES=$1 python vlm/run_glue.py \\\n        --model_type bert \\\n        --tokenizer_name=bert-base-uncased \\\n        --model_name_or_path $MODEL/$CKPT \\\n        --task_name $TASK_NAME \\\n        --do_train \\\n        --do_eval \\\n        --do_lower_case \\\n        --data_dir $GLUE_DIR/$TASK_NAME \\\n        --save_steps -1 \\\n        --max_seq_length 126 \\\n        --per_gpu_eval_batch_size=32   \\\n        --per_gpu_train_batch_size=32   \\\n        --learning_rate 1e-4 \\\n        --warmup_steps 0.1 \\\n        --num_train_epochs $EPOCHS.0 \\\n        --output_dir $MODEL/glueepoch_$CKPT/$TASK_NAME\ndone\n\n        #--overwrite_output_dir \\\n"
  },
  {
    "path": "scripts/run_glue_epochs.bash",
    "content": "GPUS=$1\nMODEL=$2\n \npython vlm/run_glue_epochs.py --gpus $GPUS --load $MODEL \\\n    ${@:3}\n\n"
  },
  {
    "path": "scripts/run_xmatching.bash",
    "content": "GPUS=$1\n# The name of experiment\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/xmatching/$NAME\nmkdir -p $output/src/\ncp -r xmatching $output/src/\ncp $0 $output/run.bash\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$GPUS unbuffer python xmatching/main.py \\\n    --train-imgs mscoco_train,mscoco_nominival --valid-imgs mscoco_minival \\\n    --train-langs mscoco --valid-langs mscoco \\\n    --max-len 20 --dim 64 \\\n    --lang-layers 4,3,2,1 \\\n    --lang-pretrained --visn-pretrained \\\n    --num-workers 8 --batchSize 256 --optim adam --lr 1e-3 --epochs 20 \\\n    --nodes 1 --nr 0 \\\n    --output $output ${@:3} | tee $output/log.log\n\n#--visn resnext101_32x8d --lang bert \\\n"
  },
  {
    "path": "scripts/small_vlm_wiki103.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/vlm/$NAME\nmkdir -p $output/src\ncp -r vlm $output/src/\ncp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash\ncp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki103-cased/wiki.train.raw\nexport TEST_FILE=data/wiki103-cased/wiki.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-6L-512H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=32 \\\n    --per_gpu_eval_batch_size=32 \\\n\t--gradient_accumulation_steps=2 \\\n\t--num_train_epochs=40 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=10000 \\\n    --mlm_probability 0.15 \\\n    --mlm_ratio 1.0 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --do_voken_cls \\\n    --voken_labels all \\\n    --voken_dir snap/xmatching/bert_resnext/vokens \\\n    --voken_suffix vg_nococo \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n"
  },
  {
    "path": "scripts/small_vlm_wiki103_glue.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/vlm/$NAME\nmkdir -p $output/src\ncp -r vlm $output/src/\ncp scripts/run_glue_epochs.bash $output/run_glue_epochs.bash\ncp scripts/run_glue_at_epoch.bash $output/run_glue_at_epoch.bash \ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki103-cased/wiki.train.raw\nexport TEST_FILE=data/wiki103-cased/wiki.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vlm/run_vlm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-6L-512H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=32 \\\n    --per_gpu_eval_batch_size=32 \\\n\t--gradient_accumulation_steps=2 \\\n\t--num_train_epochs=40 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=10000 \\\n    --mlm_probability 0.15 \\\n    --mlm_ratio 1.0 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --do_voken_cls \\\n    --voken_labels all \\\n    --voken_dir snap/xmatching/bert_resnext/vokens \\\n    --voken_suffix vg_nococo \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n# Wait for clearing the GPU cache\nsleep 30\nbash scripts/run_glue_epochs.bash $GPUS $output --snaps 4\n\n"
  },
  {
    "path": "scripts/small_wiki103.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm/*.py $output/src/\ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki103-cased/wiki.train.raw\nexport TEST_FILE=data/wiki103-cased/wiki.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-6L-512H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=64 \\\n    --per_gpu_eval_batch_size=64 \\\n\t--gradient_accumulation_steps=1 \\\n\t--num_train_epochs=44 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=10000 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --shuffle \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n\n"
  },
  {
    "path": "scripts/small_wiki103_glue.bash",
    "content": "# The name of experiment\nGPUS=$1\nNAME=$2\n\n# Create dirs and make backup\noutput=snap/bert/$NAME\nmkdir -p $output/src\ncp -r vlm/*.py $output/src/\ncp $0 $output/run.bash\n\nexport TRAIN_FILE=data/wiki103-cased/wiki.train.raw\nexport TEST_FILE=data/wiki103-cased/wiki.valid.raw\n\n# Pre-training\nCUDA_VISIBLE_DEVICES=$GPUS unbuffer python vlm/run_lm_distributed.py \\\n    --output_dir=$output \\\n\t--overwrite_output_dir \\\n\t--config_name=vlm/configs/bert-6L-512H.json \\\n\t--tokenizer_name=bert-base-uncased \\\n    --model_type=bert \\\n\t--block_size=126 \\\n\t--per_gpu_train_batch_size=64 \\\n    --per_gpu_eval_batch_size=64 \\\n\t--gradient_accumulation_steps=1 \\\n\t--num_train_epochs=44 \\\n\t--learning_rate=2e-4 \\\n\t--weight_decay=0.01 \\\n\t--warmup_steps=10000 \\\n    --do_train \\\n    --train_data_file=$TRAIN_FILE \\\n    --do_eval \\\n    --eval_data_file=$TEST_FILE \\\n    --col_data \\\n    --split_sent \\\n    --shuffle \\\n    --mlm ${@:3} | tee $output/log.log\n\n    #--fp16 \\\n\t#--fp16_opt_level O2 \\\n\n\n# Wait for clearing the GPU cache\nsleep 30\nbash scripts/run_glue_epochs.bash $GPUS $output --snaps 4\n"
  },
  {
    "path": "scripts/xmatching_benchmark.bash",
    "content": "# Benchmarking the cross-modal matching model with\n#     1. Retrieval scores.\n#     2. Voken Diversity w.r.t words in specific language corpus.\n# Please run this after image_key_retrivel and tokenization. \n#    i.e., step 1 and step2 in readme.md\n\nMODEL=$2\nMODELPATH=snap/xmatching/$MODEL\nrm -rf $MODELPATH/analysis.log\n\n# Retrieval scores\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_retrieval.py \\\n    --load $MODELPATH \\\n    --image-sets coco_minival,cc_valid \\\n    | tee -a $MODELPATH/analysis.log\n\n# Diversity\n# Test diversity of vision-and-language (captioning) datasets\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \\\n    --load $MODELPATH \\\n    --image-sets vg_nococo \\\n    --corpus coco_minival,cc_valid \\\n    | tee -a $MODELPATH/analysis.log\n\n# Test diversity of pure-language corpus\nCUDA_VISIBLE_DEVICES=$1 unbuffer python vokenization/evaluate_diversity.py \\\n    --load $MODELPATH \\\n    --image-sets vg_nococo \\\n    --corpus data/wiki103-cased/wiki.valid.raw \\\n    --maxsents 95000 \\\n    | tee -a $MODELPATH/analysis.log\n"
  },
  {
    "path": "snap/bert/.gitkeep",
    "content": ""
  },
  {
    "path": "snap/vlm/.gitkeep",
    "content": ""
  },
  {
    "path": "snap/xmatching/.gitkeep",
    "content": "/*\n"
  },
  {
    "path": "tokenization/to_hdf5.py",
    "content": "import h5py\nimport numpy as np\nimport tqdm\n\nfrom transformers import AutoTokenizer\n\n\ndef validate_hdf5(fname, tokenizer_name):\n    print(\"--------------------------------------------\")\n    print(\"Start to valid the hdf5 file\", fname + '.' + tokenizer_name + '.hdf5')\n\n    with open(fname) as f:\n        lines = []\n        for line in f:\n            if 'wiki' in fname:\n                # Wiki103: remove document title\n                if line.startswith(' = '):\n                    continue\n                # Full Wiki: Remove the too short lines.\n                if len(line.strip().split(' ')) < 5:\n                    continue\n\n            if len(line.strip()) == 0:\n                # Always drop empty line\n                continue\n            lines.append(line)\n\n    # Use the slow tokenizer to validate the results of the fast tokenizer.\n    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n\n    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'r')\n    tokens = h5_file['tokens']\n\n    print(\"Start to check the first 10 lines:\")\n    ids = []\n    for line in lines[:10]:\n        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))\n    ids = np.array(ids)\n    first_tokens = np.array(tokens[:len(ids)])\n    if np.array_equal(ids, first_tokens):\n        print(\"PASS\")\n    else:\n        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))\n        print()\n        print(' '.join(tokenizer.convert_ids_to_tokens(first_tokens)))\n        assert False, \"FAIL\"\n\n    print(\"Start to check the last 10 lines:\")\n    ids = []\n    for line in lines[-10:]:\n        ids.extend(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line)))\n    ids = np.array(ids)\n    last_tokens = np.array(tokens[-len(ids):])\n    if np.array_equal(ids, last_tokens):\n        print(\"PASS\")\n    else:\n        print(' '.join(tokenizer.convert_ids_to_tokens(ids)))\n        print(' '.join(tokenizer.convert_ids_to_tokens(last_tokens)))\n        assert False, \"FAIL\"\n    print(\"--------------------------------------------\")\n\n\ndef to_hdf5(fname, tokenizer_name, validate=True):\n    print(\"Process %s\" % fname)\n\n    h5_file = h5py.File(fname + '.' + tokenizer_name + '.hdf5', 'w')\n    dset = h5_file.create_dataset(\"tokens\",\n                                  (0,),\n                                  maxshape=(None,),\n                                  dtype='int32')\n\n    dump_interval = 1000000\n    dump_iter = 0\n    with open('%s.%s' % (fname, tokenizer_name)) as f:\n        lines = 0\n        tokens = []\n        for line in tqdm.tqdm(f):\n            for token in map(int, line.split(' ')):\n                tokens.append(token)\n            if len(tokens) >= dump_interval:\n                dset.resize((dump_iter + len(tokens),))\n                dset[dump_iter: dump_iter + len(tokens)] = tokens\n                dump_iter += len(tokens)\n                tokens = []\n            lines += 1\n\n        dset.resize((dump_iter + len(tokens),))\n        dset[dump_iter: dump_iter + len(tokens)] = tokens\n        dump_iter += len(tokens)\n\n    assert len(dset) == dump_iter\n    h5_file.close()\n\n    if validate:\n        validate_hdf5(fname, tokenizer_name)\n\n    print()\n\n"
  },
  {
    "path": "tokenization/tokenize_dataset.py",
    "content": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nfrom pathlib import Path\n\nfrom transformers import AutoTokenizer\nimport time\n\nfrom to_hdf5 import to_hdf5\n\ndef tokenize_dataset(data_dir, fname, tokenizer_name, lines_are_sents=False):\n    data_path = Path(data_dir)\n\n    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)\n\n    f = open(data_path / fname)\n    g = open((data_path / ('%s.%s' % (fname, tokenizer_name))), 'w')\n\n    # Statistics\n    dcmt_cnt = 0\n    token_cnt = 0\n    line_cnt = 0\n    line_starts = []\n\n    # Logging and dumping hyper-parameters\n    cache = ''\n    log_interval = log_iter = 1000000\n    dump_interval = dump_iter = 100000\n    start_time = time.time()\n\n    for i, line in enumerate(f):\n        # Identify the start of documents, ignore it.\n        if 'wiki103' in data_dir:\n            if line.startswith(' = '):\n                dcmt_cnt += 1\n                continue\n        elif 'wiki' in data_dir:\n            if len(line.strip().split(' ')) == 1:\n                dcmt_cnt += 1\n                continue\n\n        if 'wiki' in data_dir:\n            # Remove too short lines. Book corpus does not need this.\n            if len(line.strip().split(' ')) < 5:\n                continue\n\n        # Drop empty line (1)\n        if len(line.strip()) == 0:\n            continue\n\n        tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))\n        # tokenized_line = tokenizer.encode(line, add_special_tokens=False)\n        if len(tokenized_line) == 0:    # Drop empty line (2)\n            continue\n\n        line_cnt += 1\n        line_starts.append(token_cnt)\n        if i < 5:\n            print()\n            print('Line:', line)\n            print('Tokens:', ' '.join(tokenizer.convert_ids_to_tokens(tokenized_line)))\n        token_cnt += len(tokenized_line)\n        cache += ' '.join(map(str, tokenized_line)) + '\\n'\n\n        if (token_cnt + 1) > dump_iter:\n            g.write(cache)\n            cache = ''\n            dump_iter += dump_interval\n\n        if (token_cnt + 1) > log_iter:\n            used_time = time.time() - start_time\n            print(\"Process %d tokens in %d seconds, %0.4f tokens per second.\" % (\n                token_cnt, used_time, token_cnt / used_time))\n            log_iter += log_interval\n\n    # Deal with the last remaining tokens.\n    line_starts.append(token_cnt)\n    g.write(cache)\n\n    # Dump Line starts\n    identifier = 'sent' if lines_are_sents else 'line'\n    with open(data_path / ('%s.%s.%s' % (fname, tokenizer_name, identifier)), 'w') as f:\n        for line_start in line_starts:\n            f.write(str(line_start) + \"\\n\")\n\n    f.close()\n    g.close()\n    print(f\"Documents: {dcmt_cnt}, Lines: {line_cnt}, Words: {token_cnt} in dataset {fname}\")\n\n    to_hdf5(str(data_path / fname), tokenizer_name)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n\n    # Required parameters\n    parser.add_argument(\n        \"datadir\", default=None, type=str, help=\"The input training data file (a text file).\"\n    )\n    parser.add_argument(\n        \"fname\", default=None, type=str, help=\"The input training data file (a text file).\"\n    )\n    parser.add_argument(\n        \"tokenizer_name\", default=None, type=str, help=\"The input training data file (a text file).\"\n    )\n    parser.add_argument(\n        \"--lines-are-sents\", action='store_true',\n        help=\"Add this if the line are 
 already segmented into sentences, instead of paragraphs.\"\n    )\n\n    param = parser.parse_args()\n\n    tokenize_dataset(\n        param.datadir,\n        param.fname,\n        param.tokenizer_name,\n        param.lines_are_sents,\n    )\n\n"
  },
  {
    "path": "tokenization/tokenize_wiki103_bert.bash",
    "content": "DATA_DIR=data/wiki103-cased\nTOKENIZER=bert-base-uncased\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER\n"
  },
  {
    "path": "tokenization/tokenize_wiki103_roberta.bash",
    "content": "DATA_DIR=data/wiki103-cased\nTOKENIZER=roberta-base\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.valid.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.test.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR wiki.train.raw $TOKENIZER\n"
  },
  {
    "path": "tokenization/tokenize_wiki_bert.bash",
    "content": "DATA_DIR=data/wiki-cased\nTOKENIZER=bert-base-uncased\npython tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER\n"
  },
  {
    "path": "tokenization/tokenize_wiki_roberta.bash",
    "content": "DATA_DIR=data/wiki-cased-untokenized/\nTOKENIZER=roberta-base\npython tokenization/tokenize_dataset.py $DATA_DIR en.valid.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR en.test.raw $TOKENIZER\npython tokenization/tokenize_dataset.py $DATA_DIR en.train.raw $TOKENIZER\n"
  },
  {
    "path": "vlm/__init__.py",
    "content": "import data\n"
  },
  {
    "path": "vlm/configs/bert-12L-768H.json",
    "content": "{\n  \"architectures\": [\n    \"BertForMaskedLM\"\n  ],\n  \"attention_probs_dropout_prob\": 0.1,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 768,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 3072,\n  \"max_position_embeddings\": 512,\n  \"num_attention_heads\": 12,\n  \"num_hidden_layers\": 12,\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 30522\n}\n"
  },
  {
    "path": "vlm/configs/bert-4L-768H.json",
    "content": "{\n  \"architectures\": [\n    \"BertForMaskedLM\"\n  ],\n  \"attention_probs_dropout_prob\": 0.1,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 768,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 3072,\n  \"max_position_embeddings\": 512,\n  \"num_attention_heads\": 12,\n  \"num_hidden_layers\": 4,\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 30522\n}\n"
  },
  {
    "path": "vlm/configs/bert-6L-512H.json",
    "content": "{\n  \"architectures\": [\n    \"BertForMaskedLM\"\n  ],\n  \"attention_probs_dropout_prob\": 0.1,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 512,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 2048,\n  \"max_position_embeddings\": 512,\n  \"num_attention_heads\": 8,\n  \"num_hidden_layers\": 6,\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 30522\n}\n"
  },
  {
    "path": "vlm/configs/bert_base.json",
    "content": "{\n  \"architectures\": [\n    \"BertForMaskedLM\"\n  ],\n  \"attention_probs_dropout_prob\": 0.1,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 768,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 3072,\n  \"max_position_embeddings\": 512,\n  \"num_attention_heads\": 12,\n  \"num_hidden_layers\": 12,\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 30522\n}\n"
  },
  {
    "path": "vlm/data.py",
    "content": "import copy\nimport os\nimport random\n\nimport h5py\nimport torch\nfrom torch.utils.data import DataLoader, Dataset\nimport tqdm\n\n\nclass CoLDataset(Dataset):\n    IGNORE_ID = -100\n    sent_strategy = 'first'\n\n    def __init__(self, file_path, tokenizer_name, tokenizer, block_size=512,\n                 split_sent=False, voken_dir=None, suffix=None, verbose=False,\n                 voken_ablation=None):\n\n        # Open token's hdf5\n        token_path = file_path + '.' + tokenizer_name + '.hdf5'\n        assert os.path.isfile(token_path)\n        if verbose:\n            print(\"-------- Load Data -------\")\n            print(\"Load tokens from\", token_path)\n        self.token_hdf5 = h5py.File(token_path, 'r')\n        self.tokenizer = tokenizer\n        self.tokens = self.token_hdf5['tokens']\n        self.verbose = verbose\n        self.voken_ablation = voken_ablation\n        self._iter_cnt = 0\n\n        # Open voken's hdf5 and load voken ids\n        if voken_dir is not None:\n            assert suffix is not None, 'Please provide suffix of the voken, e.g., vg_nococo.5000.'\n            self.sent_level = 'sent' in voken_dir\n            dset_fname = os.path.split(file_path)[-1]\n            voken_path = os.path.join(voken_dir, f\"{dset_fname}.{suffix}.hdf5\")\n            voken_ids_path = os.path.join(voken_dir, f\"{dset_fname}.{suffix}.ids\")\n            if verbose:\n                print(\"Load vokens from\", voken_path)\n            self.voken_hdf5 = h5py.File(voken_path, 'r')\n            self.vokens = self.voken_hdf5['vokens']\n            assert len(self.vokens) == len(self.tokens)\n            self._voken_ids = list(\n                map(lambda x: x.strip(),\n                    open(voken_ids_path).readlines())\n            )\n            if verbose:\n                print(\"\\t with voken size\", self.voken_size)\n                print(\"\\t top 5 voken ids are:\", self._voken_ids[:5])\n        else:\n            self.vokens = None\n\n        # Split for every block_size tokens\n        # The last block without full length will be dropped.\n        num_tokens = len(self.tokens)\n        self.starts = list(range(0, num_tokens, block_size))\n        self.batches = list(zip(self.starts[:-1], self.starts[1:]))\n\n        manual_filtered =False\n        if \"en.train.raw\" in file_path and tokenizer_name == \"bert-base-uncased\":\n            self.batches = manual_filter(self.batches)\n            if verbose:\n                print(\"Data: Mannually filter the range for counties.\")\n            manual_filtered = True\n\n        # batch_info\n        if verbose:\n            print(\"Split sent with block size\", block_size)\n            print(f\"Total batches: {len(self.batches)}\")\n            print(f\"Total tokens: {len(self.tokens)}\")\n            if voken_dir is not None:\n                print(f\"Total vokens: {len(self.vokens)}\")\n            if voken_ablation is not None:\n                print(\"The model will process voken ablation strategy:\", voken_ablation)\n            print()\n\n        block_check(self.batches, block_size, fixed_size=True, manual_filtered=manual_filtered)\n        if self.voken_ablation == 'token':\n            self._voken_ids = list(range(30522))\n\n    @property\n    def voken_size(self):\n        return len(self._voken_ids)\n\n    @property\n    def voken_ids(self):\n        return copy.copy(self._voken_ids)\n\n    def assert_equal_vokens(self, dataset):\n        assert self.voken_size == dataset.voken_size\n        
        for vid, vid1 in zip(self.voken_ids, dataset.voken_ids):\n            assert vid == vid1\n\n    def __len__(self):\n        return len(self.batches) - 1\n\n    def __getitem__(self, item):\n        token_start, token_end = self.batches[item]\n        if self._iter_cnt < 5 and self.verbose:\n            print(f\"Data Loader: data iteration {self._iter_cnt}, with range {token_start} to {token_end}.\")\n            self._iter_cnt += 1\n        tokens = list(self.tokens[token_start: token_end])\n        token_tensor = torch.tensor(\n            self.tokenizer.build_inputs_with_special_tokens(tokens),\n            dtype=torch.long)\n        if self.vokens is not None:\n            vokens = list(self.vokens[token_start: token_end])\n\n            vokens = self.maybe_do_sent_level(vokens)\n            vokens = self.maybe_do_ablation_study(vokens, tokens)\n\n            voken_tensor = torch.tensor(\n                [self.IGNORE_ID] + vokens + [self.IGNORE_ID],\n                dtype=torch.long\n            )\n\n            return token_tensor, voken_tensor\n        else:\n            return token_tensor\n\n    def maybe_do_sent_level(self, vokens):\n        if not self.sent_level:\n            return vokens\n        else:\n            if self.sent_strategy == 'all':\n                vokens = [\n                    (-voken-1 if voken < 0 else voken)\n                    for voken in vokens\n                ]\n            elif self.sent_strategy == 'first':\n                vokens = [\n                    (self.IGNORE_ID if voken < 0 else voken)\n                    for voken in vokens\n                ]\n            return vokens\n\n    def maybe_do_ablation_study(self, vokens, tokens):\n        if self.voken_ablation is None:\n            return vokens\n        else:\n            if self._iter_cnt < 5 and self.verbose:\n                print(\"Before voken ablation: \", vokens)\n            if self.voken_ablation == 'random':\n                vokens = [random.randint(0, self.voken_size - 1)\n                          for _ in range(len(vokens))]\n            elif self.voken_ablation == 'shuffle':\n                random.shuffle(vokens)\n            elif self.voken_ablation == 'reverse':\n                vokens = vokens[::-1]\n            elif self.voken_ablation == 'token':\n                vokens = tokens\n            if self._iter_cnt < 5 and self.verbose:\n                print(\"After voken ablation: \", vokens)\n            return vokens\n\n    def get_item_info(self, item):\n        # self.batches stores (start, end) pairs, so unpack a single entry.\n        token_start, token_end = self.batches[item]\n        return token_start, token_end\n\n    def __del__(self):\n        self.token_hdf5.close()\n        if self.vokens is not None:\n            self.voken_hdf5.close()\n\n\nFORBIDDEN_RANGE = (\n    119314944,      # Start of iter 3700\n    187053048       # End of iter 5800\n)\n\n\ndef intersect(x, y):\n    x1, x2 = x\n    y1, y2 = y\n    if x2 <= y1 or x1 >= y2:\n        # Case 1: [   x    )[   y    )\n        # Case 2: [   y    )[   x    )\n        return False\n    return True\n\n\ndef manual_filter(batches):\n    batches = list(filter(\n        lambda x: not intersect(x, FORBIDDEN_RANGE),\n        batches\n    ))\n    return batches\n\n\ndef block_check(batches, block_size, fixed_size=False, manual_filtered=False):\n    \"\"\"\n    Check whether the batches satisfy the following requirements.\n        1. Monotonic\n        2. Mutually exclusive\n        3.
 Range length <= block_size\n    \"\"\"\n    last_end = 0\n    for start_token, end_token in batches:\n        assert last_end <= start_token\n        if fixed_size:\n            assert (end_token - start_token) == block_size, 'len([%d, %d)) != %d' % (start_token, end_token, block_size)\n        else:\n            assert (end_token - start_token) <= block_size, 'len([%d, %d)) > %d' % (start_token, end_token, block_size)\n        if manual_filtered:\n            assert not intersect((start_token, end_token), FORBIDDEN_RANGE)\n        last_end = end_token\n\n\ndef get_voken_feats(dataset: CoLDataset, feat_dir: str):\n    \"\"\"\n    Load the pre-extracted visual features for the img_ids of the vokens.\n    \"\"\"\n    set2id2feat = {}\n    voken_feats = []\n    for voken_id in dataset.voken_ids:\n        voken_img_set, voken_img_id = voken_id.split('/')\n        if voken_img_set not in set2id2feat:\n            img_ids = list(map(\n                lambda x: x.rstrip(),\n                open(os.path.join(feat_dir, f\"{voken_img_set}.ids\"))\n            ))\n            img_feats = h5py.File(\n                os.path.join(feat_dir, f\"{voken_img_set}.hdf5\"), 'r'\n            )['keys'][:]\n            id2feat = {}\n            assert len(img_ids) == len(img_feats)\n            for img_id, img_feat in zip(img_ids, img_feats):\n                id2feat[img_id] = img_feat\n            set2id2feat[voken_img_set] = id2feat\n        voken_feats.append(set2id2feat[voken_img_set][voken_img_id])\n    return voken_feats\n\n\n\n"
  },
  {
    "path": "vlm/model.py",
    "content": "import math\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.nn import CrossEntropyLoss, MSELoss, SmoothL1Loss\nfrom torch import nn\nfrom transformers import (\n    BertConfig,\n    BertForMaskedLM,\n)\n\nfrom transformers.modeling_bert import BertOnlyMLMHead\n\n\nBertLayerNorm = torch.nn.LayerNorm\n\n\n# The GLUE function is copied from huggingface transformers:\n# https://github.com/huggingface/transformers/blob/c6acd246ec90857b70f449dcbcb1543f150821fc/src/transformers/activations.py\ndef _gelu_python(x):\n    \"\"\" Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\n\n\nif torch.__version__ < \"1.4.0\":\n    gelu = _gelu_python\nelse:\n    gelu = F.gelu\n\n\nclass CoLBertConfig(BertConfig):\n    def __init__(self, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.voken_size = None\n        self.voken_dim = None\n        self.do_voken_cls = False\n        self.do_voken_reg = False\n        self.do_voken_ctr = False\n        self.shared_head = False\n        self.verbose = False\n\n\nclass BertSharedHead(BertOnlyMLMHead):\n    \"\"\"Bert Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.do_voken_cls = config.do_voken_cls\n        self.do_voken_ctr = config.do_voken_ctr\n\n        assert int(self.do_voken_cls) + int(self.do_voken_ctr) == 1\n        if self.do_voken_cls:\n            self.visn_decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)\n\n        if self.do_voken_ctr:\n            self.visn_decoder = nn.Linear(config.voken_dim, config.hidden_size, bias=True)\n\n    def forward(self, features, **kwargs):\n        \"\"\"\n        :param features: [batch, length, dim]\n        :return: lang_scores [batch, length, vocab_size],\n                 visn_scores [batch, length, voken_size]\n        \"\"\"\n        x = self.predictions.transform(features)    # batch_size, length, dim\n\n        lang_scores = self.predictions.decoder(x) + self.predictions.bias\n\n        if self.do_voken_cls:\n            visn_scores = self.visn_decoder(x)\n        elif self.do_voken_ctr:\n            voken_feats = kwargs['voken_feats']\n            y = self.visn_decoder(voken_feats)  # voken_size, dim\n            visn_scores = torch.einsum('bik,jk->bij', x, y)\n        else:\n            assert False\n\n        return lang_scores, visn_scores\n\n\nclass BertVLMClassificationHead(nn.Module):\n    \"\"\"Bert Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.voken_size, bias=True)\n        # self.decoder = nn.Sequential(\n        #     nn.Linear(config.hidden_size, 256, bias=True),\n        #     nn.Linear(256, config.voken_size, bias=True),\n        # )\n        if config.verbose:\n            print(f\"VLM Classification Head: Build model with voken_size {config.voken_size}\")\n\n    def forward(self, features, **kwargs):\n      
        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n        x = self.decoder(x)\n\n        return x\n\n\nclass BertVLMContrastiveHeadNew(nn.Module):\n    \"\"\"Bert head for voken contrastive learning (joint-dim variant).\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.joint_dim = 512\n        print(f\"Contrastive Head: Using joint dim {self.joint_dim}\")\n        self.voken_size = config.voken_size\n        self.dense = nn.Linear(config.hidden_size, self.joint_dim)\n        self.layer_norm_x = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)\n\n        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)\n        self.layer_norm_y = BertLayerNorm(self.joint_dim, eps=config.layer_norm_eps)\n\n    def forward(self, bert_output, voken_feats, **kwargs):\n        # Process the bert output\n        x = self.dense(bert_output)\n        x = gelu(x)\n        x = self.layer_norm_x(x)\n\n        # Process the pre-trained voken feats.\n        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, joint_dim]\n        y = self.layer_norm_y(y)\n\n        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)\n        assert score.dim() == 3 and score.shape[2] == self.voken_size\n\n        return score\n\n\nclass BertVLMContrastiveHead(nn.Module):\n    \"\"\"Bert head for voken contrastive learning.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.voken_size = config.voken_size\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.joint_dim = 64\n        self.decoder_bert_output = nn.Linear(config.hidden_size, self.joint_dim, bias=False)\n        self.decoder_voken_feat = nn.Linear(config.voken_dim, self.joint_dim, bias=False)\n\n    def forward(self, bert_output, voken_feats, **kwargs):\n        # Process the bert output\n        x = self.dense(bert_output)\n        x = gelu(x)\n        x = self.layer_norm(x)\n        x = self.decoder_bert_output(x)                   # [b, l, f] --> [b, l, 64]\n\n        # Process the pre-trained voken feats.\n        y = self.decoder_voken_feat(voken_feats)      # [v, f] --> [v, 64]\n\n        score = torch.einsum('ijf,vf->ijv', x, y) / math.sqrt(self.joint_dim)\n        assert score.dim() == 3 and score.shape[2] == self.voken_size\n\n        return score\n\n\nclass BertVLMRegressionHead(nn.Module):\n    \"\"\"Bert head for voken feature regression.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.voken_dim, bias=True)\n\n    def forward(self, features, **kwargs):\n        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n\n        # Project to the voken feature dimension (with bias).\n        x = self.decoder(x)\n\n        return x\n\n\nclass CoLwithBert(BertForMaskedLM):\n    config_class = CoLBertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.do_voken_cls = config.do_voken_cls\n        self.do_voken_reg = config.do_voken_reg\n        self.do_voken_ctr = config.do_voken_ctr\n        self.shared_head = config.shared_head\n        self.verbose = config.verbose\n\n        if self.verbose:\n            print(f\"Model: do
 voken cls -- {self.do_voken_cls}, do_voken_reg -- {self.do_voken_reg},\"\n                  f\" do voken ctr -- {self.do_voken_ctr}\")\n\n        self.token_cls_loss_fct = CrossEntropyLoss()\n\n        if self.shared_head:\n            if self.verbose:\n                print(\"Model: Using shared head for Voken and Token predictions.\")\n            self.cls = BertSharedHead(config)\n            # Reinit the weight of the new head.\n            self.init_weights()\n        else:\n            # Voken Classification\n            if config.do_voken_cls:\n                self.visual_cls_head = BertVLMClassificationHead(config)\n\n            # Voken Regression\n            if config.do_voken_reg:\n                assert config.voken_dim is not None, \"You need to set voken_dim in the config.\"\n                self.visual_reg_head = BertVLMRegressionHead(config)\n\n            # Voken Contrastive\n            if config.do_voken_ctr:\n                assert config.voken_dim is not None, \"You need to set voken_dim in the config.\"\n                self.visual_ctr_head = BertVLMContrastiveHeadNew(config)\n\n        # Build voken feature embeddings if needed.\n        if self.do_voken_ctr or self.do_voken_reg:\n            # The voken emb will be preloaded by func \"init_voken_feat_emb\"\n            self.voken_feat_emb = nn.Embedding(\n                config.voken_size,\n                config.voken_dim\n            )\n            # Freeze this embedding\n            for p in self.voken_feat_emb.parameters():\n                p.requires_grad = False\n\n        # Build Loss functions\n        if config.do_voken_cls:\n            # Voken Classification\n            self.voken_cls_loss_fct = CrossEntropyLoss()\n        if config.do_voken_reg:\n            # Voken Regression\n            self.voken_reg_loss_fct = SmoothL1Loss(reduction='none')\n            # self.voken_reg_loss_fct = torch.nn.L1Loss(reduction='none')\n        if config.do_voken_ctr:\n            # Voken Contrastive\n            self.voken_ctr_loss_fct = CrossEntropyLoss()\n\n    def init_voken_feat_emb(self, feats):\n        if self.verbose:\n            print(f\"Model: load the voken features with shape {feats.shape}\")\n            print(\"\\tBefore Loading, std and mean are: \", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())\n        assert feats.shape == (self.config.voken_size, self.config.voken_dim)\n        self.voken_feat_emb.weight.data[:] = torch.Tensor(feats)\n        self.original_voken_feats = torch.Tensor(feats).clone()\n        self.original_voken_feats = self.original_voken_feats.half()\n        if self.verbose:\n            print(\"\\tAfter Loading, std and mean are: \", self.voken_feat_emb.weight.std(), self.voken_feat_emb.weight.mean())\n            print(\"\\tThe 1st, 2nd, and last voken feats are: \")\n            print(\"\\t\", self.voken_feat_emb.weight[0])\n            print(\"\\t\", self.voken_feat_emb.weight[1])\n            print(\"\\t\", self.voken_feat_emb.weight[-1])\n        assert not self.voken_feat_emb.weight.requires_grad\n        # print(self.voken_feat_emb.weight.dtype)\n        # assert torch.all(torch.eq(self.voken_feat_emb.weight.cuda(),\n        #                           self.original_voken_feats)), \"The voken feats have been updated during training.\"\n\n    def to(self, *args):\n        if self.do_voken_ctr or self.do_voken_reg:\n            self.original_voken_feats = self.original_voken_feats.to(*args)\n        return super().to(*args)\n\n    def forward(\n     
       self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            masked_lm_labels=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            lm_labels=None,\n            voken_labels=None,\n    ):\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n        sequence_output = outputs[0]\n\n        if not self.shared_head:\n            voken_loss = 0.\n            if self.do_voken_cls:\n                assert voken_labels is not None\n                voken_scores = self.visual_cls_head(sequence_output)\n                voken_cls_loss = self.voken_cls_loss_fct(voken_scores.view(-1, self.config.voken_size), voken_labels.view(-1))\n                voken_loss += voken_cls_loss\n\n            if self.do_voken_reg:\n                assert voken_labels is not None\n                voken_prediction = self.visual_reg_head(sequence_output)\n\n                # Get the mask and pre-trained features\n                voken_label_mask = (voken_labels != -100)               # Get a mask of [0, 1, 1, ...., 1, 0], [b, len]\n                safe_voken_labels = voken_labels.clone()\n                safe_voken_labels[~voken_label_mask] = 0\n                voken_feats = self.voken_feat_emb(safe_voken_labels)         # [b, len] --> [b, len, f]\n\n                # Loss\n                voken_reg_loss = self.voken_reg_loss_fct(voken_prediction, voken_feats)   # [b, len, f]\n\n                # [b, l, f] * ([b,l] --> [b, l, 1]) = [b, l, f]\n                voken_reg_loss = (voken_reg_loss * voken_label_mask.float().unsqueeze(-1))\n\n                # [b, l, f] --sum-> [b, l] --mean-> [1,]\n                voken_reg_loss = voken_reg_loss.sum(-1).mean()\n\n                voken_loss += voken_reg_loss\n\n            if self.do_voken_ctr:\n                assert torch.all(torch.eq(self.voken_feat_emb.weight,\n                                          self.original_voken_feats)), \"The voken feats have been updated during training.\"\n\n                voken_scores = self.visual_ctr_head(\n                    sequence_output, self.voken_feat_emb.weight\n                )\n                voken_ctr_loss = self.voken_ctr_loss_fct(\n                    voken_scores.view(-1, self.config.voken_size),\n                    voken_labels.view(-1)\n                )\n                voken_loss += voken_ctr_loss\n\n            if masked_lm_labels is not None:\n                prediction_scores = self.cls(sequence_output)\n                token_loss = self.token_cls_loss_fct(\n                    prediction_scores.view(-1, self.config.vocab_size),\n                    masked_lm_labels.view(-1))\n            else:\n                token_loss = torch.tensor(0.)\n        else:\n            voken_loss, token_loss = self.calculate_shared_loss(\n                sequence_output,\n                masked_lm_labels,\n                voken_labels,\n            )\n\n        return voken_loss, token_loss\n\n    def calculate_shared_loss(self, sequence_output, masked_lm_labels, voken_labels):\n        if self.do_voken_cls:\n 
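           # The shared head returns language-token and voken scores in a single pass.\n 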
           lang_scores, visn_scores = self.cls(sequence_output)\n        else:\n            lang_scores, visn_scores = self.cls(\n                sequence_output,\n                voken_feats=self.voken_feat_emb.weight\n            )\n\n        assert voken_labels is not None\n\n        voken_loss_func = self.voken_cls_loss_fct if self.do_voken_cls else self.voken_ctr_loss_fct\n        voken_loss = voken_loss_func(\n            visn_scores.view(-1, self.config.voken_size),\n            voken_labels.view(-1)\n        )\n\n        if masked_lm_labels is not None:\n            token_loss = self.token_cls_loss_fct(\n                lang_scores.view(-1, self.config.vocab_size),\n                masked_lm_labels.view(-1)\n            )\n        else:\n            token_loss = torch.tensor(0.)\n\n        return voken_loss, token_loss\n\n\nclass SimpleBertForMaskedLM(BertForMaskedLM):\n\n    def __init__(self, config):\n        super().__init__(config)\n\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            masked_lm_labels=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            lm_labels=None,\n    ):\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n        sequence_output = outputs[0]\n\n        prediction_scores = self.cls(sequence_output)\n        loss_fct = CrossEntropyLoss()\n        token_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n\n        return token_loss,\n"
  },
  {
    "path": "vlm/param.py",
    "content": "import argparse\n\n\ndef process_args():\n    parser = argparse.ArgumentParser()\n\n    # Datasets\n    parser.add_argument(\n        \"--train_data_file\", default=None, type=str,\n        help=\"The input training data file (a text file).\")\n    parser.add_argument(\n        \"--eval_data_file\", default=None, type=str,\n        help=\"An optional input evaluation data file to evaluate the perplexity on (a text file).\")\n    parser.add_argument(\"--do_train\", action=\"store_true\", help=\"Whether to run training.\")\n    parser.add_argument(\"--do_eval\", action=\"store_true\", help=\"Whether to run eval on the dev set.\")\n\n    # Data loader\n    parser.add_argument(\"--col_data\", action=\"store_true\", help=\"Using the specific dataset object in data.py\")\n    parser.add_argument(\"--split_sent\", action=\"store_true\", help=\"Overwrite the cached training and evaluation sets\")\n    parser.add_argument(\"--shuffle\", action=\"store_true\", help=\"Shuffle the training dataset\")\n    parser.add_argument(\n        \"--block_size\", default=-1, type=int,\n        help=\"Optional input sequence length after tokenization.\"\n             \"The training dataset will be truncated in block of this size for training.\"\n             \"Default to the model max input length for single sentence inputs (take into account special tokens).\",\n    )\n\n    # Logging and Saving\n    parser.add_argument(\"--logging_steps\", type=int, default=500, help=\"Log every X updates steps.\")\n    parser.add_argument(\n        \"--output_dir\", type=str,\n        help=\"The output directory where the model predictions and checkpoints will be written.\",)\n    parser.add_argument(\n        \"--overwrite_output_dir\", action=\"store_true\",\n        help=\"Overwrite the content of the output directory\")\n\n    # Model types\n    parser.add_argument(\n        \"--model_type\", type=str, help=\"The model architecture to be trained or fine-tuned.\",)\n    parser.add_argument(\n        \"--should_continue\", action=\"store_true\", help=\"Whether to continue from latest checkpoint in output_dir\")\n    parser.add_argument(\n        \"--model_name_or_path\", default=None, type=str,\n        help=\"The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.\",)\n    parser.add_argument(\n        \"--config_name\", default=None, type=str,\n        help=\"Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.\",)\n    parser.add_argument(\n        \"--tokenizer_name\", default=None, type=str,\n        help=\"Optional pretrained tokenizer name or path if not the same as model_name_or_path. 
 If both are None, initialize a new tokenizer.\",)\n    parser.add_argument(\n        \"--cache_dir\", default=None, type=str,\n        help=\"Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)\",)\n    parser.add_argument(\n        \"--overwrite_cache\", action=\"store_true\",\n        help=\"Overwrite the cached training and evaluation sets\")\n\n    # MLM tasks\n    parser.add_argument(\n        \"--mlm\", action=\"store_true\", help=\"Train with masked-language modeling loss instead of language modeling.\")\n    parser.add_argument(\n        \"--mlm_probability\", type=float, default=0.15, help=\"Ratio of tokens to mask for masked language modeling loss\")\n    parser.add_argument(\n        \"--mlm_ratio\", type=float, default=1., help=\"The ratio of mlm loss in the total loss.\")\n\n    # VLM related params\n    parser.add_argument(\"--voken_dir\", type=str, default='snap1/coco_hinge05_dim64_resxt101_robertal4/vokens',\n                        help='Where the vokens are saved')\n    parser.add_argument(\"--voken_suffix\", type=str, default='vg_nococo.10000',\n                        help='The suffix of the voken files, e.g., en.train.raw.{suffix} where suffix==vg_nococo.10000')\n    parser.add_argument(\"--voken_labels\", type=str, default='all',\n                        help='all: Calculate voken loss for all tokens; '\n                             'mask: Calculate voken loss for masked tokens; '\n                             'nonmask: Calculate voken loss for non-masked tokens.')\n    parser.add_argument(\"--voken_feat_dir\", type=str, default=None,\n                        help='Where the voken features are saved')\n    parser.add_argument(\"--do_voken_cls\", action='store_true', help='Will do voken classification task')\n    parser.add_argument(\"--do_voken_reg\", action='store_true', help='Will do voken regression task (not used in this paper)')\n    parser.add_argument(\"--do_voken_ctr\", action='store_true', help='Will do voken contrastive task (not used in this paper)')\n    parser.add_argument(\"--shared_head\", action='store_true', help='Share the head if more than one task (e.g., cls, reg, ctr) is used (not used in this paper)')\n\n    # Batch Size and Training Steps\n    parser.add_argument(\"--seed\", type=int, default=95, help=\"random seed for initialization\")\n    parser.add_argument(\"--per_gpu_train_batch_size\", default=4, type=int, help=\"Batch size per GPU/CPU for training.\")\n    parser.add_argument(\"--per_gpu_eval_batch_size\", default=4, type=int, help=\"Batch size per GPU/CPU for evaluation.\")\n    parser.add_argument(\"--gradient_accumulation_steps\", type=int, default=1,\n        help=\"Number of update steps to accumulate before performing a backward/update pass.\",)\n    parser.add_argument(\"--num_train_epochs\", default=1.0, type=float, help=\"Total number of training epochs to perform.\")\n    parser.add_argument(\"--max_steps\", default=-1, type=int,\n        help=\"If > 0: set total number of training steps to perform.
 Override num_train_epochs.\",)\n\n    # Optimizer\n    parser.add_argument(\"--lamb\", action=\"store_true\", help='Use the LAMB optimizer in apex')\n    parser.add_argument(\"--warmup_steps\", default=0, type=int, help=\"Linear warmup over warmup_steps.\")\n    parser.add_argument(\"--warmup_ratio\", default=0., type=float, help=\"Linear warmup over this ratio of the total training steps.\")\n    parser.add_argument(\"--learning_rate\", default=5e-5, type=float, help=\"The initial learning rate for Adam.\")\n    parser.add_argument(\"--weight_decay\", default=0.01, type=float, help=\"Weight decay if we apply some.\")\n    parser.add_argument(\"--adam_epsilon\", default=1e-6, type=float, help=\"Epsilon for Adam optimizer.\")\n    parser.add_argument(\"--max_grad_norm\", default=1.0, type=float, help=\"Max gradient norm.\")\n\n    # Distributed Training\n    parser.add_argument(\"--local_rank\", type=int, default=-1, help=\"For distributed training: local_rank\")\n    parser.add_argument(\"--nodes\", type=int, default=1)\n    parser.add_argument(\"--nr\", type=int, default=0)\n\n    # Half Precision\n    parser.add_argument(\n        \"--fp16\", action=\"store_true\",\n        help=\"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\",)\n    parser.add_argument(\n        \"--fp16_opt_level\", type=str, default=\"O1\",\n        help=\"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. \"\n             \"See details at https://nvidia.github.io/apex/amp.html\",)\n\n    # Ablation Study\n    parser.add_argument(\"--voken_ablation\", default=None,\n                        help=\"random, shuffle, reverse, token\")\n\n\n    args = parser.parse_args()\n    return args\n"
  },
  {
    "path": "vlm/run_glue.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa).\"\"\"\n\n\nimport argparse\nimport glob\nimport json\nimport logging\nimport os\nimport random\n\nimport numpy as np\nimport torch\nfrom torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset\nfrom torch.utils.data.distributed import DistributedSampler\nfrom tqdm import tqdm, trange\n\nfrom transformers import (\n    MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n    WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelForSequenceClassification,\n    AutoTokenizer,\n    get_linear_schedule_with_warmup,\n    glue_compute_metrics as compute_metrics,\n    glue_convert_examples_to_features as convert_examples_to_features,\n    glue_output_modes as output_modes,\n    glue_processors as processors,\n)\n# from transformers import glue_compute_metrics as compute_metrics\n# from transformers import glue_convert_examples_to_features as convert_examples_to_features\n# from transformers import glue_output_modes as output_modes\n# from transformers import glue_processors as processors\n\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\nexcept ImportError:\n    from tensorboardX import SummaryWriter\n\n\nlogger = logging.getLogger(__name__)\n\n#MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys())\n#MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)\n\n#ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)\n\n\ndef set_seed(args):\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    torch.manual_seed(args.seed)\n    if args.n_gpu > 0:\n        torch.cuda.manual_seed_all(args.seed)\n\n\ndef train(args, train_dataset, model, tokenizer):\n    \"\"\" Train the model \"\"\"\n    # if args.local_rank in [-1, 0]:\n    #    tb_writer = SummaryWriter()\n\n    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)\n    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)\n    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)\n\n    if args.max_steps > 0:\n        t_total = args.max_steps\n        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1\n    else:\n        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs\n\n    # Prepare optimizer and schedule (linear warmup and decay)\n    no_decay = [\"bias\", \"LayerNorm.weight\"]\n    optimizer_grouped_parameters = [\n        {\n            \"params\": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],\n            
\"weight_decay\": args.weight_decay,\n        },\n        {\"params\": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], \"weight_decay\": 0.0},\n    ]\n\n    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)\n    num_warmup_steps = int(t_total * args.warmup_steps)\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total\n    )\n\n    # Check if saved optimizer or scheduler states exist\n    #if os.path.isfile(os.path.join(args.model_name_or_path, \"optimizer.pt\")) and os.path.isfile(\n        #os.path.join(args.model_name_or_path, \"scheduler.pt\")\n    #):\n        ## Load in optimizer and scheduler states\n        #optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"optimizer.pt\")))\n        #scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"scheduler.pt\")))\n\n    if args.fp16:\n        try:\n            from apex import amp\n        except ImportError:\n            raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)\n\n    # multi-gpu training (should be after apex fp16 initialization)\n    if args.n_gpu > 1:\n        model = torch.nn.DataParallel(model)\n\n    # Distributed training (should be after apex fp16 initialization)\n    if args.local_rank != -1:\n        model = torch.nn.parallel.DistributedDataParallel(\n            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,\n        )\n\n    # Train!\n    logger.info(\"***** Running training *****\")\n    logger.info(\"  Num examples = %d\", len(train_dataset))\n    logger.info(\"  Num Epochs = %d\", args.num_train_epochs)\n    logger.info(\"  Instantaneous batch size per GPU = %d\", args.per_gpu_train_batch_size)\n    logger.info(\n        \"  Total train batch size (w. 
 parallel, distributed & accumulation) = %d\",\n        args.train_batch_size\n        * args.gradient_accumulation_steps\n        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),\n    )\n    logger.info(\"  Gradient Accumulation steps = %d\", args.gradient_accumulation_steps)\n    logger.info(\"  Total optimization steps = %d\", t_total)\n\n    global_step = 0\n    epochs_trained = 0\n    steps_trained_in_current_epoch = 0\n    # Check if continuing training from a checkpoint\n    #if os.path.exists(args.model_name_or_path):\n        # set global_step to global_step of last saved checkpoint from model path\n        #try:\n            #global_step = int(args.model_name_or_path.split(\"-\")[-1].split(\"/\")[0])\n        #except ValueError:\n            #global_step = 0\n        #epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)\n        #steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)\n\n        #logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n        #logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n        #logger.info(\"  Continuing training from global step %d\", global_step)\n        #logger.info(\"  Will skip the first %d steps in the first epoch\", steps_trained_in_current_epoch)\n\n    tr_loss, logging_loss = 0.0, 0.0\n    model.zero_grad()\n    train_iterator = trange(\n        epochs_trained, int(args.num_train_epochs), desc=\"Epoch\", disable=args.local_rank not in [-1, 0],\n    )\n    set_seed(args)  # Added here for reproducibility\n    for _ in train_iterator:\n        epoch_iterator = tqdm(train_dataloader, desc=\"Iteration\", disable=args.local_rank not in [-1, 0])\n        for step, batch in enumerate(epoch_iterator):\n\n            # Skip past any already trained steps if resuming training\n            if steps_trained_in_current_epoch > 0:\n                steps_trained_in_current_epoch -= 1\n                continue\n\n            model.train()\n            batch = tuple(t.to(args.device) for t in batch)\n            inputs = {\"input_ids\": batch[0], \"attention_mask\": batch[1], \"labels\": batch[3]}\n            if args.model_type != \"distilbert\":\n                inputs[\"token_type_ids\"] = (\n                    batch[2] if args.model_type in [\"bert\", \"xlnet\", \"albert\"] else None\n                )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids\n            outputs = model(**inputs)\n            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)\n\n            if args.n_gpu > 1:\n                loss = loss.mean()  # mean() to average on multi-gpu parallel training\n            if args.gradient_accumulation_steps > 1:\n                loss = loss / args.gradient_accumulation_steps\n\n            if args.fp16:\n                with amp.scale_loss(loss, optimizer) as scaled_loss:\n                    scaled_loss.backward()\n            else:\n                loss.backward()\n\n            tr_loss += loss.item()\n            if (step + 1) % args.gradient_accumulation_steps == 0 or (\n                # last step in epoch but step is always smaller than gradient_accumulation_steps\n                len(epoch_iterator) <= args.gradient_accumulation_steps\n                and (step + 1) == len(epoch_iterator)\n            ):\n                if args.fp16:\n                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 
args.max_grad_norm)\n                else:\n                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)\n\n                optimizer.step()\n                scheduler.step()  # Update learning rate schedule\n                model.zero_grad()\n                global_step += 1\n\n                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:\n                    logs = {}\n                    if (\n                        args.local_rank == -1 and args.evaluate_during_training\n                    ):  # Only evaluate when single GPU otherwise metrics may not average well\n                        results = evaluate(args, model, tokenizer)\n                        for key, value in results.items():\n                            eval_key = \"eval_{}\".format(key)\n                            logs[eval_key] = value\n\n                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps\n                    learning_rate_scalar = scheduler.get_lr()[0]\n                    logs[\"learning_rate\"] = learning_rate_scalar\n                    logs[\"loss\"] = loss_scalar\n                    logging_loss = tr_loss\n\n                    #for key, value in logs.items():\n                        #tb_writer.add_scalar(key, value, global_step)\n                    print(json.dumps({**logs, **{\"step\": global_step}}))\n\n                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:\n                    # Save model checkpoint\n                    output_dir = os.path.join(args.output_dir, \"checkpoint-{}\".format(global_step))\n                    if not os.path.exists(output_dir):\n                        os.makedirs(output_dir)\n                    model_to_save = (\n                        model.module if hasattr(model, \"module\") else model\n                    )  # Take care of distributed/parallel training\n                    model_to_save.save_pretrained(output_dir)\n                    tokenizer.save_pretrained(output_dir)\n\n                    torch.save(args, os.path.join(output_dir, \"training_args.bin\"))\n                    logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n                    torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                    torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n                    logger.info(\"Saving optimizer and scheduler states to %s\", output_dir)\n\n            if args.max_steps > 0 and global_step > args.max_steps:\n                epoch_iterator.close()\n                break\n        if args.max_steps > 0 and global_step > args.max_steps:\n            train_iterator.close()\n            break\n\n    #if args.local_rank in [-1, 0]:\n        #tb_writer.close()\n\n    return global_step, tr_loss / global_step\n\n\ndef evaluate(args, model, tokenizer, prefix=\"\"):\n    # Loop to handle MNLI double evaluation (matched, mis-matched)\n    eval_task_names = (\"mnli\", \"mnli-mm\") if args.task_name == \"mnli\" else (args.task_name,)\n    eval_outputs_dirs = (args.output_dir, args.output_dir + \"-MM\") if args.task_name == \"mnli\" else (args.output_dir,)\n\n    results = {}\n    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):\n        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)\n\n        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:\n    
        os.makedirs(eval_output_dir)\n\n        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)\n        # Note that DistributedSampler samples randomly\n        eval_sampler = SequentialSampler(eval_dataset)\n        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)\n\n        # multi-gpu eval\n        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):\n            model = torch.nn.DataParallel(model)\n\n        # Eval!\n        logger.info(\"***** Running evaluation {} *****\".format(prefix))\n        logger.info(\"  Num examples = %d\", len(eval_dataset))\n        logger.info(\"  Batch size = %d\", args.eval_batch_size)\n        eval_loss = 0.0\n        nb_eval_steps = 0\n        preds = None\n        out_label_ids = None\n        for batch in tqdm(eval_dataloader, desc=\"Evaluating\"):\n            model.eval()\n            batch = tuple(t.to(args.device) for t in batch)\n\n            with torch.no_grad():\n                inputs = {\"input_ids\": batch[0], \"attention_mask\": batch[1], \"labels\": batch[3]}\n                if args.model_type != \"distilbert\":\n                    inputs[\"token_type_ids\"] = (\n                        batch[2] if args.model_type in [\"bert\", \"xlnet\", \"albert\"] else None\n                    )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids\n                outputs = model(**inputs)\n                tmp_eval_loss, logits = outputs[:2]\n\n                eval_loss += tmp_eval_loss.mean().item()\n            nb_eval_steps += 1\n            if preds is None:\n                preds = logits.detach().cpu().numpy()\n                out_label_ids = inputs[\"labels\"].detach().cpu().numpy()\n            else:\n                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)\n                out_label_ids = np.append(out_label_ids, inputs[\"labels\"].detach().cpu().numpy(), axis=0)\n\n        eval_loss = eval_loss / nb_eval_steps\n        if args.output_mode == \"classification\":\n            preds = np.argmax(preds, axis=1)\n        elif args.output_mode == \"regression\":\n            preds = np.squeeze(preds)\n        result = compute_metrics(eval_task, preds, out_label_ids)\n        results.update(result)\n\n        print(eval_output_dir, prefix)\n        output_eval_file = os.path.join(eval_output_dir, prefix, \"eval_results.txt\")\n        with open(output_eval_file, \"w\") as writer:\n            logger.info(\"***** Eval results {} *****\".format(prefix))\n            for key in sorted(result.keys()):\n                logger.info(\"  %s = %s\", key, str(result[key]))\n                writer.write(\"%s = %s\\n\" % (key, str(result[key])))\n\n    return results\n\n\ndef load_and_cache_examples(args, task, tokenizer, evaluate=False):\n    if args.local_rank not in [-1, 0] and not evaluate:\n        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache\n\n    processor = processors[task]()\n    output_mode = output_modes[task]\n    # Load data features from cache or dataset file\n    cached_features_file = os.path.join(\n        args.data_dir,\n        \"cached_{}_{}_{}_{}\".format(\n            \"dev\" if evaluate else \"train\",\n            #list(filter(None, args.model_name_or_path.split(\"/\"))).pop(),\n            args.tokenizer_name,\n            str(args.max_seq_length),\n            str(task),\n        ),\n    )\n    if
 os.path.exists(cached_features_file) and not args.overwrite_cache:\n        logger.info(\"Loading features from cached file %s\", cached_features_file)\n        features = torch.load(cached_features_file)\n    else:\n        logger.info(\"Creating features from dataset file at %s\", args.data_dir)\n        label_list = processor.get_labels()\n        if task in [\"mnli\", \"mnli-mm\"] and args.model_type in [\"roberta\", \"xlmroberta\"]:\n            # HACK(label indices are swapped in RoBERTa pretrained model)\n            label_list[1], label_list[2] = label_list[2], label_list[1]\n        examples = (\n            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)\n        )\n        features = convert_examples_to_features(\n            examples,\n            tokenizer,\n            label_list=label_list,\n            max_length=args.max_seq_length,\n            output_mode=output_mode,\n            # pad_on_left=bool(args.model_type in [\"xlnet\"]),  # pad on the left for xlnet\n            # pad_token=tokenizer.pad_token_id,\n            # pad_token_segment_id=tokenizer.pad_token_type_id,\n        )\n        if args.local_rank in [-1, 0]:\n            logger.info(\"Saving features into cached file %s\", cached_features_file)\n            torch.save(features, cached_features_file)\n    for i in range(3):\n        print('ids:', features[i].input_ids)\n        print('tokens:', tokenizer.convert_ids_to_tokens(features[i].input_ids))\n        print('att:', features[i].attention_mask)\n\n    if args.local_rank == 0 and not evaluate:\n        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache\n\n    # Convert to Tensors and build dataset\n    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)\n    if output_mode == \"classification\":\n        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)\n    elif output_mode == \"regression\":\n        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)\n\n    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)\n    return dataset\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n\n    # Required parameters\n    parser.add_argument(\n        \"--data_dir\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The input data dir. 
Should contain the .tsv files (or other data files) for the task.\",\n    )\n    parser.add_argument(\n        \"--model_type\",\n        default=None,\n        type=str,\n        required=True,\n        #help=\"Model type selected in the list: \" + \", \".join(MODEL_TYPES),\n    )\n    parser.add_argument(\n        \"--model_name_or_path\",\n        default=None,\n        type=str,\n        required=True,\n        #help=\"Path to pre-trained model or shortcut name selected in the list: \" + \", \".join(ALL_MODELS),\n    )\n    parser.add_argument(\n        \"--task_name\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The name of the task to train selected in the list: \" + \", \".join(processors.keys()),\n    )\n    parser.add_argument(\n        \"--output_dir\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The output directory where the model predictions and checkpoints will be written.\",\n    )\n\n    # Other parameters\n    parser.add_argument(\n        \"--config_name\", default=\"\", type=str, help=\"Pretrained config name or path if not the same as model_name\",\n    )\n    parser.add_argument(\n        \"--tokenizer_name\",\n        default=\"\",\n        type=str,\n        help=\"Pretrained tokenizer name or path if not the same as model_name\",\n    )\n    parser.add_argument(\n        \"--cache_dir\",\n        default=\"\",\n        type=str,\n        help=\"Where do you want to store the pre-trained models downloaded from s3\",\n    )\n    parser.add_argument(\n        \"--max_seq_length\",\n        default=128,\n        type=int,\n        help=\"The maximum total input sequence length after tokenization. Sequences longer \"\n        \"than this will be truncated, sequences shorter will be padded.\",\n    )\n    parser.add_argument(\"--do_train\", action=\"store_true\", help=\"Whether to run training.\")\n    parser.add_argument(\"--do_eval\", action=\"store_true\", help=\"Whether to run eval on the dev set.\")\n    parser.add_argument(\n        \"--evaluate_during_training\", action=\"store_true\", help=\"Run evaluation during training at each logging step.\",\n    )\n    parser.add_argument(\n        \"--do_lower_case\", action=\"store_true\", help=\"Set this flag if you are using an uncased model.\",\n    )\n\n    parser.add_argument(\n        \"--per_gpu_train_batch_size\", default=8, type=int, help=\"Batch size per GPU/CPU for training.\",\n    )\n    parser.add_argument(\n        \"--per_gpu_eval_batch_size\", default=8, type=int, help=\"Batch size per GPU/CPU for evaluation.\",\n    )\n    parser.add_argument(\n        \"--gradient_accumulation_steps\",\n        type=int,\n        default=1,\n        help=\"Number of updates steps to accumulate before performing a backward/update pass.\",\n    )\n    parser.add_argument(\"--learning_rate\", default=5e-5, type=float, help=\"The initial learning rate for Adam.\")\n    parser.add_argument(\"--weight_decay\", default=0.0, type=float, help=\"Weight decay if we apply some.\")\n    parser.add_argument(\"--adam_epsilon\", default=1e-6, type=float, help=\"Epsilon for Adam optimizer.\")\n    parser.add_argument(\"--max_grad_norm\", default=1.0, type=float, help=\"Max gradient norm.\")\n    parser.add_argument(\n        \"--num_train_epochs\", default=3.0, type=float, help=\"Total number of training epochs to perform.\",\n    )\n    parser.add_argument(\n        \"--max_steps\",\n        default=-1,\n        type=int,\n        help=\"If > 0: set total number 
of training steps to perform. Overrides num_train_epochs.\",\n    )\n    parser.add_argument(\"--warmup_steps\", default=0, type=int, help=\"Linear warmup over warmup_steps.\")\n\n    parser.add_argument(\"--logging_steps\", type=int, default=500, help=\"Log every X update steps.\")\n    parser.add_argument(\"--save_steps\", type=int, default=500, help=\"Save checkpoint every X update steps.\")\n    parser.add_argument(\n        \"--eval_all_checkpoints\",\n        action=\"store_true\",\n        help=\"Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number\",\n    )\n    parser.add_argument(\"--no_cuda\", action=\"store_true\", help=\"Avoid using CUDA when available\")\n    parser.add_argument(\"--from_scratch\", action=\"store_true\", help=\"Train from scratch instead of loading pre-trained weights\")\n    parser.add_argument(\n        \"--overwrite_output_dir\", action=\"store_true\", help=\"Overwrite the content of the output directory\",\n    )\n    parser.add_argument(\n        \"--overwrite_cache\", action=\"store_true\", help=\"Overwrite the cached training and evaluation sets\",\n    )\n    parser.add_argument(\n        \"--nopooler\", action=\"store_true\", help=\"Do not load the pooler\",\n    )\n    parser.add_argument(\"--seed\", type=int, default=9595, help=\"random seed for initialization\")\n\n    parser.add_argument(\n        \"--fp16\",\n        action=\"store_true\",\n        help=\"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\",\n    )\n    parser.add_argument(\n        \"--fp16_opt_level\",\n        type=str,\n        default=\"O1\",\n        help=\"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. \"\n        \"See details at https://nvidia.github.io/apex/amp.html\",\n    )\n    parser.add_argument(\"--local_rank\", type=int, default=-1, help=\"For distributed training: local_rank\")\n    parser.add_argument(\"--server_ip\", type=str, default=\"\", help=\"For distant debugging.\")\n    parser.add_argument(\"--server_port\", type=str, default=\"\", help=\"For distant debugging.\")\n    args = parser.parse_args()\n\n    if (\n        os.path.exists(args.output_dir)\n        and os.listdir(args.output_dir)\n        and args.do_train\n        and not args.overwrite_output_dir\n    ):\n        raise ValueError(\n            \"Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.\".format(\n                args.output_dir\n            )\n        )\n\n    # Setup distant debugging if needed\n    if args.server_ip and args.server_port:\n        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script\n        import ptvsd\n\n        print(\"Waiting for debugger attach\")\n        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)\n        ptvsd.wait_for_attach()\n\n    # Setup CUDA, GPU & distributed training\n    if args.local_rank == -1 or args.no_cuda:\n        device = torch.device(\"cuda\" if torch.cuda.is_available() and not args.no_cuda else \"cpu\")\n        args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()\n    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs\n        torch.cuda.set_device(args.local_rank)\n        device = torch.device(\"cuda\", args.local_rank)\n        torch.distributed.init_process_group(backend=\"nccl\")\n        args.n_gpu = 1\n    args.device = device\n\n    # Setup logging\n    logging.basicConfig(\n        format=\"%(asctime)s - %(levelname)s - %(name)s -   %(message)s\",\n        datefmt=\"%m/%d/%Y %H:%M:%S\",\n        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,\n    )\n    logger.warning(\n        \"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s\",\n        args.local_rank,\n        device,\n        args.n_gpu,\n        bool(args.local_rank != -1),\n        args.fp16,\n    )\n\n    # Set seed\n    set_seed(args)\n\n    # Prepare GLUE task\n    args.task_name = args.task_name.lower()\n    if args.task_name not in processors:\n        raise ValueError(\"Task not found: %s\" % (args.task_name))\n    processor = processors[args.task_name]()\n    args.output_mode = output_modes[args.task_name]\n    label_list = processor.get_labels()\n    num_labels = len(label_list)\n\n    # Load pretrained model and tokenizer\n    if args.local_rank not in [-1, 0]:\n        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab\n\n    args.model_type = args.model_type.lower()\n    config = AutoConfig.from_pretrained(\n        args.config_name if args.config_name else args.model_name_or_path,\n        num_labels=num_labels,\n        finetuning_task=args.task_name,\n        cache_dir=args.cache_dir if args.cache_dir else None,\n    )\n    tokenizer = AutoTokenizer.from_pretrained(\n        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,\n        do_lower_case=args.do_lower_case,\n        cache_dir=args.cache_dir if args.cache_dir else None,\n    )\n    model = AutoModelForSequenceClassification.from_pretrained(\n        args.model_name_or_path,\n        from_tf=bool(\".ckpt\" in args.model_name_or_path),\n        config=config,\n        cache_dir=args.cache_dir if args.cache_dir else None,\n    )\n\n    if args.nopooler:\n        model.bert.pooler.apply(model._init_weights)\n\n    if args.local_rank == 0:\n        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab\n\n    model.to(args.device)\n\n    logger.info(\"Training/evaluation parameters %s\", args)\n\n    # Training\n    if args.do_train:\n        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)\n        global_step, tr_loss = train(args, train_dataset, model, 
tokenizer)\n        logger.info(\" global_step = %s, average loss = %s\", global_step, tr_loss)\n\n    # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()\n    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):\n        # Create output directory if needed\n        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:\n            os.makedirs(args.output_dir)\n\n        logger.info(\"Saving model checkpoint to %s\", args.output_dir)\n        # Save a trained model, configuration and tokenizer using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        model_to_save = (\n            model.module if hasattr(model, \"module\") else model\n        )  # Take care of distributed/parallel training\n        model_to_save.save_pretrained(args.output_dir)\n        tokenizer.save_pretrained(args.output_dir)\n\n        # Good practice: save your training arguments together with the trained model\n        torch.save(args, os.path.join(args.output_dir, \"training_args.bin\"))\n\n        # Load a trained model and vocabulary that you have fine-tuned\n        model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)\n        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)\n        model.to(args.device)\n\n    # Evaluation\n    results = {}\n    if args.do_eval and args.local_rank in [-1, 0]:\n        tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)\n        checkpoints = [args.output_dir]\n        if args.eval_all_checkpoints:\n            checkpoints = list(\n                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + \"/**/\" + WEIGHTS_NAME, recursive=True))\n            )\n            logging.getLogger(\"transformers.modeling_utils\").setLevel(logging.WARN)  # Reduce logging\n        logger.info(\"Evaluate the following checkpoints: %s\", checkpoints)\n        for checkpoint in checkpoints:\n            global_step = checkpoint.split(\"-\")[-1] if len(checkpoints) > 1 else \"\"\n            prefix = checkpoint.split(\"/\")[-1] if checkpoint.find(\"checkpoint\") != -1 else \"\"\n            prefix = prefix if 'checkpoint' in prefix else ''\n\n            model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n            model.to(args.device)\n            result = evaluate(args, model, tokenizer, prefix=prefix)\n            result = dict((k + \"_{}\".format(global_step), v) for k, v in result.items())\n            results.update(result)\n\n    return results\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
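The `--eval_all_checkpoints` branch of `vlm/run_glue.py` above discovers checkpoints by globbing for the saved weights file under the output directory and parsing the step suffix from each directory name. Below is a minimal, self-contained sketch of that discovery logic; the hard-coded `WEIGHTS_NAME` value and the `snap/glue_output` path are assumptions for illustration only.

```python
import glob
import os

WEIGHTS_NAME = "pytorch_model.bin"  # assumption: mirrors transformers' WEIGHTS_NAME constant

def find_checkpoints(output_dir):
    """Return (checkpoint_dir, step_suffix) pairs, mirroring --eval_all_checkpoints."""
    pairs = []
    for weights in sorted(glob.glob(os.path.join(output_dir, "**", WEIGHTS_NAME), recursive=True)):
        ckpt_dir = os.path.dirname(weights)
        name = os.path.basename(ckpt_dir)
        # "checkpoint-500" -> "500"; a directory without the prefix gets an empty suffix.
        step = name.split("-")[-1] if "checkpoint" in name else ""
        pairs.append((ckpt_dir, step))
    return pairs

if __name__ == "__main__":
    for ckpt_dir, step in find_checkpoints("snap/glue_output"):  # hypothetical output dir
        print(step or "final", ckpt_dir)
```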
  {
    "path": "vlm/run_glue_epochs.py",
    "content": "import argparse\nimport math\nimport os\nfrom pathlib import Path\nfrom pprint import pprint\nimport subprocess\nimport threading\nimport time\n\nimport torch\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    \"--load\", default=None, type=str,\n    help=\"The model loaded, e.g., snap/vlm/wiki103_small\"\n)\nparser.add_argument(\n    \"--gpus\", default=None, type=str,\n    help=\"The list of GPU ids, separated by comma, e.g., '2,3'\"\n)\nparser.add_argument(\n    \"--snaps\", default=1, type=int,\n    help=\"The number of snaps evaluated with GLUE benchmark. \"\n         \"-1 means all.\"\n)\nparser.add_argument(\n    \"--start-from\", default=0, type=int\n)\nargs = parser.parse_args()\n\nif args.gpus is None:\n    # Get all gpus available in this server.\n    num_gpus = torch.cuda.device_count()\n    # The device id are labeled from 0 to num_gpus-1.\n    available_gpus = list(range(num_gpus))\nelse:\n    available_gpus = [int(gpu_id) for gpu_id in args.gpus.split(\",\")]\n    num_gpus = len(available_gpus)\n\nresource = threading.Semaphore(num_gpus)\n\n\ndef get_snap_paths(load):\n    load_path = Path(load)\n    paths = []\n    for dir_path in load_path.iterdir():\n        if dir_path.name.startswith(\"checkpoint-\"):\n            paths.append(dir_path)\n    return paths\n\n\ndef sorted_paths(paths):\n    pathXkey = []\n    for path in paths:\n        name = path.name\n        identifier = name[len(\"checkpoint-\"):]\n        if identifier == 'last':\n            continue\n        if 'epoch' in identifier:\n            key = identifier\n        else:\n            key = int(identifier)\n        pathXkey.append((path, key))\n    pathXkey = sorted(pathXkey, key=lambda x: x[1])\n    paths = list(map(lambda x: x[0], pathXkey))\n    return paths\n\n\ndef get_test_paths(paths, snaps):\n    \"\"\"\n    Return $snaps paths to be tested on GLUE\n    \"\"\"\n    if snaps == -1:\n        return paths\n    interval = len(paths) * 1. 
/ snaps\n    test_paths = []\n    for i in range(1, snaps+1):\n        idx = int(math.ceil(interval * i)) - 1\n        test_paths.append(paths[idx])\n    return test_paths\n\n\n# Get all paths that need to be processed\npaths = get_snap_paths(args.load)\npaths = sorted_paths(paths)\npaths = paths[args.start_from:]\npaths = get_test_paths(paths, args.snaps)\npaths = paths[::-1]         # Run the last epochs first.\npath_lock = threading.Lock()\n\n\ndef run_glue():\n    while True:\n        # Only one atomic operation (list.pop) happens here, so no lock is needed.\n        # A semaphore is enough to control the resources.\n        resource.acquire()\n        gpu_id = available_gpus.pop(0)\n\n        # Involves multiple atomic operations (list.__len__, list.pop),\n        # thus a lock is introduced here.\n        path_lock.acquire()\n        if len(paths) > 0:\n            path = paths.pop(0)\n        else:\n            path_lock.release()\n            # Return the GPU and the semaphore slot before exiting, so they\n            # are not leaked when the path list is exhausted.\n            available_gpus.append(gpu_id)\n            resource.release()\n            break\n        path_lock.release()\n\n        model = path.parent\n        ckpt = path.name\n        print(gpu_id, model, ckpt)\n        process = subprocess.Popen(\n            ['bash',\n             'scripts/run_glue_at_epoch.bash',\n             str(gpu_id),    # Use GPU\n             '3',            # Number of epochs\n             model,\n             ckpt\n             ],\n            stdout=subprocess.PIPE,\n            stderr=subprocess.PIPE)\n        stdout, stderr = process.communicate()\n\n        available_gpus.append(gpu_id)\n        resource.release()\n\n        # Sleeping here allows the script (run_glue_at_epoch.bash) to finish,\n        # so that all GPU memory is cleared before the next job starts.\n        time.sleep(5)\n    return\n\n\n# Allocate #threads equal to #GPUs\nthreads = []\nfor _ in range(num_gpus):\n    threads.append(\n        threading.Thread(target=run_glue)\n    )\nfor thread in threads:\n    thread.start()\n\n# Join the threads so that the main thread waits for all of them to finish.\nfor thread in threads:\n    thread.join()\n"
  },
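`get_test_paths` in `vlm/run_glue_epochs.py` above selects `--snaps` evenly spaced checkpoints and, because it rounds the interval up, always includes the last one. A small worked example of that index rule (`idx = ceil(n / snaps * i) - 1` for `i = 1..snaps`), using only the arithmetic from the function body:

```python
import math

def pick_indices(n, snaps):
    """Indices chosen by get_test_paths for n checkpoints and `snaps` picks."""
    interval = n * 1. / snaps
    return [int(math.ceil(interval * i)) - 1 for i in range(1, snaps + 1)]

print(pick_indices(8, 3))   # [2, 5, 7] -> the 3rd, 6th, and last checkpoint
print(pick_indices(10, 1))  # [9]       -> only the final checkpoint
```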
  {
    "path": "vlm/run_lm_distributed.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\nFine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).\nGPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned\nusing a masked language modeling (MLM) loss.\n\"\"\"\n\n\nimport argparse\nimport glob\nimport json\nimport logging\nimport os\nimport pickle\nimport random\nimport re\nimport shutil\nimport sys\nfrom typing import Dict, List, Tuple\nfrom datetime import datetime\n\nimport numpy as np\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport torch.multiprocessing as mp\nfrom torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler\nfrom torch.utils.data.distributed import DistributedSampler\nfrom tqdm import tqdm, trange\nfrom transformers import (\n    WEIGHTS_NAME,\n    AdamW,\n    BertConfig,\n    BertForMaskedLM,\n    BertTokenizer,\n    CamembertConfig,\n    CamembertForMaskedLM,\n    CamembertTokenizer,\n    DistilBertConfig,\n    DistilBertForMaskedLM,\n    DistilBertTokenizer,\n    GPT2Config,\n    GPT2LMHeadModel,\n    GPT2Tokenizer,\n    OpenAIGPTConfig,\n    OpenAIGPTLMHeadModel,\n    OpenAIGPTTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    RobertaConfig,\n    RobertaForMaskedLM,\n    RobertaTokenizer,\n    get_linear_schedule_with_warmup,\n)\n\nsys.path.append(\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n)\nfrom vlm.data import CoLDataset\nfrom vlm.param import process_args\nfrom vlm.model import SimpleBertForMaskedLM\n\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\nexcept ImportError:\n    from tensorboardX import SummaryWriter\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_CLASSES = {\n    \"gpt2\": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),\n    \"openai-gpt\": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),\n    \"bert\": (BertConfig, SimpleBertForMaskedLM, BertTokenizer),\n    \"roberta\": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),\n    \"distilbert\": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),\n    \"camembert\": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),\n}\n\n\nclass TextDataset(Dataset):\n    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):\n        assert os.path.isfile(file_path)\n\n        block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)\n\n        directory, filename = os.path.split(file_path)\n        cached_features_file = os.path.join(\n            directory, args.model_type + \"_cached_lm_\" + str(block_size) + \"_\" + filename\n        )\n\n        if os.path.exists(cached_features_file) and not args.overwrite_cache:\n            logger.info(\"Loading features from cached file %s\", 
cached_features_file)\n            with open(cached_features_file, \"rb\") as handle:\n                self.examples = pickle.load(handle)\n        else:\n            logger.info(\"Creating features from dataset file at %s\", directory)\n\n            self.examples = []\n            with open(file_path, encoding=\"utf-8\") as f:\n                text = f.read()\n\n            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n\n            for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in blocks of block_size\n                self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))\n            # Note that we are losing the last truncated example here for the sake of simplicity (no padding)\n            # If your dataset is small, first you should look for a bigger one :-) and second you\n            # can change this behavior by adding (model specific) padding.\n\n            logger.info(\"Saving features into cached file %s\", cached_features_file)\n            with open(cached_features_file, \"wb\") as handle:\n                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, item):\n        return torch.tensor(self.examples[item], dtype=torch.long)\n\n\nclass LineByLineTextDataset(Dataset):\n    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):\n        assert os.path.isfile(file_path)\n        # Here, we do not cache the features, operating under the assumption\n        # that we will soon use fast multithreaded tokenizers from the\n        # `tokenizers` repo everywhere =)\n        logger.info(\"Creating features from dataset file at %s\", file_path)\n\n        with open(file_path, encoding=\"utf-8\") as f:\n            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]\n\n        self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)[\"input_ids\"]\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i):\n        return torch.tensor(self.examples[i], dtype=torch.long)\n\n\ndef load_and_cache_examples(args, tokenizer, evaluate=False):\n    file_path = args.eval_data_file if evaluate else args.train_data_file\n    if args.col_data:\n        return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,\n                          split_sent=args.split_sent,\n                          verbose=(args.gpu == 0))\n    elif args.line_by_line:\n        return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)\n    else:\n        return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)\n\n\ndef set_seed(args):\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    torch.manual_seed(args.seed)\n\n\ndef mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:\n    \"\"\" Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. \"\"\"\n\n    if tokenizer.mask_token is None:\n        raise ValueError(\n            \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n        )\n\n    labels = inputs.clone()\n    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, which defaults to 0.15 in BERT/RoBERTa)\n    probability_matrix = torch.full(labels.shape, args.mlm_probability)\n    special_tokens_mask = [\n        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n    ]\n    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n    if tokenizer._pad_token is not None:\n        padding_mask = labels.eq(tokenizer.pad_token_id)\n        probability_matrix.masked_fill_(padding_mask, value=0.0)\n    masked_indices = torch.bernoulli(probability_matrix).bool()\n    labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)\n\n    # 10% of the time, we replace masked input tokens with a random word\n    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)\n    inputs[indices_random] = random_words[indices_random]\n\n    # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n    return inputs, labels\n\n\ndef train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:\n    \"\"\" Train the model \"\"\"\n    set_seed(args)  # Added here for reproducibility\n\n    if args.gpu == 0:\n        current_time = datetime.now().strftime('%b%d_%H-%M-%S')\n        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)\n\n    args.train_batch_size = args.per_gpu_train_batch_size\n\n    def collate(examples: List[torch.Tensor]):\n        if tokenizer._pad_token is None:\n            return pad_sequence(examples, batch_first=True)\n        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)\n\n    if args.shuffle:\n        logger.info(f\"Shuffle the dataset in training, \"\n                       f\"GPU: {args.gpu}, \"\n                       f\"Rank: {args.rank}, \"\n                       f\"Total: {args.world_size}\")\n    train_sampler = DistributedSampler(\n        train_dataset,\n        num_replicas=args.world_size,\n        rank=args.rank,\n        shuffle=args.shuffle,\n    )\n    train_dataloader = DataLoader(\n        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,\n        batch_size=args.train_batch_size, collate_fn=collate, pin_memory=True\n    )\n\n    if args.max_steps > 0:\n        t_total = args.max_steps\n        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1\n    else:\n        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs\n\n    # Prepare optimizer and schedule (linear warmup and decay)\n    no_decay = [\"bias\", \"LayerNorm.weight\"]\n    optimizer_grouped_parameters = [\n        {\n            \"params\": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],\n            \"weight_decay\": args.weight_decay,\n        },\n        {\"params\": [p for n, p in model.named_parameters() if 
any(nd in n for nd in no_decay)], \"weight_decay\": 0.0},\n    ]\n    optimizer = AdamW(optimizer_grouped_parameters,\n                      # betas=(0.9, 0.98),\n                      lr=args.learning_rate,\n                      eps=args.adam_epsilon)\n    if args.warmup_ratio > 0.:\n        assert args.warmup_steps == 0\n        args.warmup_steps = int(t_total * args.warmup_ratio)\n    if args.gpu == 0:\n        print(\"Optimized with lr %f, total steps %d, warmup steps %d, epsilon %0.8f, and betas\" % (\n            args.learning_rate, t_total, args.warmup_steps, optimizer.defaults['eps']\n        ), optimizer.defaults['betas'])\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total\n    )\n\n    # Check if saved optimizer or scheduler states exist\n    if (\n        args.model_name_or_path\n        and os.path.isfile(os.path.join(args.model_name_or_path, \"optimizer.pt\"))\n        and os.path.isfile(os.path.join(args.model_name_or_path, \"scheduler.pt\"))\n    ):\n        # Load in optimizer and scheduler states\n        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"optimizer.pt\")))\n        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"scheduler.pt\")))\n\n    if args.fp16:\n        try:\n            from apex import amp\n        except ImportError:\n            raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level,\n                                          verbosity=0)\n        from apex.parallel import DistributedDataParallel as DDP\n        model = DDP(model)\n    else:\n        model = torch.nn.parallel.DistributedDataParallel(\n            model, device_ids=[args.gpu], find_unused_parameters=True\n        )\n\n    # Train!\n    logger.info(\"***** Running training *****\")\n    logger.info(\"  Num examples = %d\", len(train_dataset))\n    logger.info(\"  Num Epochs = %d\", args.num_train_epochs)\n    logger.info(\"  Instantaneous batch size per GPU = %d\", args.per_gpu_train_batch_size)\n    logger.info(\n        \"  Total train batch size (w. 
distributed & accumulation) = %d\",\n        args.train_batch_size\n        * args.gradient_accumulation_steps\n        * args.world_size\n    )\n    logger.info(\"  Gradient Accumulation steps = %d\", args.gradient_accumulation_steps)\n    logger.info(\"  Total optimization steps = %d\", t_total)\n\n    global_step = 0\n    epochs_trained = 0\n    # Check if continuing training from a checkpoint\n    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):\n    #     try:\n    #         # set global_step to gobal_step of last saved checkpoint from model path\n    #         checkpoint_suffix = args.model_name_or_path.split(\"-\")[-1].split(\"/\")[0]\n    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)\n    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)\n    #         logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n    #         logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n    #     except ValueError:\n    #         logger.info(\"  Do not load model from %s, restart training\" % args.model_name_or_path)\n\n    # model_to_resize = model.module if hasattr(model, \"module\") else model  # Take care of distributed/parallel training\n    # model_to_resize.resize_token_embeddings(len(tokenizer))\n\n    model.zero_grad()\n    train_iterator = trange(\n        epochs_trained, int(args.num_train_epochs), desc=\"Epoch\", disable=args.gpu != 0\n    )\n    for epoch in train_iterator:\n        epoch_iterator = tqdm(train_dataloader, desc=\"Iteration\", disable=args.gpu != 0)\n        tr_loss, logging_loss = 0.0, 0.0\n        model.zero_grad()       # Support of accumulating gradients\n        for step, batch in enumerate(epoch_iterator):\n            inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)\n            inputs = inputs.to(args.device)\n            labels = labels.to(args.device)\n            # If some of the input is padded, then the attention mask is needed\n            attention_mask = (inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0\n            if attention_mask.all():\n                attention_mask = None\n\n            if epoch == 0 and step < 3 and args.gpu == 0:\n                print(inputs.shape)\n                print(inputs[0])\n                print(tokenizer.convert_ids_to_tokens(inputs[0].cpu().numpy()))\n                print(labels[0])\n                print(attention_mask)\n\n            model.train()\n            outputs = model(inputs,\n                            attention_mask=attention_mask,\n                            masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)\n            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)\n\n            if args.gradient_accumulation_steps > 1:\n                loss = loss / args.gradient_accumulation_steps\n\n            if args.fp16:\n                with amp.scale_loss(loss, optimizer) as scaled_loss:\n                    scaled_loss.backward()\n            else:\n                loss.backward()\n\n            tr_loss += loss.item()\n            if (step + 1) % args.gradient_accumulation_steps == 0:\n                if args.max_grad_norm > 0.:\n                    if args.fp16:\n                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)\n                
    else:\n                        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)\n                optimizer.step()\n                scheduler.step()  # Update learning rate schedule\n                model.zero_grad()\n                global_step += 1\n\n                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:\n                    # Log metrics\n                    tb_writer.add_scalar(\"lr\", scheduler.get_lr()[0], global_step)\n                    if args.fp16:\n                        try:\n                            from apex.amp import _amp_state\n                            tb_writer.add_scalar(\"loss_scale\", _amp_state.loss_scalers[0]._loss_scale, global_step)\n                            tb_writer.add_scalar(\"scaled_loss\", scaled_loss.item(), global_step)\n                        except ImportError:\n                            logger.warning(\"Cannot import apex.amp._amp_state, \"\n                                           \"will not log the loss_scale\")\n                    if args.max_grad_norm > 0.:  # Only log the grad norm when clipping is enabled\n                        tb_writer.add_scalar(\"grad_norm\", total_norm, global_step)\n                    tb_writer.add_scalar(\"loss\", (tr_loss - logging_loss) / args.logging_steps, global_step)\n                    logging_loss = tr_loss\n\n            if args.max_steps > 0 and global_step >= args.max_steps:\n                break\n\n        # Save it each epoch\n        if args.gpu == 0:\n            # Save checkpoints\n            checkpoint_name = \"checkpoint-epoch%04d\" % epoch\n            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)\n            last_path = os.path.join(args.output_dir, 'checkpoint-last')\n            # if os.path.exists(last_path):\n            #     print(last_path)\n            #     os.remove(last_path)\n            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)\n\n            # Evaluate the model\n            logger.info(\" Training loss of Epoch %d: %0.4f\" % (epoch, tr_loss / (step + 1)))\n            logger.info(\" Evaluation Results of Epoch %d: \" % epoch)\n            results = evaluate(args, model, tokenizer)\n            for key, value in results.items():\n                tb_writer.add_scalar(\"eval_{}\".format(key), value, global_step)\n                logger.info(\"\\t %s: %0.4f\" % (key, value))\n            output_eval_file = os.path.join(args.output_dir, checkpoint_name, \"eval_results.json\")\n            json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)\n\n        if args.max_steps > 0 and global_step >= args.max_steps:\n            epoch_iterator.close()\n            train_iterator.close()\n            break\n\n    if args.gpu == 0:\n        tb_writer.close()\n\n\ndef save_model(args, name, model, tokenizer, optimizer, scheduler):\n    # Save model checkpoint\n    output_dir = os.path.join(args.output_dir, name)\n    os.makedirs(output_dir, exist_ok=True)\n    model_to_save = (\n        model.module if hasattr(model, \"module\") else model\n    )  # Take care of distributed/parallel training\n    model_to_save.save_pretrained(output_dir)\n    tokenizer.save_pretrained(output_dir)\n\n    torch.save(args, os.path.join(output_dir, \"training_args.bin\"))\n    logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n    # torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n    # 
torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n    # logger.info(\"Saving optimizer and scheduler states to %s\", output_dir)\n\n\ndef evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix=\"\") -> Dict:\n    # Loop to handle MNLI double evaluation (matched, mis-matched)\n    eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)\n\n    args.eval_batch_size = args.per_gpu_eval_batch_size\n    # Note that DistributedSampler samples randomly\n\n    def collate(examples: List[torch.Tensor]):\n        if tokenizer._pad_token is None:\n            return pad_sequence(examples, batch_first=True)\n        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)\n\n    eval_sampler = SequentialSampler(eval_dataset)\n    eval_dataloader = DataLoader(\n        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate\n    )\n\n    # Eval!\n    logger.info(\"***** Running evaluation {} *****\".format(prefix))\n    logger.info(\"  Num examples = %d\", len(eval_dataset))\n    logger.info(\"  Batch size = %d\", args.eval_batch_size)\n    eval_loss = 0.0\n    nb_eval_steps = 0\n    model.eval()\n\n    for batch in tqdm(eval_dataloader, desc=\"Evaluating\"):\n        inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)\n        inputs = inputs.to(args.device)\n        labels = labels.to(args.device)\n        # If some of the input is padded, then the attention mask is needed\n        attention_mask = (inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0\n        if attention_mask.all():\n            attention_mask = None\n\n        with torch.no_grad():\n            outputs = model(inputs, attention_mask=attention_mask, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)\n            lm_loss = outputs[0]\n            eval_loss += lm_loss.mean().item()\n        nb_eval_steps += 1\n\n    eval_loss = eval_loss / nb_eval_steps\n    perplexity = torch.exp(torch.tensor(eval_loss)).item()\n\n    result = {\"perplexity\": perplexity}\n\n    return result\n\n\ndef is_port_in_use(port):\n    import socket\n    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:\n        return s.connect_ex(('localhost', port)) == 0\n\n\ndef main():\n    args = process_args()\n    os.environ['MASTER_ADDR'] = '127.0.0.1'\n    port = 9595\n    while is_port_in_use(port):\n        port += 1\n    print(\"Use port\", port)\n    os.environ['MASTER_PORT'] = str(port)\n\n    # Using all available gpus for multi-processing distributed\n    args.gpus = torch.cuda.device_count()\n    print(\"Use gpus \", list(range(args.gpus)))\n    args.world_size = args.gpus * args.nodes\n    mp.spawn(setup, nprocs=args.gpus, args=(args,))\n\n\ndef setup(gpu, args):\n    if args.should_continue:\n        args.model_name_or_path = 'checkpoint-last'\n\n    # Setup CUDA, GPU & distributed training\n    torch.cuda.set_device(gpu)\n    device = torch.device(\"cuda\", gpu)\n    args.gpu = gpu                                  # Local device id.\n    args.device = device                            # Local device object.\n    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.\n    torch.distributed.init_process_group(\n        backend=\"nccl\",\n        init_method='env://',\n        world_size=args.world_size,\n        rank=args.rank\n    )\n\n    # Setup logging\n    logging.basicConfig(\n        format=\"%(asctime)s - %(levelname)s - 
%(name)s -   %(message)s\",\n        datefmt=\"%m/%d/%Y %H:%M:%S\",\n        level=logging.INFO if args.gpu == 0 else logging.WARN,\n    )\n    logger.warning(\n        \"Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s\",\n        args.gpu, args.gpus, args.fp16,\n    )\n\n    # Set seed\n    set_seed(args)\n\n    # Load pretrained model and tokenizer.\n    # Barrier to make sure only the first process in distributed training\n    # downloads the model & vocab\n    if gpu != 0:\n        torch.distributed.barrier()\n\n    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]\n\n    # Get Config\n    if args.config_name:\n        config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir)\n    elif args.model_name_or_path:\n        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)\n    else:\n        raise ValueError(\n            \"No default config is supported here. Please use --config_name or --model_name_or_path\"\n        )\n\n    # Get Tokenizer\n    if args.tokenizer_name:\n        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)\n        # BERT always needs lower cased tokens.\n        assert tokenizer.init_kwargs.get(\"do_lower_case\", False)\n    elif args.model_name_or_path:\n        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)\n    else:\n        raise ValueError(\n            \"You are instantiating a new {} tokenizer. This is not supported, \"\n            \"but you can do it from another script, save it, \"\n            \"and load it from here, using --tokenizer_name\".format(tokenizer_class.__name__)\n        )\n\n    assert args.block_size <= tokenizer.max_len\n\n    if args.model_name_or_path:\n        model = model_class.from_pretrained(\n            args.model_name_or_path,\n            from_tf=bool(\".ckpt\" in args.model_name_or_path),\n            config=config,\n            cache_dir=args.cache_dir,\n        )\n    else:\n        logger.info(\"Training new model from scratch\")\n        model = model_class(config=config)\n\n    model.to(args.device)\n\n    # End of barrier: the first process has loaded everything, so the waiting processes can proceed\n    if gpu == 0:\n        torch.distributed.barrier()\n\n    logger.info(\"Training/evaluation parameters %s\", args)\n\n    # Training\n    if args.do_train:\n        # Barrier to make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache\n        if gpu != 0:\n            torch.distributed.barrier()\n        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)\n        if gpu == 0:\n            torch.distributed.barrier()\n\n        train(args, train_dataset, model, tokenizer)\n\n    # Evaluation\n    if args.do_eval and gpu == 0:\n        result = evaluate(args, model, tokenizer)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
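Both `vlm/run_lm_distributed.py` above and `vlm/run_vlm_distributed.py` below apply the BERT-style 80%/10%/10% corruption inside `mask_tokens`. A standalone sketch of the same sampling rule on toy ids, so the probabilities can be inspected without a tokenizer; `mask_id` and `vocab_size` are made-up stand-ins for the tokenizer's real values:

```python
import torch

def toy_mask(inputs, mask_id=103, vocab_size=1000, mlm_probability=0.15):
    """Select ~15% of positions; of those, 80% -> mask_id, 10% -> random id, 10% -> unchanged."""
    labels = inputs.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # loss is only computed on the selected positions
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    inputs[replaced] = mask_id
    # half of the remaining 20% (i.e., 10% overall) becomes a random token
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    inputs[randomized] = torch.randint(vocab_size, labels.shape)[randomized]
    return inputs, labels

x, y = toy_mask(torch.randint(5, 1000, (2, 16)))
print(x)
print(y)
```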
  {
    "path": "vlm/run_vlm_distributed.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\nFine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).\nGPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned\nusing a masked language modeling (MLM) loss.\n\"\"\"\n\nfrom datetime import datetime\nimport json\nimport logging\nimport os\nimport random\nimport sys\nimport time\nfrom typing import Dict, List, Tuple\n\nimport numpy as np\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport torch.multiprocessing as mp\nfrom torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler\nfrom torch.utils.data.distributed import DistributedSampler\nfrom tqdm import tqdm, trange\nfrom transformers import (\n    MODEL_WITH_LM_HEAD_MAPPING,\n    WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelWithLMHead,\n    AutoTokenizer,\n    BertConfig,\n    BertForMaskedLM,\n    BertTokenizer,\n    CamembertConfig,\n    CamembertForMaskedLM,\n    CamembertTokenizer,\n    DistilBertConfig,\n    DistilBertForMaskedLM,\n    DistilBertTokenizer,\n    GPT2Config,\n    GPT2LMHeadModel,\n    GPT2Tokenizer,\n    OpenAIGPTConfig,\n    OpenAIGPTLMHeadModel,\n    OpenAIGPTTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    RobertaConfig,\n    RobertaForMaskedLM,\n    RobertaTokenizer,\n    get_linear_schedule_with_warmup,\n)\n\nsys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nfrom vlm.data import CoLDataset, get_voken_feats\nfrom vlm.param import process_args\nfrom vlm.model import CoLBertConfig, CoLwithBert\n\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\nexcept ImportError:\n    from tensorboardX import SummaryWriter\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_CLASSES = {\n    \"gpt2\": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),\n    \"openai-gpt\": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),\n    \"bert\": (CoLBertConfig, CoLwithBert, BertTokenizer),\n    \"roberta\": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),\n    \"distilbert\": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),\n    \"camembert\": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),\n}\n\n\ndef load_and_cache_examples(args, tokenizer, evaluate=False):\n    file_path = args.eval_data_file if evaluate else args.train_data_file\n    return CoLDataset(file_path, args.tokenizer_name, tokenizer, args.block_size,\n                      split_sent=args.split_sent, voken_dir=args.voken_dir,\n                      suffix=args.voken_suffix,\n                      verbose=(args.gpu == 0),\n                      voken_ablation=args.voken_ablation)\n\n\ndef set_seed(args):\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    torch.manual_seed(args.seed)\n\n\ndef 
mask_tokens(tokens: torch.Tensor, vokens: torch.Tensor, tokenizer: PreTrainedTokenizer, args) \\\n        -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n    \"\"\" Notice that this function has the side effect of modifying the Tensor tokens in place.\n    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. \"\"\"\n\n    if tokenizer.mask_token is None:\n        raise ValueError(\n            \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n        )\n\n    labels = tokens.clone()\n    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, which defaults to 0.15 in BERT/RoBERTa)\n    probability_matrix = torch.full(labels.shape, args.mlm_probability)\n    special_tokens_mask = [\n        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n    ]\n    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n    if tokenizer._pad_token is not None:\n        padding_mask = labels.eq(tokenizer.pad_token_id)\n        probability_matrix.masked_fill_(padding_mask, value=0.0)\n    masked_indices = torch.bernoulli(probability_matrix).bool()\n    labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n    if args.voken_labels == 'mask':\n        vokens[~masked_indices] = -100\n    elif args.voken_labels == 'nonmask':\n        vokens[masked_indices] = -100\n    elif args.voken_labels == 'all':\n        pass\n    else:\n        # A bare `assert <non-empty string>` is always true and would never fire,\n        # so raise explicitly for unsupported options.\n        raise ValueError(\"Do not support the voken loss of type %s\" % args.voken_labels)\n\n    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n    tokens[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)\n\n    # 10% of the time, we replace masked input tokens with a random word\n    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)\n    tokens[indices_random] = random_words[indices_random]\n\n    # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n    return tokens, labels, vokens\n\n\ndef train(args, train_dataset: CoLDataset, valid_dataset: CoLDataset,\n          model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:\n    \"\"\" Train the model \"\"\"\n    set_seed(args)  # Added here for reproducibility\n\n    if args.gpu == 0:\n        current_time = datetime.now().strftime('%b%d_%H-%M-%S')\n        tb_writer = SummaryWriter(args.output_dir + '/runs/' + current_time)\n\n    args.train_batch_size = args.per_gpu_train_batch_size\n\n    def col_collate(examples):\n        tokens, vokens = zip(*examples)\n        if tokenizer._pad_token is None:\n            tokens = pad_sequence(tokens, batch_first=True)\n        else:\n            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)\n        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)\n        return tokens, vokens\n\n    if args.shuffle:\n        logger.info(f\"Shuffle the dataset in training, \"\n                       f\"GPU: {args.gpu}, \"\n                       f\"Rank: {args.rank}, \"\n                       f\"Total: 
{args.world_size}\")\n    train_sampler = DistributedSampler(\n        train_dataset,\n        num_replicas=args.world_size,\n        rank=args.rank,\n        shuffle=args.shuffle,\n    )\n    train_dataloader = DataLoader(\n        train_dataset, sampler=train_sampler, shuffle=False, num_workers=0,\n        batch_size=args.train_batch_size, collate_fn=col_collate, pin_memory=True\n    )\n\n    if args.max_steps > 0:\n        t_total = args.max_steps\n        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1\n        # args.num_train_epochs = 9595\n    else:\n        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs\n\n    # Prepare optimizer and schedule (linear warmup and decay)\n    if args.lamb:\n        no_decay = ['bias', 'gamma', 'beta', 'LayerNorm']\n    else:\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n    optimizer_grouped_parameters = [\n        {\n            \"params\": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],\n            \"weight_decay\": args.weight_decay,\n        },\n        {\n            \"params\": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],\n            \"weight_decay\": 0.0,\n        },\n    ]\n    if args.lamb:\n        logger.info(f\"Using LAMB Optimizer with max grad norm {args.max_grad_norm}\")\n        import apex\n        optimizer = apex.optimizers.FusedLAMB(\n            optimizer_grouped_parameters,\n            lr=args.learning_rate,\n            eps=args.adam_epsilon,\n            max_grad_norm=args.max_grad_norm\n        )\n    else:\n        optimizer = AdamW(optimizer_grouped_parameters,\n                          lr=args.learning_rate,\n                          #betas=(0.9, 0.98),\n                          eps=args.adam_epsilon)\n    if args.gpu == 0:\n        print(f\"Optimized with lr: {optimizer.defaults['lr']}, total steps: {t_total},\"\n              f\" warmup steps: {args.warmup_steps}, epsilon {optimizer.defaults['eps']},\"\n              f\" beta: {optimizer.defaults['betas']}, weight decay {args.weight_decay}.\")\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total\n    )\n\n    # Check if saved optimizer or scheduler states exist\n    if (\n        args.model_name_or_path\n        and os.path.isfile(os.path.join(args.model_name_or_path, \"optimizer.pt\"))\n        and os.path.isfile(os.path.join(args.model_name_or_path, \"scheduler.pt\"))\n    ):\n        # Load in optimizer and scheduler states\n        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"optimizer.pt\")))\n        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, \"scheduler.pt\")))\n\n    if args.fp16:\n        try:\n            from apex import amp\n        except ImportError:\n            raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)\n        from apex.parallel import DistributedDataParallel as DDP\n        model = DDP(model)\n    else:\n        model = torch.nn.parallel.DistributedDataParallel(\n            model, device_ids=[args.gpu], find_unused_parameters=True\n        )\n\n    # Allow not calculating the lm heads.\n    if args.mlm_ratio == 0.:\n        model.lm_head = None\n\n\n    # Train!\n    
logger.info(\"***** Running training *****\")\n    logger.info(\"  Num examples = %d\", len(train_dataset))\n    logger.info(\"  Num Epochs = %d\", args.num_train_epochs)\n    logger.info(\"  Instantaneous batch size per GPU = %d\", args.per_gpu_train_batch_size)\n    logger.info(\n        \"  Total train batch size (w. distributed & accumulation) = %d\",\n        args.train_batch_size\n        * args.gradient_accumulation_steps\n        * args.world_size\n    )\n    logger.info(\"  Gradient Accumulation steps = %d\", args.gradient_accumulation_steps)\n    logger.info(\"  Total optimization steps = %d\", t_total)\n\n    global_step = 0\n    epochs_trained = 0\n    # Check if continuing training from a checkpoint\n    # if args.model_name_or_path and os.path.exists(args.model_name_or_path):\n    #     try:\n    #         # set global_step to gobal_step of last saved checkpoint from model path\n    #         checkpoint_suffix = args.model_name_or_path.split(\"-\")[-1].split(\"/\")[0]\n    #         epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)\n    #         steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)\n    #         logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n    #         logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n    #     except ValueError:\n    #         logger.info(\"  Do not load model from %s, restart training\" % args.model_name_or_path)\n\n    model_to_resize = model.module if hasattr(model, \"module\") else model  # Take care of distributed/parallel training\n    assert model_to_resize.config.vocab_size == len(tokenizer)\n    # model_to_resize.resize_token_embeddings(len(tokenizer))\n\n    model.zero_grad()\n    train_iterator = trange(\n        epochs_trained, int(args.num_train_epochs), desc=\"Epoch\", disable=args.gpu != 0\n    )\n    set_seed(args)  # Added here for reproducibility\n    LOSS_NAMES = ['token_loss', 'voken_loss', 'total_loss']\n    for epoch in train_iterator:\n        epoch_iterator = tqdm(train_dataloader, desc=\"Iteration\", disable=args.gpu != 0)\n        tr_loss, logging_loss = np.zeros(len(LOSS_NAMES)), 0.0\n        model.zero_grad()\n        for step, (tokens, vokens) in enumerate(epoch_iterator):\n            token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)\n            token_inputs = token_inputs.to(args.device)\n            token_labels = token_labels.to(args.device) if args.mlm_ratio != 0. 
else None\n            voken_labels = voken_labels.to(args.device)\n            # If some of the input is padded, then the attention mask is needed\n            attention_mask = (token_inputs != tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0\n            if attention_mask.all():\n                attention_mask = None\n\n            if epoch == 0 and step < 3 and args.gpu == 0:\n                print()\n                print(\"Token inputs:\", token_inputs.shape, token_inputs[0])\n                print(\"Token inputs (in str): \", tokenizer.convert_ids_to_tokens(token_inputs[0].cpu().numpy()))\n                print(\"Attention Mask:\", attention_mask)\n                print(\"Token Labels: \", token_labels[0] if token_labels is not None else token_labels)\n                print(\"Token Labels (in str): \", tokenizer.convert_ids_to_tokens(token_labels[0].cpu().numpy()) if token_labels is not None else token_labels)\n                print(\"Voken Labels: \", voken_labels[0])\n                print()\n\n            model.train()\n            outputs = model(token_inputs,\n                            attention_mask=attention_mask,\n                            masked_lm_labels=token_labels,\n                            voken_labels=voken_labels)\n            voken_loss = outputs[0]\n            token_loss = outputs[1]\n\n            if args.mlm_ratio == 0.:\n                loss = voken_loss\n            else:\n                loss = voken_loss + args.mlm_ratio * token_loss\n\n            if args.gradient_accumulation_steps > 1:\n                loss = loss / args.gradient_accumulation_steps\n\n            if args.fp16:\n                with amp.scale_loss(loss, optimizer) as scaled_loss:\n                    scaled_loss.backward()\n            else:\n                loss.backward()\n\n            # print(f\"GPU: {args.gpu}, Global Step: {global_step + 1}, \"\n            #       f\"Step: {step}, \"\n            #       f\"Range: {train_dataset.get_item_info(step * args.world_size + args.gpu)}, \"\n            #       f\"Loss: {loss.item()}, \"\n            #       f\"Scaled Loss: {scaled_loss.item()}\")\n\n            tr_loss += np.array((token_loss.item() / args.gradient_accumulation_steps,\n                                 voken_loss.item() / args.gradient_accumulation_steps,\n                                 loss.item()))\n\n            if (step + 1) % args.gradient_accumulation_steps == 0:\n                if args.max_grad_norm > 0. and not args.lamb:\n                    # Only clip the grad when it is valid and not using LAMB optimizer,\n                    # because the LAMB optimizer already apply grad clipping\n                    if args.fp16:\n                        total_norm = torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)\n                    else:\n                        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)\n                elif args.max_grad_norm <= 0. 
and step <= args.gradient_accumulation_steps:\n                    logger.warning(\"The gradient has not been clipped because \"\n                                   \"max_grad_norm is set to %0.2f\" % args.max_grad_norm)\n                optimizer.step()\n                scheduler.step()  # Update learning rate schedule\n                model.zero_grad()\n                global_step += 1\n\n                if args.gpu == 0 and args.logging_steps > 0 and (step + 1) % args.logging_steps == 0:\n                    # Log metrics\n                    tb_writer.add_scalar(\"lr\", scheduler.get_lr()[0], global_step)\n                    if args.fp16:\n                        try:\n                            from apex.amp import _amp_state\n                            tb_writer.add_scalar(\"loss_scale\", _amp_state.loss_scalers[0]._loss_scale, global_step)\n                            tb_writer.add_scalar(\"scaled_loss\", scaled_loss.item(), global_step)\n                        except ImportError:\n                            logger.warning(\"Cannot import apex.amp._amp_state, \"\n                                           \"will not log the loss_scale\")\n                    if args.max_grad_norm > 0. and not args.lamb:  # Only log the grad norm when clipping is enabled\n                        tb_writer.add_scalar(\"grad_norm\", total_norm, global_step)\n                    interval_loss = (tr_loss - logging_loss) / args.logging_steps\n                    for loss_idx, loss_name in enumerate(LOSS_NAMES):\n                        tb_writer.add_scalar(loss_name, interval_loss[loss_idx], global_step)\n                    logging_loss = tr_loss.copy()\n\n            if args.max_steps > 0 and global_step >= args.max_steps:\n                break\n\n            # if step == 200:\n            #     break\n            #\n        # Save it each epoch\n        if args.gpu == 0:\n            # Save checkpoints\n            checkpoint_name = \"checkpoint-epoch%04d\" % epoch\n            save_model(args, checkpoint_name, model, tokenizer, optimizer, scheduler)\n\n            # last_path = os.path.join(args.output_dir, 'checkpoint-last')\n            # if os.path.exists(last_path):\n            #     os.remove(last_path)\n            # os.symlink(os.path.join(args.output_dir, checkpoint_name), last_path)\n\n            # Evaluate the model\n            for loss_idx, loss_name in enumerate(LOSS_NAMES):\n                logger.info(\" Training %s of Epoch %d: %0.4f\" % (\n                    loss_name, epoch, tr_loss[loss_idx] / len(train_dataloader)))\n\n            if args.do_eval:\n                logger.info(\" Evaluation Results of Epoch %d: \" % epoch)\n                old_eval_batch_size = args.per_gpu_eval_batch_size\n                while args.per_gpu_eval_batch_size > 0:\n                    try:\n                        results = evaluate(args, valid_dataset, model, tokenizer)\n                        break\n                    except RuntimeError as e:\n                        args.per_gpu_eval_batch_size = int(args.per_gpu_eval_batch_size / 2)\n                        print(\"HALVE THE BATCH SIZE in EVAL.\")\n                        if args.per_gpu_eval_batch_size == 0:\n                            raise e\n                        time.sleep(5)\n                args.per_gpu_eval_batch_size = old_eval_batch_size\n\n                for key, value in results.items():\n                    tb_writer.add_scalar(\"eval_{}\".format(key), value, global_step)\n                    logger.info(\"\\t %s: 
%0.4f\" % (key, value))\n                tb_writer.add_scalar(\"epoch\", epoch, global_step)\n                output_eval_file = os.path.join(args.output_dir, checkpoint_name, \"eval_results.json\")\n                json.dump(results, open(output_eval_file, 'w'), sort_keys=True, indent=4)\n            # Currently, only GPU 0 is responsible for the evaluation.\n            # torch.cuda.empty_cache()\n            # torch.distributed.barrier()\n        else:\n            pass\n            # torch.cuda.empty_cache()\n            # torch.distributed.barrier()\n\n        if args.max_steps > 0 and global_step >= args.max_steps:\n            epoch_iterator.close()\n            train_iterator.close()\n            break\n\n    if args.gpu == 0:\n        tb_writer.close()\n\n\ndef save_model(args, name, model, tokenizer, optimizer, scheduler):\n    # Save model checkpoint\n    output_dir = os.path.join(args.output_dir, name)\n    os.makedirs(output_dir, exist_ok=True)\n    model_to_save = (\n        model.module if hasattr(model, \"module\") else model\n    )  # Take care of distributed/parallel training\n    model_to_save.save_pretrained(output_dir)\n    tokenizer.save_pretrained(output_dir)\n\n    torch.save(args, os.path.join(output_dir, \"training_args.bin\"))\n    logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n    # torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n    # torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n    # logger.info(\"Saving optimizer and scheduler states to %s\", output_dir)\n\n\ndef evaluate(args, eval_dataset: CoLDataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix=\"\") -> Dict:\n    torch.cuda.empty_cache() \n    # # Loop to handle MNLI double evaluation (matched, mis-matched)\n    # eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)\n\n    args.eval_batch_size = args.per_gpu_eval_batch_size\n    # Note that DistributedSampler samples randomly\n\n    def col_collate(examples):\n        tokens, vokens = zip(*examples)\n        if tokenizer._pad_token is None:\n            tokens = pad_sequence(tokens, batch_first=True)\n        else:\n            tokens = pad_sequence(tokens, batch_first=True, padding_value=tokenizer.pad_token_id)\n        vokens = pad_sequence(vokens, batch_first=True, padding_value=-100)\n        return tokens, vokens\n\n    eval_sampler = SequentialSampler(eval_dataset)\n    eval_dataloader = DataLoader(\n        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=col_collate\n    )\n\n    # Eval!\n    logger.info(\"***** Running evaluation {} *****\".format(prefix))\n    logger.info(\"  Num examples = %d\", len(eval_dataset))\n    logger.info(\"  Batch size = %d\", args.eval_batch_size)\n    total_token_loss = 0.0\n    total_voken_loss = 0.0\n    nb_eval_steps = 0\n    model.eval()\n\n    for tokens, vokens in tqdm(eval_dataloader, desc=\"Evaluating\"):\n        token_inputs, token_labels, voken_labels = mask_tokens(tokens, vokens, tokenizer, args)\n        token_inputs = token_inputs.to(args.device)\n        token_labels = token_labels.to(args.device) if args.mlm_ratio != 0 else None\n        voken_labels = voken_labels.to(args.device)\n        # If some of the input is padded, then the attention mask is needed\n        attention_mask = (token_inputs != tokenizer.pad_token_id)  # word_tokens --> 1, pad_token --> 0\n        if attention_mask.all():\n            attention_mask = None\n\n        with 
torch.no_grad():\n            outputs = model(token_inputs,\n                            attention_mask=attention_mask,\n                            masked_lm_labels=token_labels,\n                            voken_labels=voken_labels)\n            voken_loss = outputs[0]\n            token_loss = outputs[1]\n\n            total_voken_loss += voken_loss.item()\n            total_token_loss += token_loss.item()\n\n        nb_eval_steps += 1\n\n    total_token_loss = total_token_loss / nb_eval_steps\n    perplexity = torch.exp(torch.tensor(total_token_loss)).item()\n\n    result = {\"perplexity\": perplexity,\n              \"voken_loss\": total_voken_loss / nb_eval_steps}\n    torch.cuda.empty_cache() \n\n    return result\n\n\ndef is_port_in_use(port):\n    import socket\n    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:\n        return s.connect_ex(('localhost', port)) == 0\n\n\ndef main():\n    args = process_args()\n    os.environ['MASTER_ADDR'] = '127.0.0.1'\n    port = 9595\n    while is_port_in_use(port):\n        port += 1\n    print(\"Use port\", port)\n    os.environ['MASTER_PORT'] = str(port)\n\n    # Using all available gpus for multi-processing distributed\n    args.gpus = torch.cuda.device_count()\n    print(\"Use gpus \", list(range(args.gpus)))\n    args.world_size = args.gpus * args.nodes\n    mp.spawn(setup, nprocs=args.gpus, args=(args,))\n\n\ndef setup(gpu, args):\n    if args.should_continue:\n        args.model_name_or_path = 'checkpoint-last'\n\n    # Setup CUDA, GPU & distributed training\n    torch.cuda.set_device(gpu)\n    device = torch.device(\"cuda\", gpu)\n    args.gpu = gpu                                  # Local device id.\n    args.device = device                            # Local device object.\n    args.rank = args.nr * args.gpus + gpu           # The gpu id in the world.\n    torch.distributed.init_process_group(\n        backend=\"nccl\",\n        init_method='env://',\n        world_size=args.world_size,\n        rank=args.rank\n    )\n\n    # Setup logging\n    logging.basicConfig(\n        format=\"%(asctime)s - %(levelname)s - %(name)s -   %(message)s\",\n        datefmt=\"%m/%d/%Y %H:%M:%S\",\n        level=logging.INFO if args.gpu == 0 else logging.WARN,\n    )\n    logger.warning(\n        \"Process GPU: %s, num_of_total_GPUs: %s, distributed training: True, 16-bits training: %s\",\n        args.gpu, args.gpus, args.fp16,\n    )\n\n    # Set seed\n    set_seed(args)\n\n    # Load pretrained model and token\n    # Barrier to make sure only the first process in distributed training\n    # download model & vocabizer\n    if gpu != 0:\n        torch.distributed.barrier()\n\n    # Use self-defined models, thus avoiding Auto***.\n    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]\n\n    # Next, we will initialize the training process in the following order:\n    #   1. tokenizer --> 2. dataset --> 3. config --> 4. model.\n    # because A) dataset relies on the tokenizer.special_tokens.\n    #         B) config relies on the dataset.voken_size.\n\n    # Get Tokenizer\n    if args.tokenizer_name:\n        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)\n    elif args.model_name_or_path:\n        tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)\n    else:\n        raise ValueError(\n            \"You are instantiating a new {} tokenizer. 
This is not supported, \"\n            \"but you can do it from another script, save it,\"\n            \"and load it from here, using --tokenizer_name\".format(tokenizer_class.__name__)\n        )\n\n    assert args.block_size <= tokenizer.max_len\n\n    # Barrier to make sure only the first process in distributed training process the dataset,\n    # and the others will use the cache\n    if gpu != 0:\n        torch.distributed.barrier()\n    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)\n    valid_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)\n    if gpu == 0:\n        torch.distributed.barrier()\n\n    # Assert the vokens are equal in valid and eval.\n    valid_dataset.assert_equal_vokens(train_dataset)\n\n    config_kwargs = {}\n    if args.do_voken_reg or args.do_voken_ctr:\n        assert args.voken_feat_dir is not None\n        voken_feats = get_voken_feats(train_dataset, args.voken_feat_dir)\n        config_kwargs['voken_dim'] = len(voken_feats[0])\n        if gpu == 0:\n            logger.info(f\"Load voken feats from {args.voken_feat_dir}\"\n                        f\"with {len(voken_feats)} features and dimension {len(voken_feats[0])}\")\n\n    # Get Config\n    if args.config_name:\n        config = config_class.from_pretrained(\n            args.config_name,\n            cache_dir=args.cache_dir,\n            voken_size=train_dataset.voken_size,\n            do_voken_cls=args.do_voken_cls,\n            do_voken_reg=args.do_voken_reg,\n            do_voken_ctr=args.do_voken_ctr,\n            shared_head=args.shared_head,\n            verbose=(args.gpu == 0),\n            **config_kwargs\n        )\n    elif args.model_name_or_path:\n        config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)\n    else:\n        raise ValueError(\n            \"Why do you want the default config?? 
Please use --config_name or --model_name_or_path\"\n        )\n\n    if args.model_name_or_path:\n        logger.info(f\"Training model from the weight {args.model_name_or_path}.\")\n        model = model_class.from_pretrained(\n            args.model_name_or_path,\n            from_tf=bool(\".ckpt\" in args.model_name_or_path),\n            config=config,\n            cache_dir=args.cache_dir,\n        )\n    else:\n        logger.info(\"Training new model from scratch\")\n        model = model_class(config=config)\n\n    if args.do_voken_reg or args.do_voken_ctr:\n        voken_feats = torch.tensor(voken_feats)\n        model.init_voken_feat_emb(voken_feats)\n\n    model.to(args.device)\n\n    # End of barrier to make sure only the first process waiting other processes\n    if gpu == 0:\n        torch.distributed.barrier()\n\n    if args.model_name_or_path:\n        if gpu == 0:\n            logger.info(\"Evaluate the performance of the loaded model.\")\n            results = evaluate(args, valid_dataset, model, tokenizer)\n            for key, value in results.items():\n                logger.info(\"\\t %s: %0.4f\" % (key, value))\n            torch.distributed.barrier()\n        else:\n            torch.distributed.barrier()\n\n    logger.info(\"Training/evaluation parameters %s\", args)\n\n    # Training\n    if args.do_train:\n        train(args, train_dataset, valid_dataset, model, tokenizer)\n\n    # Evaluation\n    if args.do_eval and gpu == 0:\n        results = evaluate(args, valid_dataset, model, tokenizer)\n        for key, value in results.items():\n            logger.info(\"\\t %s: %0.4f\" % (key, value))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
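  {
    "path": "vlm/sketches/loss_mixing_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# It mirrors two details of the VLM training loop above:\n#   1. the attention mask is derived from the pad tokens and dropped when nothing is padded;\n#   2. the voken loss and the (optional) MLM loss are mixed with --mlm_ratio and scaled\n#      for gradient accumulation.\nimport torch\n\nPAD_TOKEN_ID = 0    # assumption: the pad id of the tokenizer\n\n\ndef build_attention_mask(token_inputs):\n    \"\"\"Return None when no token is padded, so the model can skip the masking.\"\"\"\n    attention_mask = (token_inputs != PAD_TOKEN_ID)    # word_tokens --> 1, pad_token --> 0\n    return None if attention_mask.all() else attention_mask\n\n\ndef mix_losses(voken_loss, token_loss, mlm_ratio, gradient_accumulation_steps=1):\n    \"\"\"loss = voken_loss + mlm_ratio * token_loss, as in the training loop above.\"\"\"\n    loss = voken_loss if mlm_ratio == 0. else voken_loss + mlm_ratio * token_loss\n    return loss / gradient_accumulation_steps\n\n\nif __name__ == '__main__':\n    tokens = torch.tensor([[5, 8, 9, PAD_TOKEN_ID]])\n    print(build_attention_mask(tokens))    # tensor([[ True,  True,  True, False]])\n    print(mix_losses(torch.tensor(2.), torch.tensor(4.), mlm_ratio=0.5))    # tensor(4.)\n"
  },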
  {
    "path": "vlm/show_glue_results_epochs.py",
    "content": "import os\nfrom pathlib import Path\n\nroot = Path(\n    'snap'\n)\n\ntask2major = {\n    'QQP': 'acc_and_f1',\n    'STS-B': 'corr',\n    'MRPC': 'acc_and_f1',\n}\n\n# The tasks sorted by the amount of data\nall_tasks = [\n    # 'WNLI',\n    'RTE',\n    'MRPC',\n    'STS-B',\n    'CoLA',\n    'SST-2',\n    'QNLI',\n    'QQP',\n    'MNLI',\n    'MNLI-MM',\n]\n\n\ndef print_result(glue_dir):\n    print(glue_dir)\n    results = {}\n    for task in glue_dir.iterdir():\n        if task.is_dir():\n            eval_fpath = task / 'eval_results.txt'\n            task_name = task.name\n            if eval_fpath.exists():\n                with eval_fpath.open() as f:\n                    for line in f:\n                        metric, value = line.split('=')\n                        metric = metric.strip()\n                        value = float(value.strip())\n                        if task_name in task2major:\n                            if metric == task2major[task_name]:\n                                results[task_name] = value\n                        else:\n                            results[task_name] = value\n    if len(results) > 0:\n        # sorted_keys = sorted(list(results.keys()))\n        # for key in sorted_keys:\n        #     print(\"%8s\" % key, end='')\n        # print(\"%8s\" % 'GLUE', end='')\n        # print()\n        # for key in sorted_keys:\n        #     print(\"%8.2f\" % (results[key] * 100.), end='')\n        # print(\"%8.2f\" % (sum(results.values()) * 100. / len(results)), end='')\n        # print()\n        for task in all_tasks:\n            print(\"%8s\" % task, end='')\n        print(\"%8s\" % 'GLUE', end='')\n        print()\n        for task in all_tasks:\n            if task in results:\n                result = results[task]\n                print(\"%8.2f\" % (result * 100), end='')\n            else:\n                print(\" \" * 8, end='')\n        mean = lambda x: sum(x) / max(len(x), 1)\n        avg_result = mean([value for key, value in results.items() if key in all_tasks])\n        print(\"%8.2f\" % (avg_result * 100.), end='')\n        print()\n\n\ndef search(path):\n    def sorted_key(path):\n        try:\n            return path.stat().st_mtime\n        except Exception:\n            return 0.\n    path_list = sorted(\n        path.iterdir(),\n        key=sorted_key\n        # x.name\n    )\n    for subdir in path_list:\n        if subdir.is_dir():\n            if 'glueepoch_' in subdir.name:\n                print_result(subdir)\n            else:\n                search(subdir)\n\nsearch(root)\n"
  },
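  {
    "path": "vlm/sketches/glue_results_parsing_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# show_glue_results_epochs.py expects each GLUE task folder to contain an\n# eval_results.txt with one \"metric = value\" line per metric; for QQP, STS-B, and\n# MRPC only the 'major' metric (acc_and_f1 / corr) enters the printed average.\ntask2major = {'QQP': 'acc_and_f1', 'STS-B': 'corr', 'MRPC': 'acc_and_f1'}\n\nlines = ['acc = 0.91', 'acc_and_f1 = 0.89']    # a toy eval_results.txt for MRPC\nresults = {}\nfor line in lines:\n    metric, value = line.split('=')\n    if metric.strip() == task2major['MRPC']:\n        results['MRPC'] = float(value.strip())\nprint(results)    # {'MRPC': 0.89}\n"
  },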
  {
    "path": "vokenization/__init__.py",
    "content": ""
  },
  {
    "path": "vokenization/common.py",
    "content": "import os\n\n# Name of image sets\nIMAGE_SETS = [\n    'coco_train',\n    'coco_nominival',\n    'coco_minival',\n    'vg_nococo',\n    'cc_train',\n    'cc_valid',\n]\n\n# Root of each dataset\n# CC_ROOT, COCO_ROOT, VG_ROOT should contain the `images` folder\n# CC_ROOT -- images\n#              |-- training\n#                      |-- training_00009486    # Jpeg files but does not have the extension.\n#                      |-- ....\n#              |-- validation\n#                      |-- validation_00009486\n#                      |-- ...\n# CC_ROOT = os.getenv('CC_ROOT', 'data/cc')\n# COCO_ROOT = os.getenv('COCO_ROOT', 'data/mscoco')\n# VG_ROOT = os.getenv('VG_ROOT', 'data/vg')\n# LXRT_ROOT = os.getenv('LXRT_ROOT', 'data/lxrt')\nCC_ROOT = 'data/cc'\nCOCO_ROOT = 'data/mscoco'\nVG_ROOT = 'data/vg'\nLXRT_ROOT = 'data/lxmert'\n\n# THe local directory to save essential image infos\n#       (e.g., image ids for the vokenizer, image paths in this server)\n# LOCAL_DIR\n#   |- images\n#         |- coco_train_ids.txt\n#         |- coco_train_paths.txt\n#         |- cc_train_ids.txt\n#         |- cc_train_paths.txt\n#         |- ..............\n# Running create_image_ids.py will build *_ids.txt\n# Running extract_vision_keys.py will build *_paths.txt\nLOCAL_DIR = 'data/vokenization'\n\n"
  },
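  {
    "path": "vokenization/sketches/metadata_layout_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# It resolves the per-image-set metadata files described in common.py:\n# create_image_ids.py writes {LOCAL_DIR}/images/{img_set}.ids and\n# extract_vision_keys.py writes the row-aligned {img_set}.path next to it.\nimport os\n\nimport common\n\n\ndef metadata_paths(img_set):\n    info_dir = os.path.join(common.LOCAL_DIR, 'images')\n    return (os.path.join(info_dir, img_set + '.ids'),\n            os.path.join(info_dir, img_set + '.path'))\n\n\nif __name__ == '__main__':\n    for img_set in common.IMAGE_SETS:\n        ids_path, paths_path = metadata_paths(img_set)\n        print('%-15s %s (exists=%s)' % (img_set, ids_path, os.path.exists(ids_path)))\n"
  },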
  {
    "path": "vokenization/create_image_ids.py",
    "content": "import json\nimport os\nfrom pathlib import Path\nimport sys\n\n# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nimport common\n\nimgset2lxrtfname = {\n    'coco_train': 'mscoco_train.json',\n    'coco_nominival': 'mscoco_nominival.json',\n    'coco_minival': 'mscoco_minival.json',\n    'vg_nococo': 'vgnococo.json',\n}\n\nimgset2ccfname = {\n    'cc_train': 'training.tsv',\n    'cc_valid': 'validation.tsv'\n}\n\n\ndef write_ids(img_set, img_ids):\n    \"\"\"\n    Write the indexed image ids 'img_ids' for image set 'img_set' to\n    the local file.\n    \"\"\"\n    info_dir = os.path.join(common.LOCAL_DIR, 'images')\n    os.makedirs(info_dir, exist_ok=True)\n    print(\"Write %d image ids for image set %s to %s.\" % (\n        len(img_ids), img_set, os.path.join(info_dir, img_set + '.ids')))\n    ids_path = os.path.join(info_dir, img_set + '.ids')\n    if os.path.exists(ids_path):\n        # If there is an existing ids_path, make sure that they are the same.\n        print(f\"Already exist the image ids for image set {img_set} at path {ids_path}.\")\n        print(\"Now, we want to make sure that they are equal:\")\n        with open(ids_path, 'r') as f:\n            exist_img_ids = list(map(lambda x: x.strip(), f.readlines()))\n        success = True\n        for i, (exist_img_id, img_id) in zip(exist_img_ids):\n            if exist_img_id != img_id:\n                print(f\"The image id at line {i} is different:\")\n                print(f\"\\tIn the file: {exist_img_id}, In this script: {img_id}\")\n                success = False\n        if success:\n            print(\"PASS!\")\n    else:\n        with open(ids_path, 'w') as f:\n            for img_id in img_ids:\n                f.write(img_id + '\\n')\n\n\nfor img_set in common.IMAGE_SETS:\n    if img_set in imgset2lxrtfname:\n        lxrt_path = Path(common.LXRT_ROOT)\n        img_ids = []\n        fname = imgset2lxrtfname[img_set]\n        for datum in json.load((lxrt_path / fname).open()):\n            img_id = datum['img_id']\n            img_ids.append(img_id)\n\n        write_ids(img_set, img_ids)\n\n    if img_set in imgset2ccfname:\n        cc_path = Path(common.CC_ROOT)\n        img_ids = []\n        fname = imgset2ccfname[img_set]\n        if not (cc_path / fname).exists():\n            print(\"No such file\", cc_path / fname)\n            continue\n        for i, line in enumerate((cc_path / fname).open()):\n            sent, img_id = line.split('\\t')\n            img_ids.append(img_id.strip())\n\n        write_ids(img_set, img_ids)\n"
  },
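  {
    "path": "vokenization/sketches/cc_tsv_format_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# The Conceptual Captions tsv files read by create_image_ids.py hold one\n# \"sentence<TAB>image_id\" pair per line, and the ids double as file names\n# (e.g., data/cc/images/training/training_00009486, see the layout in common.py).\nline = 'a dog runs on the beach\\ttraining_00009486\\n'    # a made-up example line\nsent, img_id = line.split('\\t')\nprint(repr(sent), repr(img_id.strip()))    # 'a dog runs on the beach' 'training_00009486'\n"
  },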
  {
    "path": "vokenization/evaluate_diversity.py",
    "content": "import argparse\nfrom collections import defaultdict\nimport json\nimport os\nimport sys\n\nimport numpy as np\nimport tqdm\n\nfrom vokenization import Vokenizer, load_model_and_tokenizer\nimport common\n\nimgset2fname = {\n    'coco_train': 'mscoco_train.json',\n    'coco_nominival': 'mscoco_nominival.json',\n    'coco_minival': 'mscoco_minival.json',\n    'vg_nococo': 'vgnococo.json',\n    'cc_train': 'training.tsv',\n    'cc_valid': 'validation.tsv',\n}\n\ntokenizer_name = 'bert-base-uncased'\n\n\ndef load_lang_data(corpus_name, topk=10000):\n    \"\"\"\n    Load {topk} sentences from the corpus named by {corpus_name}.\n    \"\"\"\n    fpath = corpus_name + '.' + tokenizer_name\n    tokens = []\n    with open(fpath) as f:\n        for i, line in enumerate(f):\n            tokens.append(list(map(int, line.split(' '))))\n            if (i + 1) == topk:\n                break\n    print(\"Read %d sentences from the corpus %s located at %s.\" % (\n        len(tokens), corpus_name, fpath\n    ))\n    return tokens\n\n\ndef load_cc_data(img_set):\n    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])\n    sents = []\n    with open(fname) as f:\n        for line in f:\n            sent, _ = line.split('\\t')\n            sents.append(sent)\n    print(\"Load the %d sentences for image set %s from %s\" % (\n        len(sents), img_set, fname))\n    return sents\n\n\ndef load_lxrt_data(img_set):\n    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])\n    sents = []\n    with open(fname) as f:\n        data = json.load(f)\n        for datum in data:\n            sents.extend(datum['sentf']['mscoco'])\n    print(\"Load the %d sentences for image set %s from %s\" % (\n        len(sents), img_set, fname))\n    return sents\n\n\ndef analyze(token2info):\n    \"\"\"\n    :param token2info: token2info: token --> (img_id --> cnt)\n    :return:\n    \"\"\"\n    names = ['Num Images', 'Max Cnt', 'Avg Cnt', 'Std Cnt']\n    results = np.zeros(4)\n    num_tokens = 0\n    for token in token2info:\n        img2cnt = token2info[token]\n        cnts = np.array(list(img2cnt.values()))\n        num_imgs = len(cnts)\n        max_cnt = cnts.max()\n        avg_cnt = cnts.mean()\n        std_cnt = cnts.std()\n        results += (num_imgs, max_cnt, avg_cnt, std_cnt)\n        num_tokens += 1\n    print(\"With %d tokens, \" % num_tokens)\n    results /= num_tokens\n    for name, result in zip(names, results):\n        print(\"Average of %s is %0.2f\" % (name, result))\n\n    corpus_info = defaultdict(lambda: 0)\n    for info in token2info.values():\n        for img, cnt in info.items():\n            corpus_info[img] += cnt\n    print(\"Cover %d images\" % len(corpus_info))\n\n# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'\nparser = argparse.ArgumentParser()\nparser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',\n                    help='The directory saved the model (containing'\n                         'BEST.pth.model).')\nparser.add_argument('--image-sets', type=str, default='coco_minival',\n                    help='The splits of images to be extracted')\nparser.add_argument('--corpus', type=str, default='wiki103',\n                    help='Evaluated corpus')\nparser.add_argument('--maxsents', type=int, default=10000,\n                    help='The maximum sentences to be evaluated in the corpus')\nargs = parser.parse_args()\n\nkeys_path = 
os.path.join(args.load, 'keys')\n\nprint(\"Evaluate for model %s on image sets %s\" % (args.load, args.image_sets))\nmodel, tokenizer = load_model_and_tokenizer(args.load)\nimg_sets = args.image_sets.split(',')\nvokenizer = Vokenizer(model, tokenizer, keys_path, img_sets)\n\ncorpus_list = args.corpus.split(',')\nfor corpus in corpus_list:\n    corpus = corpus.strip()\n    print(\"\\nProcessing corpus %s for diversity test:\" % corpus)\n    # token2info: token --> (img_id --> cnt)\n    token2info = defaultdict(lambda: defaultdict(lambda: 0))\n\n    if corpus in imgset2fname:\n        if 'cc' in corpus:\n            sents = load_cc_data(corpus)\n        else:\n            sents = load_lxrt_data(corpus)\n        batch_size = 32\n        for start_id in tqdm.tqdm(range(0, len(sents), batch_size)):\n            batch_sents = sents[start_id: start_id + batch_size]\n            scores, ids, tokens, paths = vokenizer.vokenize_sents(batch_sents, topk=None)\n            for i in range(len(paths)):\n                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):\n                    token2info[token][path] += 1\n    else:\n        tokens_list = load_lang_data(corpus, args.maxsents)\n        batch_size = 16\n        for start_id in tqdm.tqdm(range(0, len(tokens_list), batch_size)):\n            batch_tokens = tokens_list[start_id: start_id + batch_size]\n            scores, ids, tokens, paths = vokenizer.vokenize_ids(batch_tokens, topk=None)\n            for i in range(len(paths)):\n                for token, path in zip(tokens[i][1:-1], paths[i][1:-1]):\n                    token2info[token][path] += 1\n\n    analyze(token2info)\n\n\n\n\n"
  },
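  {
    "path": "vokenization/sketches/diversity_stats_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# A toy run of the statistics computed by analyze() in evaluate_diversity.py:\n# token2info maps token --> (img_id --> count), and the script reports, per token,\n# the number of distinct retrieved images and the max/mean/std of the counts.\nfrom collections import defaultdict\n\nimport numpy as np\n\ntoken2info = defaultdict(lambda: defaultdict(lambda: 0))\n# Pretend the vokenizer mapped these tokens to these images:\nfor token, img in [('dog', 'img1'), ('dog', 'img1'), ('dog', 'img2'), ('cat', 'img3')]:\n    token2info[token][img] += 1\n\nfor token, img2cnt in token2info.items():\n    cnts = np.array(list(img2cnt.values()))\n    print('%s: %d images, max=%d, avg=%0.2f, std=%0.2f' % (\n        token, len(cnts), cnts.max(), cnts.mean(), cnts.std()))\n# dog: 2 images, max=2, avg=1.50, std=0.50\n# cat: 1 images, max=1, avg=1.00, std=0.00\n"
  },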
  {
    "path": "vokenization/evaluate_retrieval.py",
    "content": "import argparse\nfrom collections import defaultdict\nimport json\nimport os\n\nimport tqdm\n\nfrom vokenization import Vokenizer, load_model_and_tokenizer\nimport common\n\nimgset2fname = {\n    'coco_train': 'mscoco_train.json',\n    'coco_nominival': 'mscoco_nominival.json',\n    'coco_minival': 'mscoco_minival.json',\n    'vg_nococo': 'vg_nococo.json',\n    'cc_train': 'training.tsv',\n    'cc_valid': 'validation.tsv',\n}\n\n\ndef load_cc_data(img_set):\n    fname = os.path.join(common.CC_ROOT, imgset2fname[img_set])\n    sentXimgname = []\n    with open(fname) as f:\n        for line in f:\n            sent, gt_img_name = line.split('\\t')\n            gt_img_name = gt_img_name.strip()\n            sentXimgname.append((sent, gt_img_name))\n    print(\"Load the %d (img, sent) pairs for image set %s from %s\" % (\n        len(sentXimgname), img_set, fname))\n    return sentXimgname\n\n\ndef load_lxrt_data(img_set):\n    fname = os.path.join(common.LXRT_ROOT, imgset2fname[img_set])\n    sentXimgname = []\n    with open(fname) as f:\n        data = json.load(f)\n        for datum in data:\n            gt_img_name = datum['img_id'] + '.jpg'\n            sents = datum['sentf']['mscoco']\n            for sent in sents:\n                sentXimgname.append((sent, gt_img_name))\n    print(\"Load the %d (img, sent) pairs for image set %s from %s\" % (\n        len(sentXimgname), img_set, fname))\n    return sentXimgname\n\n\n# load = '/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_bertl4'\nparser = argparse.ArgumentParser()\nparser.add_argument('--load', type=str, default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',\n                    help='The directory saved the model (containing'\n                         'BEST.pth.model).')\nparser.add_argument('--image-sets', type=str, default='coco_minival',\n                    help='The splits of images to be extracted')\nargs = parser.parse_args()\n\nkeys_path = os.path.join(args.load, 'keys')\n\nprint(\"Evaluate for model %s on image sets %s\" % (args.load, args.image_sets))\nmodel, tokenizer = load_model_and_tokenizer(args.load)\nimg_sets = args.image_sets.split(',')\n\nsent_level = 'sent' in args.load\n\nfor img_set in img_sets:\n    vokenizer = Vokenizer(model, tokenizer, keys_path, [img_set],\n                          sent_level=sent_level)\n    if 'cc' in img_set:\n        sentXimgname = load_cc_data(img_set)\n    else:\n        sentXimgname = load_lxrt_data(img_set)\n\n    topks = [1, 5, 10]\n    print(\"\\nEvaluate image set\", img_set, \"for topk retrieval:\", topks)\n    total = 0\n    arg_topk = None if max(topks) == 1 else max(topks)\n    results = defaultdict(lambda: 0)\n    batch_size = 32\n    for start_id in tqdm.tqdm(range(0, len(sentXimgname), batch_size)):\n        batch_sentXimg = sentXimgname[start_id: start_id + batch_size]\n        sents, gt_img_names = zip(*batch_sentXimg)\n        sents = list(sents)\n\n        scores, ids, tokens, paths_list = vokenizer.vokenize_sents(sents, topk=arg_topk)\n        if sent_level:\n            paths_list = [x[:3] for x in paths_list]     # Only eval the first vokens.\n        if arg_topk is None:\n            paths_list = [[[img_id] for img_id in sent] for sent in paths_list]\n        for paths, gt_img_name in zip(paths_list, gt_img_names):                # for each sent in batch\n            for topk_paths in paths[1:-1]:      # for each token in sent\n                for k, kth_path in enumerate(topk_paths): 
    # for each img_path in topk image paths of a token\n                    img_name = os.path.split(kth_path)[-1]\n                    if img_name == gt_img_name:\n                        results[k + 1] += 1\n        total += sum(map(lambda x: len(x) - 2, paths_list))\n\n    accumulate = 0\n    for i in range(1, max(topks)+1):\n        accumulate += results[i]\n        if i in topks:\n            print(\"R%d: %0.2f%%, (Random: %0.4f%%)\" % (\n                i,\n                accumulate / total * 100.,\n                i / vokenizer.img_num * 100.\n            ))\n\n    del vokenizer\n\n\n\n\n"
  },
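  {
    "path": "vokenization/sketches/retrieval_topk_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# evaluate_retrieval.py counts, for every token, the rank k at which the ground-truth\n# image is retrieved (results[k] = hits at exactly rank k), and then reports the\n# cumulative R@k = (# queries whose ground truth is within the top k) / (# queries).\nfrom collections import defaultdict\n\nresults = defaultdict(lambda: 0)\ntotal = 0\n# Toy retrievals: (ranked image list, ground-truth image) per query.\nretrievals = [(['a', 'b', 'c'], 'a'), (['b', 'a', 'c'], 'a'), (['b', 'c', 'a'], 'a')]\nfor ranked, gt in retrievals:\n    total += 1\n    for k, img in enumerate(ranked):\n        if img == gt:\n            results[k + 1] += 1\n\naccumulate = 0\nfor k in range(1, 4):\n    accumulate += results[k]\n    print('R%d: %0.2f%%' % (k, accumulate / total * 100.))\n# R1: 33.33%  R2: 66.67%  R3: 100.00%\n"
  },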
  {
    "path": "vokenization/extract_vision_keys.py",
    "content": "# In this file, we extract the vision features as the keys in retrieval.\nimport argparse\nimport os\nimport pickle\nimport shutil\nimport sys\n\nimport h5py\nimport torch\nfrom torchvision import transforms\nfrom torchvision.datasets.folder import default_loader\nimport tqdm\nfrom transformers import BertTokenizer\nfrom PIL import Image\n\nimport common\n\n# Load all images\nImage.MAX_IMAGE_PIXELS = None\n\n\ndef get_img_path(img_set, img_id):\n    \"\"\"\n    Get the paths regarding the img_set and img_id.\n    THIS FUNCTION MIGHT NEED TO BE MODIFIED.\n    \"\"\"\n    source, tag = img_set.split('_')\n    if source == 'cc':\n        split_tag, _ = img_id.split('_')\n        return \"%s/images/%s/%s\" % (common.CC_ROOT, split_tag, img_id)\n    elif 'COCO' in img_id:\n        _, split_tag, _ = img_id.split('_')\n        return \"%s/images/%s/%s\" % (common.COCO_ROOT, split_tag, img_id + '.jpg')\n    else:   # VG images\n        return \"%s/images/%s.jpg\" % (common.VG_ROOT, img_id)\n\n\ndef get_img_paths_and_ids(img_set):\n    \"\"\"\n    Return a list of images paths and image ids in this 'img_set'.\n    \"\"\"\n\n    # Load the image ids from the common local dir,\n    # thus make sure that the order of the images are the same.\n    info_dir = os.path.join(common.LOCAL_DIR, 'images')\n    img_paths = []\n    with open(os.path.join(info_dir, img_set + '.ids')) as f:\n        img_ids = list(map(lambda x: x.strip(), f.readlines()))\n    for img_id in img_ids:\n        img_paths.append(get_img_path(img_set, img_id))\n    return img_paths, img_ids\n\n\ndef save_img_paths_and_ids(img_set, img_paths, img_ids, output):\n    info_dir = os.path.join(common.LOCAL_DIR, 'images')\n\n    # Save Image Paths\n    curr_paths_fname = os.path.join(output, img_set + '.path')\n    print(\"\\tSave img paths to \", curr_paths_fname)\n    with open(curr_paths_fname, 'w') as f:\n        for path in img_paths:\n            f.write(path + \"\\n\")\n\n    # Save Image Ids\n    curr_ids_fname = os.path.join(output, img_set + '.ids')\n    print(\"\\tSave img ids to \", curr_ids_fname)\n    with open(curr_ids_fname, 'w') as f:\n        for idx in img_ids:\n            f.write(idx + \"\\n\")\n\n    common_paths_fname = os.path.join(info_dir, img_set + '.path')\n    if os.path.exists(common_paths_fname):\n        with open(common_paths_fname) as f:\n            common_img_paths = f.readlines()\n            common_img_paths = [img_path.strip() for img_path in common_img_paths]\n            # All feature extractor should extract for the same image set.\n            assert common_img_paths == img_paths\n    else:\n        shutil.copy(curr_paths_fname, common_paths_fname)\n\n\ndef extract_vision_feature_keys(model, img_transform, img_sets, output, batch_size):\n    \"\"\"\n\n    :param model: The visn_model which takes an image [b, channel, H, W] as input,\n                  and output with [b, f]\n    :param img_transform: The transformation of images, compatible with training.\n    :param img_sets: The sets of images to be extracted.\n    :param output: The directory to save the extracted keys.\n    :return:\n    \"\"\"\n    last_dim = -1\n    for img_set in img_sets:\n        print(\"Extracting feature keys for image set %s\" % img_set)\n        img_paths, img_ids = get_img_paths_and_ids(img_set)\n        saved_img_paths = []\n        saved_img_ids = []\n        img_keys = []\n        tensor_imgs = []\n        for i, img_path in enumerate(tqdm.tqdm(img_paths)):\n            try:\n                pil_img = 
default_loader(img_path)\n            except Exception as e:\n                print(e)\n                print(\"Skip image %s\" % img_path)\n                continue\n            saved_img_paths.append(img_path)\n            saved_img_ids.append(img_ids[i])\n\n            tensor_imgs.append(img_transform(pil_img))\n\n            if len(tensor_imgs) == batch_size:\n                visn_input = torch.stack(tensor_imgs).cuda()\n                with torch.no_grad():\n                    visn_output = model(visn_input)\n\n                # Check that the sizes of the features are equal.\n                if last_dim == -1:\n                    last_dim = visn_output.shape[-1]\n                assert last_dim == visn_output.shape[-1]\n\n                # Save the features for the hdf5 dump below.\n                img_keys.extend(visn_output.detach().cpu().numpy())\n\n                tensor_imgs = []\n\n        if len(tensor_imgs) > 0:\n            visn_input = torch.stack(tensor_imgs).cuda()\n            with torch.no_grad():\n                visn_output = model(visn_input)\n            # Save the features for the hdf5 dump below.\n            img_keys.extend(visn_output.detach().cpu().numpy())\n\n        assert len(img_keys) == len(saved_img_paths)\n        h5_path = os.path.join(output, img_set + '.hdf5')\n        print(f\"\\tSave features (keys) to {h5_path} with hdf5 dataset 'keys'.\")\n        h5_file = h5py.File(h5_path, 'w')\n        dset = h5_file.create_dataset(\"keys\", (len(saved_img_paths), last_dim))\n        for i, img_key in enumerate(img_keys):\n            dset[i] = img_key\n        save_img_paths_and_ids(img_set, saved_img_paths, saved_img_ids, output)\n        h5_file.close()\n\n\n# This default transformation is used by PyTorch ResNet on ImageNet.\nnormalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n                                 std=[0.229, 0.224, 0.225])\ndefault_transform = transforms.Compose([\n    transforms.Resize(256),\n    transforms.CenterCrop(224),\n    transforms.ToTensor(),\n    normalize\n])\n\n\nfrom torch import nn\nimport torchvision.models as models\n\n\ndef get_visn_arch(arch):\n    try:\n        return getattr(models, arch)\n    except AttributeError:\n        print(\"There is no arch %s in torchvision.\" % arch)\n        raise\n\n# __all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',\n#            'resnet152', 'resnext50_32x4d', 'resnext101_32x8d',\n#            'wide_resnet50_2', 'wide_resnet101_2']\n\n\nclass VisnModel(nn.Module):\n    def __init__(self, arch='resnet50', pretrained=True):\n        \"\"\"\n        :param arch: backbone architecture (a torchvision model name)\n        :param pretrained: load the ImageNet pre-trained weights\n        \"\"\"\n        super().__init__()\n        # Setup Backbone\n        resnet = get_visn_arch(arch)(pretrained=pretrained)\n        for param in resnet.parameters():\n            param.requires_grad = False\n        resnet.fc = nn.Identity()\n        self.backbone = resnet\n\n    def forward(self, img):\n        \"\"\"\n        :param img: a tensor of shape [batch_size, C, H, W]\n        :return: a tensor of [batch_size, d]\n        \"\"\"\n        x = self.backbone(img)\n        x = x.detach()\n        # x = x / x.norm(2, dim=-1, keepdim=True)\n        return x\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--load-dir', type=str, default=None,\n                        help='The directory that saved the model (containing '\n                             'BEST.pth.model).')\n    parser.add_argument('--torchvision-model', type=str, default=None,\n                        help='The name of a torchvision architecture '\n                             '(e.g., resnet50) to extract features with.')\n    parser.add_argument('--image-sets', type=str, default='coco_minival',\n                        help='The splits of images to be extracted')\n    parser.add_argument('--output-dir', type=str, default=None,\n                        help='The directory to save the extracted feature keys')\n    parser.add_argument('--batch-size', type=int, default=32)\n    args = parser.parse_args()\n\n    img_sets = [img_set.strip() for img_set in args.image_sets.split(',')]\n\n    if args.torchvision_model is not None:\n        assert args.load_dir is None, (\"either load from a torchvision model using option 'torchvision_model' \"\n                                       \"or from a pre-trained CoX model with option 'load_dir'\")\n        visn_model = VisnModel(arch=args.torchvision_model).eval().cuda()\n        if args.batch_size > 1:\n            # For multi-batch extraction, we must use the same image size.\n            img_transform = transforms.Compose([\n                transforms.Resize(256),\n                transforms.CenterCrop(224),\n                transforms.ToTensor(),\n                normalize\n            ])\n        else:\n            # For single-batch extraction, we want to extract high-quality features, with two modifications:\n            #    1. Use large image sizes (400 - 600)\n            #    2. Keep the aspect ratio\n            MIN_SIZE = 400.\n            MAX_SIZE = 600.\n            def img_transform_func(img):\n                img_w, img_h = img.size     # PIL Image's size order is w, h\n                assert img_w > 0 and img_h > 0\n                scale = min(\n                    MIN_SIZE / min(img_w, img_h),\n                    MAX_SIZE / max(img_w, img_h),\n                )\n                # Keep the aspect ratio\n                want_w, want_h = int(img_w * scale), int(img_h * scale)\n\n                _img_transform = transforms.Compose([\n                    transforms.Resize((want_h, want_w)),    # PyTorch uses size order h, w\n                    transforms.ToTensor(),\n                    normalize\n                ])\n                return _img_transform(img)\n            img_transform = img_transform_func\n    else:\n        # Load the model\n        if os.path.exists(args.load_dir + '/BEST.pth.model'):\n            print(\"Load model from %s.\" % (args.load_dir + '/BEST.pth.model'))\n            sys.path.append(args.load_dir + '/src')\n            for dirc in os.listdir(args.load_dir + '/src'):\n                sys.path.append(args.load_dir + '/src/' + dirc)\n            # import model        # The pickle has some issues... thus we must load the library.\n            joint_model = torch.load(args.load_dir + '/BEST.pth.model')\n            joint_model.eval()            # DO NOT FORGET THIS!!!\n            visn_model = joint_model.visn_model\n        else:\n            print(f\"No snapshot {args.load_dir + '/BEST.pth.model'}. Exit.\")\n            exit()\n\n        # Load the img-preprocessing transformation, which was used in training the CoX model.\n        if os.path.exists(args.load_dir + '/img_transform.pkl'):\n            print(\"Load img transformation from %s.\" % (args.load_dir + '/img_transform.pkl'))\n            with open(args.load_dir + '/img_transform.pkl', 'rb') as f:\n                img_transform = pickle.load(f)\n        else:\n            print(\"Using default image transformation\")\n            img_transform = default_transform\n\n    # Feature output directory\n    output_dir = args.output_dir\n    if args.output_dir is None:\n        output_dir = args.load_dir + '/keys'      # Save the keys with the model dict\n    os.makedirs(output_dir, exist_ok=True)\n\n    extract_vision_feature_keys(\n        visn_model,\n        img_transform,\n        img_sets,\n        output_dir,\n        args.batch_size\n    )\n"
  },
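  {
    "path": "vokenization/sketches/read_keys_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# It reads back what extract_vision_keys.py saves: an [N, d] float dataset named\n# 'keys' in {output}/{img_set}.hdf5, row-aligned with the {img_set}.ids and\n# {img_set}.path files in the same directory.\nimport os\n\nimport h5py\n\n\ndef load_keys(output_dir, img_set):\n    with open(os.path.join(output_dir, img_set + '.ids')) as f:\n        img_ids = [line.strip() for line in f]\n    with h5py.File(os.path.join(output_dir, img_set + '.hdf5'), 'r') as h5_file:\n        keys = h5_file['keys'][:]    # a numpy array of shape [N, d]\n    assert len(keys) == len(img_ids)    # one feature row per image id\n    return keys, img_ids\n\n\nif __name__ == '__main__':\n    # The directory is an assumption; it matches the default '{load_dir}/keys' above.\n    keys, img_ids = load_keys('snap/keys', 'coco_minival')\n    print(keys.shape, img_ids[:3])\n"
  },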
  {
    "path": "vokenization/indexing.py",
    "content": "import numpy as np\nimport torch\nimport tqdm\n\n\nclass GPUIndexer(object):\n    def __init__(self, keys, gpus=(0,), fp16=False):\n        self.gpus = gpus\n        self.gpu = gpus[0]\n        self.keys = keys\n        self.fp16 = fp16\n        self.dim = len(self.keys[0])\n\n    def topk(self, query, topk: int = 1):\n        raise NotImplementedError\n\n    def batch_topk(self, query, topk: int = 1):\n        raise NotImplementedError\n\n    def batch_top1(self, query):\n        raise NotImplementedError\n\n\nclass TorchGPUIndexer(GPUIndexer):\n    def __init__(self, keys, gpus=(0,), fp16=False):\n        super().__init__(keys, gpus, fp16)\n        self.gpu_keys = torch.tensor(keys).cuda(self.gpu)\n        print(f\"Build torch indexer on GPU {self.gpu}\")\n\n        if self.fp16:\n            self.gpu_keys = self.gpu_keys.half()\n\n    def topk(self, query, topk: int = 1):\n        if not type(query) is torch.Tensor:\n            query = torch.tensor(query)\n        query = query.cuda(self.gpu)\n        if self.fp16:\n            query = query.half()\n        score = (self.gpu_keys * query).sum(-1)\n        topk_score, topk_idx = score.topk(topk)\n        return topk_score, topk_idx\n\n    def batch_topk(self, query, topk: int = 1):\n        if not type(query) is torch.Tensor:\n            query = torch.tensor(query)\n        query = query.cuda(self.gpu)\n        if self.fp16:\n            query = query.half()\n        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)\n        topk_score, topk_idx = score.topk(topk, dim=1)\n        return topk_score, topk_idx\n\n    def batch_top1(self, query):\n        if not type(query) is torch.Tensor:\n            query = torch.tensor(query)\n        query = query.cuda(self.gpu)\n        if self.fp16:\n            query = query.half()\n        score = (self.gpu_keys.unsqueeze(0) * query.unsqueeze(1)).sum(-1)\n        topk_score, topk_idx = score.max(dim=1)\n        return topk_score, topk_idx\n\n    def batch_top1_l2(self, query):\n        if not type(query) is torch.Tensor:\n            query = torch.tensor(query)\n        query = query.cuda(self.gpu)\n        if self.fp16:\n            query = query.half()\n        # print(query.norm(dim=-1) - 1.)\n        # print(self.gpu_keys.norm(dim=-1) - 1.)\n        score = ((self.gpu_keys.unsqueeze(0) - query.unsqueeze(1)) ** 2).sum(-1)\n        topk_score, topk_idx = score.min(dim=1)\n        return topk_score, topk_idx\n\n\nclass FaissGPUIndexer(GPUIndexer):\n    def __init__(self, keys, gpus=(0,), fp16=False):\n        try:\n            import faiss\n        except Exception as e:\n            print(\"Faiss is not installed! 
Please see https://github.com/facebookresearch/faiss/blob/master/INSTALL.md.\")\n            raise e\n        super().__init__(keys, gpus, fp16)\n        res = faiss.StandardGpuResources()\n        index_flat = faiss.IndexFlatL2(self.dim)\n        # index_flat = faiss.IndexFlatIP(self.dim)\n        print(f\"Build faiss indexer on GPU {self.gpu}\")\n        print(keys.shape)\n        self.gpu_index_flat = faiss.index_cpu_to_gpu(res, self.gpu, index_flat)\n        self.gpu_index_flat.add(keys)\n\n    def batch_topk(self, query, topk: int = 1):\n        if type(query) is torch.Tensor:\n            query = query.cpu().numpy()\n        D, I = self.gpu_index_flat.search(query, topk)\n        D = torch.from_numpy(D)\n        I = torch.from_numpy(I)\n        return D, I\n\n    def batch_top1(self, query):\n        \"\"\"\n        :param query: shape of [b, f]\n        \"\"\"\n        if type(query) is torch.Tensor:\n            query = query.cpu().numpy()\n        D, I = self.gpu_index_flat.search(query, 1)\n        D = D[:, 0]\n        I = I[:, 0]\n        D = torch.from_numpy(D)\n        I = torch.from_numpy(I)\n        return D, I\n\n\nif __name__ == '__main__':\n    # 1M keys and 1M queries, as a speed test of the indexer.\n    keys = np.random.uniform(size=(1000000, 64)) * 0.01\n    querys = np.random.uniform(size=(1000000, 64)) * 0.01\n    indexer = TorchGPUIndexer(keys, [0], fp16=True)\n    batch_size = 64\n    for start in tqdm.tqdm(range(0, len(querys), batch_size)):\n        query = querys[start: start + batch_size]\n        # indexer.batch_topk(query, 1)\n        top_score, top_idx = indexer.batch_top1(query)\n"
  },
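  {
    "path": "vokenization/sketches/indexing_ip_vs_l2_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# TorchGPUIndexer in indexing.py scores by inner product, while FaissGPUIndexer is\n# built on faiss.IndexFlatL2. For unit-norm keys and queries the two retrievals agree:\n#   ||k - q||^2 = ||k||^2 + ||q||^2 - 2<k, q> = 2 - 2<k, q>,\n# so minimizing the L2 distance maximizes the dot product.\nimport torch\n\nkeys = torch.nn.functional.normalize(torch.randn(1000, 64), dim=-1)\nquery = torch.nn.functional.normalize(torch.randn(1, 64), dim=-1)\n\nip_top1 = (keys * query).sum(-1).argmax()           # what TorchGPUIndexer maximizes\nl2_top1 = ((keys - query) ** 2).sum(-1).argmin()    # what IndexFlatL2 minimizes\nassert ip_top1 == l2_top1\nprint('both retrieve key', ip_top1.item())\n"
  },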
  {
    "path": "vokenization/revokenization.py",
    "content": "# Copyleft 2020 project COL.\n\nfrom transformers import AutoTokenizer\n\n\nclass ReVokenizer:\n    \"\"\"\n    Convert a\n    \"\"\"\n    def __init__(self, forward_tokenizer_name, backward_tokenizer_name, vokenizer):\n        \"\"\"\n        :args forward_tokenizer:\n        :args backward_tokenizer:\n        :args vokenizer:\n        \"\"\"\n        self.forward_tokenizer = AutoTokenizer.from_pretrained(forward_tokenizer_name, use_fast=True)\n        self.backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name, use_fast=True)\n        self.slow_backward_tokenizer = AutoTokenizer.from_pretrained(backward_tokenizer_name)\n        self.vokenizer = vokenizer\n\n        self.prepare_for_unicode()\n\n    def vokenize_sent(self, sents, topk=None):\n        pass\n\n    def vokenize_ids(self, input_ids, topk=None, verbose=False):\n        \"\"\"\n           backward_input\n        <-- Backward Tokenizer\n        <--    Sentence   -->\n        Forward Tokenizer -->\n            forward_input --> Vokenizer --> forward_results\n        \"\"\"\n        sents, forward_input, backward_input = self.process(input_ids)\n        alignments = self.batch_calculate_alignment(\n            forward_input['offset_mapping'],\n            backward_input['offset_mapping'],\n        )\n        forward_results = self.vokenizer.vokenize_ids(\n            forward_input['input_ids'], topk\n        )\n        backward_results = self.batch_map_back(forward_results, alignments)\n        if verbose:\n        # if True:\n            self.show_alignments(\n                sents, forward_input, backward_input, alignments,\n                input_ids, backward_results)\n        return backward_results\n\n    def show_alignments(self, sents, forward_inputs, backward_inputs, alignments, input_ids,\n                        backward_results):\n        forward_ids = forward_inputs['input_ids']\n        forward_offsets = forward_inputs['offset_mapping']\n        backward_ids = backward_inputs['input_ids']\n        backward_offsets = backward_inputs['offset_mapping']\n        _, _, backward_result_tokens, _ = backward_results\n        for sent, forward_id, backward_id, forward_offset, backward_offset, alignment, input_id, backward_result_token in zip(\n            sents, forward_ids, backward_ids, forward_offsets, backward_offsets, alignments, input_ids, backward_result_tokens\n        ):\n            print(sent)\n            for backward_idx, forward_idx in enumerate(alignment):\n                def get_str(l, r):\n                    return sent[l: r]\n                print(\"%2d %2d %7s %7s %7s  |  %7s %7s %7s\" % (\n                    backward_idx, forward_idx,\n                    self.backward_tokenizer._convert_id_to_token(input_id[backward_idx]),\n                    self.backward_tokenizer._convert_id_to_token(backward_id[backward_idx]),\n                    get_str(*backward_offset[backward_idx]),\n                    self.forward_tokenizer._convert_id_to_token(forward_id[forward_idx]),\n                    backward_result_token[backward_idx + 1],\n                    get_str(*forward_offset[forward_idx]),\n                ))\n            print()\n\n    def show_input(self, sents, forward_inputs, backward_inputs, input_ids):\n        forward_ids = forward_inputs['input_ids']\n        forward_offsets = forward_inputs['offset_mapping']\n        backward_ids = backward_inputs['input_ids']\n        backward_offsets = backward_inputs['offset_mapping']\n\n        for sent, forward_id, 
backward_id, forward_offset, backward_offset, input_id in zip(\n                sents, forward_ids, backward_ids, forward_offsets, backward_offsets, input_ids\n        ):\n            print(sent)\n            for i, (backward_i, bo, input_i) in enumerate(zip(backward_id, backward_offset, input_id)):\n                print(\"%7s %7s\" % (\n                    self.backward_tokenizer._convert_id_to_token(backward_i),\n                    self.backward_tokenizer._convert_id_to_token(input_i),\n                    # self.forward_tokenizer._convert_id_to_token(forward_i),\n                ), bo, sent[bo[0]: bo[1]] if bo is not None else '')\n            print()\n\n\n    def backward_decode(self, input_id):\n        # return u''.join(self.backward_tokenizer.convert_ids_to_tokens(input_id)).replace('Ġ', ' ')\n        # return self.backward_tokenizer.decode(input_id)\n        tokens = self.slow_backward_tokenizer.convert_ids_to_tokens(input_id, skip_special_tokens=True)\n        # print(tokens)\n        return self.slow_backward_tokenizer.convert_tokens_to_string(\n            tokens\n        )\n\n    def process(self, input_ids):\n        \"\"\"\n        :return: two dicts (forward_input, backward_input)\n            with keys \"input_ids\" \"offset_mapping\"\n        \"\"\"\n        sents = [self.backward_decode(input_id) for input_id in input_ids]\n        tokenizer_kwargs = {\n            'return_token_type_ids': False,\n            'return_attention_mask': False,\n            'return_offsets_mapping': True,\n        }\n        # 'add_special_tokens': False,\n        forward_input = self.forward_tokenizer.batch_encode_plus(\n            sents,\n            **tokenizer_kwargs\n        )\n        backward_input = self.backward_tokenizer.batch_encode_plus(\n            sents,\n            **tokenizer_kwargs\n        )\n\n        # Avoid batch-1\n        self._safe_guard(forward_input)\n        self._safe_guard(backward_input)\n\n        # Remove <cls> and <sep>\n        self._remove_special_tokens(forward_input)\n        self._remove_special_tokens(backward_input)\n\n        # postprocessing of the backwards\n        self._calibrate_backward_offset(backward_input)\n        # self._fix_nouns(backward_input)\n        self._fix_length(backward_input, input_ids)\n\n        assert list(map(len, backward_input['input_ids'])) == \\\n               list(map(len, input_ids)), (list(map(len, backward_input['input_ids'])),\n               list(map(len, input_ids)))\n        return sents, forward_input, backward_input\n\n    @staticmethod\n    def _safe_guard(inputs):\n        ids = inputs['input_ids']\n        if type(ids[0]) is int:\n            for key, value in inputs.items():\n                inputs[key] = [value]\n\n    @staticmethod\n    def _remove_special_tokens(inputs):\n        if type(inputs) is dict:\n            for key in inputs:\n                inputs[key] = ReVokenizer._remove_special_tokens(inputs[key])\n            return inputs\n        return [input[1:-1] for input in inputs]\n\n    @staticmethod\n    def _fix_nouns(backward_input):\n        backward_offsets = backward_input['offset_mapping']\n        for backward_offset in backward_offsets:\n            last_not_noun_idx = -1\n            while backward_offset[last_not_noun_idx] is None:\n                last_not_noun_idx -= 1\n            for noun_idx in range(last_not_noun_idx + 1, 0):\n                backward_offset[noun_idx] = backward_offset[last_not_noun_idx]\n\n    @staticmethod\n    def _fix_length(backward_input, 
input_ids):\n        backward_ids = backward_input['input_ids']\n        backward_offsets = backward_input['offset_mapping']\n        for i in range(len(backward_ids)):\n            desired_length = len(input_ids[i])\n            if len(backward_ids[i]) > desired_length:\n                backward_ids[i] = backward_ids[i][:desired_length]\n                backward_offsets[i] = backward_offsets[i][:desired_length]\n\n            while len(backward_ids[i]) < desired_length:\n                backward_ids[i].append(backward_ids[i][-1])\n                backward_offsets[i].append(backward_offsets[i][-1])\n\n            # print(desired_length)\n            # print(len(backward_ids[i]))\n            assert desired_length == len(backward_ids[i]) == len(backward_offsets[i])\n\n    def _calibrate_backward_offset(self, backward_input):\n        batch_input_ids = backward_input['input_ids']\n        batch_new_offset = []\n        for input_ids in batch_input_ids:\n            now = 0\n            byte_list = []\n            new_offset = []\n            for input_id in input_ids:\n                token = self.backward_tokenizer._convert_id_to_token(input_id)\n                start = now\n                unicode_complete_flag = True\n                for char in token:\n                    byte = self.c2b[char]\n                    byte_list.append(byte)\n                    try:\n                        unicode_char = bytes(byte_list).decode('utf-8')\n                        byte_list = []\n                        now += 1\n                        unicode_complete_flag = True\n                    except UnicodeDecodeError:\n                        unicode_complete_flag = False\n                if unicode_complete_flag:\n                    left, right = start, now\n                else:\n                    left, right = start, now + 1\n                new_offset.append((left, right))\n            # print(token, sent[left: right].replace(' ', 'Ġ'))\n            batch_new_offset.append(new_offset)\n        backward_input['offset_mapping'] = batch_new_offset\n\n    def prepare_for_unicode(self):\n        def bytes_to_unicode():\n            \"\"\"\n            Returns a list of utf-8 bytes and a mapping to unicode strings.\n            We specifically avoid mapping to whitespace/control characters the bpe code barfs on.\n            The reversible bpe codes work on unicode strings.\n            This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n            When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n            This is a significant percentage of your normal, say, 32K bpe vocab.\n            To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n            \"\"\"\n            bs = (\n                    list(range(ord(\"!\"), ord(\"~\") + 1)) + list(range(ord(\"¡\"), ord(\"¬\") + 1)) + list(\n                range(ord(\"®\"), ord(\"ÿ\") + 1))\n            )\n            cs = bs[:]\n            n = 0\n            for b in range(2 ** 8):\n                if b not in bs:\n                    bs.append(b)\n                    cs.append(2 ** 8 + n)\n                    n += 1\n            cs = [chr(n) for n in cs]\n            return dict(zip(bs, cs))\n        self.b2c = bytes_to_unicode()\n        self.c2b = {c: b for b, c in self.b2c.items()}\n\n    def show(self, ids_list):\n        print(\n            [self.backward_tokenizer.convert_ids_to_tokens(ids) for ids in ids_list]\n        
)\n\n    @staticmethod\n    def batch_map_back(results, alignments):\n        if type(results) is tuple:\n            # Handle multiple output by the vokenizer\n            #   i.e., input_ids, input_scores, ...\n            return [ReVokenizer.batch_map_back(one_results, alignments) for one_results in results]\n        new_results = []\n        for result, alignment in zip(results, alignments):\n            # print(result)\n            # print(max(alignment), len(result))\n            new_results.append(\n                [result[0]] + [result[idx + 1] for idx in alignment] + [result[-1]])\n            assert max(alignment) < (len(result) - 2)\n        return new_results\n\n    @staticmethod\n    def batch_calculate_alignment(batch_forward_offsets, batch_backward_offsets):\n        \"\"\"\n        for each backward_token indicated by backward offset, align a forward token to it.\n        \"\"\"\n        alignments = []\n        for forward_offsets, backward_offsets in zip(batch_forward_offsets, batch_backward_offsets):\n            alignment = []\n            # Backward: I  ha ve a lov ely  c at.\n            # Sent:     I  have  a lovely   cat\n            # Forward:  I  hav e a lo ve ly cat.\n            now_idx = 0\n            for backward_offset in backward_offsets:\n                best_idx = now_idx\n                best_iou = IoU(forward_offsets[best_idx], backward_offset)\n                while (now_idx + 1 < len(forward_offsets)) and \\\n                      (forward_offsets[now_idx][1] < backward_offset[1]):\n                    now_idx += 1\n                    now_iou = IoU(forward_offsets[now_idx], backward_offset)\n                    if now_iou > best_iou:\n                        best_idx = now_idx\n                        best_iou = now_iou\n                alignment.append(best_idx)\n            alignments.append(alignment)\n        return alignments\n\n\ndef IoU(a, b):\n    x1, y1 = a\n    x2, y2 = b\n    len1 = y1 - x1\n    len2 = y2 - x2\n    I = max(min(y1, y2) - max(x1, x2), 0)\n    U = len1 + len2 - I\n    return I / max(U, 1)\n\n\nif __name__ == \"__main__\":\n    revokenizer = ReVokenizer('bert-base-uncased', 'roberta-base', None)\n    tokenizer = AutoTokenizer.from_pretrained('roberta-base')\n    sents = ['Do not panic. ',\n             ' iso have a dream .',\n             ' This is a test???',\n             'Congratulations to the LiLT Founder and CEO, @stanfordnlp grad, Spence Green!',\n             'Ay congrats Ethan! An awesome crew, well deserved',\n             ' By the fourth season, fewer than three million viewers tuned in each week despite what some fans and critics considered an increase in episode quality.',\n             'Filming of the final episode began on Friday, February 25, after the first half of the day was spent completing \"Terra Prime\". Principal photography took eight days to complete, one day longer than usual. ',\n             'sda asdo weij sdjf oweif bqosdj weorasd.?SdfasXX...',\n             ]\n\n    ids = [tokenizer.encode(sent, add_special_tokens=False) for sent in sents]\n    print(sents)\n    sents = [tokenizer.decode(idx) for idx in ids]\n    print(sents)\n    revokenizer.vokenize_ids(ids)\n\n"
  },
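  {
    "path": "vokenization/sketches/offset_alignment_sketch.py",
    "content": "# NOTE: an illustrative sketch added for this write-up, not part of the original repo.\n# A toy run of the character-offset alignment behind ReVokenizer: every token of the\n# backward tokenizer is matched to the forward token whose character span has the\n# highest IoU, so vokens computed on one tokenization can be mapped onto another.\n# (batch_calculate_alignment sweeps the offsets monotonically; this brute-force max\n# over all forward spans gives the same answer on this toy case.)\n\ndef IoU(a, b):\n    (x1, y1), (x2, y2) = a, b\n    inter = max(min(y1, y2) - max(x1, x2), 0)\n    union = (y1 - x1) + (y2 - x2) - inter\n    return inter / max(union, 1)\n\n\n# \"a cat\": the forward tokenizer splits 'cat' into 'ca' + 't'; the backward one keeps it whole.\nforward_offsets = [(0, 1), (2, 4), (4, 5)]\nbackward_offsets = [(0, 1), (2, 5)]\nfor b_span in backward_offsets:\n    best = max(range(len(forward_offsets)), key=lambda i: IoU(forward_offsets[i], b_span))\n    print(b_span, '-> forward token', best)\n# (0, 1) -> forward token 0\n# (2, 5) -> forward token 1\n"
  },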
  {
    "path": "vokenization/revokenize_corpus_mp.py",
    "content": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimport os\nimport queue\nimport sys\nimport time\n\nimport h5py\nimport torch\nimport tqdm\nfrom spacy.lang.en import English\n\nsys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nfrom vokenization.vokenization import load_model_and_tokenizer, Vokenizer\nfrom vokenization.revokenization import ReVokenizer\n\n\n# Handle the GPU issue in multi-processing.\nfrom multiprocessing import set_start_method\ntry:\n    set_start_method('spawn')\nexcept RuntimeError:\n    pass\n\n\ndef processer(args, input_queue, output_queue):\n    print(f\"Setup workers on gpu {args.gpus}\")\n    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])\n\n    print(\"Build models and tokenizer\")\n    # We will assign the GPU to model latter, thus load to cpu first!\n    model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)\n    keys_dir = args.load + '/keys'  # Save the keys with the model dict\n\n    print(\"Build Retriever from %s with image sets\" % keys_dir, img_sets)\n    vokenizer = Vokenizer(model, tokenizer, keys_dir,\n                          img_sets=img_sets, max_img_num=args.max_img_num,\n                          gpus=args.gpus, sent_level=('sent' in args.load))\n    print(f\"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.\")\n\n    # Before vokenization, save the image ids\n    dset_name = os.path.split(args.corpus)[-1]\n    modifier = f\".{vokenizer.img_num}\" if vokenizer.img_num != 50000 else \"\"\n    vokens_img_ids_path = os.path.join(\n        args.output,\n        f\"{dset_name}.{'_'.join(img_sets)}{modifier}.ids\"\n    )\n    if args.gpus[0] == 0:\n        if os.path.exists(vokens_img_ids_path):\n            # If the img_ids file exists, assert that they are the same.\n            saved_img_ids = open(vokens_img_ids_path).readlines()\n            img_ids = vokenizer.img_ids\n            assert len(saved_img_ids) == len(img_ids)\n            for saved_img_id, img_id in zip(saved_img_ids, img_ids):\n                assert saved_img_id.strip() == img_id\n        else:\n            vokenizer.dump_img_ids(vokens_img_ids_path)\n\n    while True:\n        page_id, sents = input_queue.get()\n        # Print the first few sents for debugging\n        if args.gpus[0] == 0:\n            if page_id < 12 and sents is not None:\n                print('page_id:', page_id)\n                print('batch_size:', len(sents))\n                print('ids of sent[0]:', sents[0])\n                print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0]))\n                print()\n        # print(f\"Processer {args.gpus}: Get Page Id {page_id}\")\n        if sents is not None:\n            output_str = ''\n            results = vokenizer.vokenize_ids(sents)\n            idxs = results[1]\n            for j, idx in enumerate(idxs):\n                assert len(idx[1:-1]) == len(sents[j])\n                dump_idx = map(lambda x: str(x.item()), idx[1:-1])\n                output_str += ' '.join(dump_idx) + '\\n'\n\n            output_queue.put((page_id, output_str))\n        else:\n            break\n\n\ndef reducer(output_fname, output_queue, total_tokens):\n    next_page_id = 0\n    heap = queue.PriorityQueue()\n    output = open(output_fname, 'a')\n    cache = \"\"\n    start_time = None\n    processed_tokens = 0\n\n    while True:\n        page_id, result = output_queue.get()\n        if 
start_time is None:      # The clock starts to tick when receiving the first package.\n            start_time = time.time()\n        # print(\"Reducer: Get Page Id %d\" % page_id)\n        if result is not None:\n            # Put it into the heap\n            heap.put((page_id, result))\n\n            # Check for ready-to-be-dumped data in the queue\n            while heap.qsize() > 0:\n                smallest_page_id, result = heap.get()\n                if smallest_page_id == next_page_id:\n                    # This page is the next expected page, thus dump it.\n                    # print(\"Reducer: Commit Page Id %d\" % next_page_id)\n                    processed_tokens += len(result.split(' '))\n                    cache += result\n                    next_page_id += 1\n                else:\n                    heap.put((smallest_page_id, result))\n                    break\n            # print(\"Reducer: Length of Cache Now\", len(cache))\n            if len(cache) > 1000000:\n                # Dump for every 1000000 characters to reduce IO calls\n                output.write(cache)\n                output.flush()\n                cache = ''\n                used_time = int(time.time() - start_time)\n                print(\"Processed %d tokens, %d to go, at a speed of %0.2f tokens/second; \"\n                      \"will finish in %0.2f hours\" % (\n                    processed_tokens,\n                    total_tokens - processed_tokens,\n                    processed_tokens / used_time,\n                    (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600\n                ))\n        else:\n            if len(cache) > 0:\n                output.write(cache)\n                output.flush()\n                cache = ''\n            break\n\n    output.close()\n\n\ndef setup_mp(args, tokens, sent_ranges, vokens_path):\n    QUEUE_SIZE = 10000\n    input_queue = Queue(maxsize=QUEUE_SIZE)\n    output_queue = Queue(maxsize=QUEUE_SIZE)\n\n    workers = []\n    num_gpu = torch.cuda.device_count()\n    for worker_id in range(args.num_workers):\n        gpu_id = worker_id % num_gpu\n        curr_args = copy.copy(args)\n        curr_args.gpus = (gpu_id,)\n        worker = Process(target=processer,\n                         args=(curr_args, input_queue, output_queue))\n        worker.daemon = True\n        worker.start()\n        workers.append(worker)\n\n    total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0\n    reduce = Process(target=reducer,\n                     args=(vokens_path, output_queue, total_tokens))\n    reduce.start()\n\n    for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)):\n        sents = []\n        for left, right in sent_ranges[start_id: start_id + args.batch_size]:\n            sents.append(tokens[left: right])\n        input_queue.put((i, sents))\n\n    # Notify the workers of the end of the input\n    for _ in workers:\n        input_queue.put((-1, None))\n\n    # Wait for the workers to terminate\n    for w in workers:\n        w.join()\n\n    # Notify the reducer of the end of the output\n    output_queue.put((-1, None))\n\n    # Wait for the reducer to terminate\n    reduce.join()\n\n\ndef segment_sent(\n        tokens,\n        tokenizer,\n        tokens_line_info_path,\n        tokens_sent_info_path\n    ):\n    \"\"\"\n    Single-processed segmentation of sentences. We might need to parallelize this as well.
\n    \"\"\"\n    with open(tokens_line_info_path) as f:\n        line_starts = list(map(int, f.readlines()))\n\n    nlp = English()\n    sentencizer = nlp.create_pipe(\"sentencizer\")\n    nlp.add_pipe(sentencizer)\n\n    sent_starts = [0]\n    now = 0\n    for i in tqdm.tqdm(range(len(line_starts) - 1)):\n        start_token_idx = line_starts[i]\n        end_token_idx = line_starts[i + 1]\n        line_tokens = tokens[start_token_idx: end_token_idx]\n        line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens))\n        line = line.replace(\"[UNK]\", \"UNK\")\n\n        doc = nlp(line)\n        sents_len = 0\n        sents = []\n        for sent in doc.sents:\n            if i < 2:\n                print(sent)\n            sent = str(sent)\n            sents.append(sent)\n            words = sent.split(' ')\n            sent_len = len(words)\n            now += sent_len\n            sent_starts.append(now)\n            sents_len += sent_len\n\n        if sents_len != len(line_tokens):\n            print(sents_len)\n            print(sents)\n            print(len(line_tokens))\n            print(line)\n            assert False\n        assert sent_starts[-1] == end_token_idx\n\n    with open(tokens_sent_info_path, 'w') as f:\n        for sent_start in sent_starts:\n            f.write(str(sent_start) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Text\n    parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw')\n    # Models\n    parser.add_argument('--load', type=str,\n                        default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',\n                        help='The directory that saves the model (containing '\n                             'BEST.pth.model).')\n    parser.add_argument('--output', type=str, default=None,\n                        help='The directory to save the extracted vokens. '\n                             '\"None\" would save them in the \"load\" dir.')\n    parser.add_argument('--backward-tokenizer-name', type=str, default='roberta-base')\n    parser.add_argument('--forward-tokenizer-name', type=str, default='roberta-base')\n    # Vision: Define the vokens set\n    parser.add_argument('--image-sets', type=str, default='vg_nococo',\n                        help='The splits of images to be extracted')\n    parser.add_argument('--max-img-num', type=int, default=50000,\n                        help='number of images used. -1 means all images.')
\n    # Speed Up Options:\n    parser.add_argument('--num-workers', type=int, default=-1,\n                        help='-1 will use all GPUs.')\n    parser.add_argument('--batch-size', type=int, default=16,\n                        help='The # of sentences in a batch.')\n    args = parser.parse_args()\n\n    if args.num_workers == -1:\n        args.num_workers = torch.cuda.device_count()\n\n    if args.output is None:\n        args.output = os.path.join(args.load, 'vokens')\n    os.makedirs(args.output, exist_ok=True)\n\n    dset_name = os.path.split(args.corpus)[-1]\n    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])\n    print()\n    print(\"Main Thread: Build a virtual vokenizer to check the number of images.\")\n    keys_dir = args.load + '/keys'  # Save the keys with the model dict\n    virtual_vokenizer = Vokenizer(\n        None, None, keys_dir,\n        img_sets=img_sets, max_img_num=args.max_img_num,\n        gpus=(-1,), sent_level=('sent' in args.load))\n    modifier = f\".{virtual_vokenizer.img_num}\" if virtual_vokenizer.img_num != 50000 else \"\"\n    vokens_path = os.path.join(\n        args.output,\n        f\"{dset_name}.{'_'.join(img_sets)}{modifier}\"\n    )\n    tokens_hdf5_path = f'{args.corpus}.{args.backward_tokenizer_name}.hdf5'\n    tokens_sent_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.sent'\n\n    # \"Load\" tokens from hdf5\n    tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r')\n    tokens = tokens_hdf5['tokens']\n\n    # Calibrate the start line if the vokens have already been (partly) processed.\n    if not os.path.exists(tokens_sent_info_path):\n        tokens_line_info_path = f'{args.corpus}.{args.backward_tokenizer_name}.line'\n        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)\n        segment_sent(\n            tokens,\n            tokenizer,\n            tokens_line_info_path,\n            tokens_sent_info_path\n        )\n\n    # Load sent info and find the start sentence\n    with open(tokens_sent_info_path) as f:\n        sent_starts = list(map(int, f.readlines()))\n\n    # Skip the sentences which have already been vokenized.\n    extracted_tokens = 0\n    if os.path.isfile(vokens_path):\n        with open(vokens_path, 'r') as g:\n            for g_line in tqdm.tqdm(g):\n                extracted_tokens += len(g_line.strip().split(' '))\n    try:\n        start_sent_idx = sent_starts.index(extracted_tokens)\n    except ValueError as e:\n        print(\"The number of extracted tokens does not match any sentence boundary.\")\n        print(e)\n        sys.exit(1)\n\n    # Start to vokenize\n    print(\"Main Thread: Dump visual tokens to %s\" % vokens_path)\n    print(\"Main Thread: Start vokenization from the %d-th token\" % sent_starts[start_sent_idx])\n\n    sent_ranges = []\n    for i in range(start_sent_idx, len(sent_starts) - 1):\n        left_token_idx = sent_starts[i]\n        right_token_idx = sent_starts[i + 1]\n        sent_ranges.append((left_token_idx, right_token_idx))\n\n    setup_mp(args, tokens, sent_ranges, vokens_path)\n\n    # Save into an hdf5 file\n    if os.path.exists(vokens_path + '.hdf5'):\n        print(\"The hdf5 file %s already exists, so it is not converted again.\"\n              % (vokens_path + '.hdf5'))\n    else:\n        with open(args.corpus + '.' 
+ args.backward_tokenizer_name + \".sent\") as f:\n            for i, line in enumerate(f):\n                pass\n            num_tokens = int(line)\n            num_sents = i\n\n        h5_file = h5py.File(vokens_path + '.hdf5', 'w')\n        dset = h5_file.create_dataset(\"vokens\", (num_tokens,), dtype='int32')\n\n        dump_interval = 100000\n        dump_iter = 0\n        lines = 0\n\n        with open(vokens_path) as f:\n            tokens = []\n            for line in tqdm.tqdm(f, total=num_sents):\n                for token in map(int, line.split(' ')):\n                    tokens.append(token)\n                if len(tokens) >= dump_interval:\n                    dset[dump_iter: dump_iter + len(tokens)] = tokens\n                    dump_iter += len(tokens)\n                    tokens = []\n                lines += 1\n            dset[dump_iter: dump_iter + len(tokens)] = tokens\n            dump_iter += len(tokens)\n            assert num_tokens == dump_iter\n            print(lines, num_sents)\n            assert lines == num_sents\n        h5_file.close()\n\n\n"
  },
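  {
    "path": "examples/reducer_ordering_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): a minimal, single-process\ndemo of the re-ordering trick used by the reducer in the vokenization scripts.\nWorkers may finish pages out of order, so the reducer holds results in a\nPriorityQueue keyed by page_id and only commits a page when it is exactly the\nnext expected one. All names and data here are hypothetical.\n\"\"\"\nimport queue\nimport random\n\npage_ids = list(range(10))              # Pretend pages 0..9 finish in a shuffled order.\nrandom.shuffle(page_ids)\n\nheap = queue.PriorityQueue()\nnext_page_id = 0\ncommitted = []\n\nfor page_id in page_ids:\n    heap.put((page_id, 'page-%d' % page_id))\n    # Commit every page that is now contiguous with what has been written.\n    while heap.qsize() > 0:\n        smallest_page_id, payload = heap.get()\n        if smallest_page_id == next_page_id:\n            committed.append(payload)\n            next_page_id += 1\n        else:\n            heap.put((smallest_page_id, payload))   # Not ready yet; put it back.\n            break\n\nassert committed == ['page-%d' % i for i in range(10)]\nprint('All pages committed in order.')\n"
  },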
  {
    "path": "vokenization/vokenization.py",
    "content": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nfrom collections import defaultdict\nimport math\nimport pickle\nimport os\nimport sys\n\nimport h5py\nimport numpy as np\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nfrom transformers import BertTokenizer\n\nimport common\nfrom indexing import TorchGPUIndexer, FaissGPUIndexer\n\nVERY_LARGE = 9595959595\n\n\nclass Vokenizer:\n    def __init__(self, model, tokenizer, keys_dir, img_sets=('coco_minival',),\n                 max_img_num=VERY_LARGE, gpus=(0,), backend='faiss', upper_bound=128,\n                 sent_level=False):\n        \"\"\"\n\n        :param model: Hugginface language model\n        :param tokenizer: Hugginface Tokenizer\n        :param keys_dir: the directory which saves the keys.\n        :param img_sets: the img_sets to be loaded, see common.IMAGE_SETS for all options.\n        :param max_img_num: load up to #max_img_num images into the dictionary\n        :param gpus: The GPUs used in calculating the BERT outputs and indexing.\n                     Note: Currently only one GPU is supported!!!\n        \"\"\"\n        self.model = model.cuda(gpus[0]) if model is not None else model\n        self.tokenizer = tokenizer\n        self.img_sets = img_sets\n        self.gpus = gpus        # The GPUs used in the indexer\n        self.gpu = self.gpus[0]\n        self.backend = backend\n        self.upper_bound = upper_bound\n        self.sent_level = sent_level    # Otherwise use word level\n\n        max_img_num = VERY_LARGE if max_img_num == -1 else max_img_num\n        # These two are important, which indicates the mapping from\n        # vokens to their actual images.\n        self.img_paths = []\n        self.img_ids = []\n        for img_set in self.img_sets:\n            assert img_set in common.IMAGE_SETS, \"%s not in image sets %s\" % (\n                img_set, common.IMAGE_SETS)\n\n            # Load image paths corresponding to the keys.\n            # img_paths_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + \"_paths.txt\")\n            # img_ids_fname = os.path.join(common.LOCAL_DIR, 'images', img_set + \"_ids.txt\")\n            img_paths_fname = os.path.join(keys_dir, f\"{img_set}.path\")\n            img_ids_fname = os.path.join(keys_dir, f\"{img_set}.ids\")\n            if not os.path.exists(img_paths_fname):\n                # If the actual images are not saved on the server, we would use the img_ids.\n                img_paths_fname = img_ids_fname\n            with open(img_paths_fname) as f:\n                all_img_paths = list(map(lambda x: x.strip(), f.readlines()))\n            with open(img_ids_fname) as g:\n                all_img_ids = list(map(lambda x: x.strip(), g.readlines()))\n            assert len(all_img_paths) == len(all_img_ids)\n            for img_path, img_id in zip(all_img_paths, all_img_ids):\n                if len(self.img_paths) < max_img_num:\n                    self.img_paths.append(img_path)\n                    self.img_ids.append(f\"{img_set}/{img_id}\")\n                else:\n                    break\n        assert len(self.img_paths) == len(self.img_ids)\n\n        # Lazy loading and indexing\n        self.keys = None\n        self.keys_dir = keys_dir\n        self.indexed = False\n        self.indexer = None\n\n    @property\n    def img_num(self):\n        return len(self.img_paths)\n\n    def dump_img_ids(self, fname):\n        \"\"\"\n        Dump the mapping from the voken_id to img_ids, to fname.\n        Saved in the format 
of array.\n        \"\"\"\n        with open(fname, 'w') as f:\n            for img_id in self.img_ids:\n                f.write(img_id + \"\\n\")\n\n    def __len__(self):\n        return self.img_num\n\n    def indexing(self):\n        self.model.eval()\n\n        # Load pre-extracted image keys.\n        self.keys = []\n        remain_img_num = self.img_num\n        for img_set in self.img_sets:\n            assert img_set in common.IMAGE_SETS, \"%s not in image sets %s\" % (\n                img_set, common.IMAGE_SETS)\n            keys_fname = os.path.join(self.keys_dir, img_set + '.hdf5')\n            if not os.path.exists(keys_fname):\n                assert False, \"keys of image set %s is not extracted, please save it at %s\" % (\n                    img_set, keys_fname\n                )\n\n            # Load Keys\n            h5_file = h5py.File(keys_fname, 'r')\n            dset = h5_file[\"keys\"]\n            load_img_num = min(remain_img_num, len(dset))\n            load_keys = dset[:load_img_num]\n            self.keys.append(load_keys)\n            remain_img_num -= load_img_num\n            h5_file.close()\n            if load_img_num == 0:\n                break\n\n        # Lazy indexing\n        self.keys = np.concatenate(self.keys, 0)\n        if self.backend == 'torch':\n            self.indexer = TorchGPUIndexer(self.keys, gpus=self.gpus, fp16=True)\n        elif self.backend == 'faiss':\n            self.indexer = FaissGPUIndexer(self.keys, gpus=self.gpus, fp16=True)\n        else:\n            raise NotImplementedError(f\"Backend {self.backend} is not supported\")\n\n        self.indexed = True\n\n    def vokenize_sents(self, sents, topk=None):\n\n        input_ids = []\n        for sent in sents:\n            input_ids.append(self.tokenizer.encode(\n                sent,\n                add_special_tokens=False,\n                # return_tensors='pt'     # Return PyTorch (pt) tensors\n            ))\n        return self.vokenize_ids(input_ids, attention_mask=None, topk=topk)\n\n    def vokenize_ids(self, input_ids, attention_mask=None, topk=None):\n        \"\"\"\n        :param input_ids:  A list of token_ids i.e.,\n                [[token_1_1, token_1_2, ...], [token_2_1, token_2_2, ...], ...]\n        :param attention_mask: I did not use it for now.\n        :param topk: Retrieve the topk vokens for each token.\n        :return: top_scores, top_idxs, input_tokens, top_paths\n            Note: 1. The results would consider the additional special tokens while the input_tokens do **not**.\n                  2. If topk=None, it will be a 2-d results with:\n                         [ [s11_top1, s12_top1, ...],\n                           [s21_top1, s22_top1, ...],\n                           ..... ]\n                     If topk!=None (e.g., 1, 5, 10), it will be a 3-d results with:\n                         [ [ [s11_top1, s11_top2, ...],\n                             [s12_top1, s12_top2, ...],\n                             ...... ],\n                           [ [s21_top1, s21_top2, ...],\n                             [s22_top1, s22_top2, ...],\n                             ...... ],\n                           ..... 
],\n                    where s11_top1 means s1(the 1st sentence)1(the 1st token of the 1st sentence)_top1(the top-1 index)\n        \"\"\"\n        if not self.indexed:        # Index the keys at the first retrieval call.\n            self.indexing()\n\n        # The original tokens\n        input_tokens = [\n            ([self.tokenizer.cls_token] + [self.tokenizer._convert_id_to_token(idx) for idx in input_id] + [self.tokenizer.sep_token])\n            for input_id in input_ids]\n\n        # Deal with over-length tokens (because the BERT-style encoder has length limit due to the positional embedding)\n        # Here is a process to avoid very short sequence when cutting the long sentence:\n        # Suppose the sentence length is 18 and UPPER_BOUND is 8,\n        # we draw it as                         <----------------->, where \"<\" is bos, and \">\" is the last token\n        # instead of cut it as                  <------->------->->, which has very short sequence <-> in the end.\n        # we cut it with almost equal length:   <----->----->----->\n        input_ids = input_ids.copy()\n        sent2segs = defaultdict(list)\n        for i in range(len(input_ids)):\n            if len(input_ids[i]) > self.upper_bound:\n                num_segments = math.ceil(len(input_ids[i]) / self.upper_bound)\n                tokens_per_seg = int(len(input_ids[i]) / num_segments)\n                remaining = input_ids[i][tokens_per_seg:]\n                input_ids[i] = input_ids[i][:tokens_per_seg]\n                while len(remaining) > 0:\n                    # print(len(remaining))\n                    sent2segs[i].append(len(input_ids))\n                    input_ids.append(remaining[:tokens_per_seg])\n                    remaining = remaining[tokens_per_seg:]\n\n        # Convert to torch tensors.\n        if not type(input_ids) is torch.Tensor:\n            input_ids = [\n                torch.tensor(self.tokenizer.build_inputs_with_special_tokens(list(input_id)))\n                for input_id in input_ids\n            ]\n            input_ids = pad_sequence(input_ids,\n                                     batch_first=True,\n                                     padding_value=self.tokenizer.pad_token_id)\n            attention_mask = (input_ids != self.tokenizer.pad_token_id)         # word_tokens --> 1, pad_token --> 0\n            if attention_mask.all():\n                attention_mask = None\n\n        # Get lengths\n        if attention_mask is not None:\n            lengths = list(attention_mask.sum(1).numpy())\n        else:\n            lengths = [len(input_ids[0])] * len(input_ids)\n\n        if attention_mask is not None and type(input_ids) is not torch.Tensor:\n            attention_mask = torch.tensor(attention_mask)\n\n        # Lang model inference\n        input_ids = input_ids.cuda(self.gpu)\n        if attention_mask is not None:\n            attention_mask = attention_mask.cuda(self.gpu)\n\n        def apply_model(input_ids, attention_mask, lengths):\n            with torch.no_grad():\n                lang_output = self.model(input_ids, attention_mask)     # b, l, f\n                if type(lang_output) is list:\n                    lang_output = lang_output[0]\n\n            # Gather language output\n            if self.sent_level:\n                # lang_output of shape [batch_size, dim]\n                gathered_output = lang_output\n            else:\n                # lang_output of shape [batch_size, max_len, dim]\n                # --> gathered_output [ \\sum_i 
len(i), dim]\n                gathered_output = torch.cat([output[:length] for output, length in zip(lang_output, lengths)])\n\n            # Visn retrieval\n            if topk is None:\n                # It will call the function `max()` and return a 2-d tensor\n                top_score, top_idx = self.indexer.batch_top1(gathered_output)\n            else:\n                # It will call the function `topk(k)` and return a 3-d tensor\n                top_score, top_idx = self.indexer.batch_topk(gathered_output, topk=topk)\n\n            return top_score, top_idx\n\n        top_score, top_idx = memory_safe_apply(apply_model, input_ids, attention_mask, lengths)\n\n        # Split\n        top_score, top_idx = top_score.detach().cpu(), top_idx.detach().cpu()\n        if not self.sent_level:\n            # If word level, split it\n            top_scores = list(top_score.split(lengths))       # [ float_tensor(len1), float_tensor(len2), ...]\n            top_idxs = list(top_idx.split(lengths))           # [ int_tensor(len1), int_tensor(len2), ...]\n        else:\n            # If sent level, repeat the voken.\n            #   Use clone() here\n            top_scores = [ts.expand(length, *ts.shape).clone() for ts, length in zip(top_score, lengths)]\n            top_idxs = [tid.expand(length, *tid.shape).clone() for tid, length in zip(top_idx, lengths)]\n\n        if top_idxs[0].dim() == 1:\n            # Return the top1 paths\n            top_paths = [[self.img_paths[idx.item()] for idx in top_idx]\n                         for top_idx in top_idxs]\n        else:\n            # Return the topk paths related to the sentences\n            top_paths = [[[self.img_paths[k_idx.item()] for k_idx in topk_idx]\n                          for topk_idx in top_idx]\n                         for top_idx in top_idxs]\n\n        if self.sent_level:\n            for i, tid in enumerate(top_idxs):\n                # Keep the first positive and others negative, to mark the header of the sentence.\n                # [3] --> [3, 3, 3, 3] --> [-4, -4, -4, -4] --> [3, -4, -4, -4]\n                # \"-x-1\" is used to handle zero, [0] --> [1, 1, 1, 1] --> [-1, -1, -1, -1] --> [0, -1, -1, -1]\n                # print('Before conversion', tid)\n                tid[:] = tid * (-1) - 1\n                tid[1] = tid[1] * (-1) - 1  # The tid[0] is corresponding to <cls>\n                # print('After conversion', top_idxs[i])\n\n        # Put back the segments of over-length sentences\n        if len(sent2segs) > 0:\n            for sent_id, segment_ids in sent2segs.items():\n                for segment_id in segment_ids:\n                    # Append the results with the segments:\n                    #    ---------Now----------------   + ----Appended Segment-----\n                    #    [<cls1> I have a <sep1>][:-1]  + [<cls2> cat . <sep2>][1:]\n                    #  = [<cls1> I have a cat . 
<sep2>]\n                    top_scores[sent_id] = torch.cat([top_scores[sent_id][:-1], top_scores[segment_id][1:]])\n                    top_idxs[sent_id] = torch.cat([top_idxs[sent_id][:-1], top_idxs[segment_id][1:]])\n                    top_paths[sent_id] = top_paths[sent_id][:-1] + top_paths[segment_id][1:]\n            num_sents = len(input_tokens)\n            top_scores = top_scores[:num_sents]\n            top_idxs = top_idxs[:num_sents]\n            top_paths = top_paths[:num_sents]\n\n        return top_scores, top_idxs, input_tokens, top_paths\n\n\ndef memory_safe_apply(func, *args):\n    \"\"\"\n    If batch-wise applying exceeds the GPU memory, it would process each sample separately and sequentially\n    :param func: function with some constraints, see code for details.\n    :param args: args of this function\n    :return:\n    \"\"\"\n    try:\n        return func(*args)\n    except RuntimeError as e:\n        print(e)\n        batch_size = len(args[0])\n        outputs = []\n        for i in range(batch_size):\n            one_batch_args = tuple(a[i: i+1] for a in args)\n            output = func(*one_batch_args)\n            # **output of the func should be of the format**:\n            # (o1, o2, ...) where each o_i is a tensor of shape [1, ...]\n            assert type(output) is tuple or type(output) is list\n            outputs.append(output)\n    # outputs = ( (o1_1, o1_2, ...), (o2_1, o2_2, ...), ...)\n    # zip(*outputs) = ( (o1_1, o2_1, ...), (o1_2, o2_2, ...), ...)\n    outputs = tuple(torch.cat(output) for output in zip(*outputs))\n    return outputs\n\n\ndefault_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n\ndef load_model_and_tokenizer(load, cpu=False):\n    if os.path.exists(load + '/BEST.pth.model'):\n        sys.path.append(load + '/src')\n        for dirc in os.listdir(load + '/src'):\n            sys.path.append(load + '/src/' + dirc)\n        # import model  # The pickle has some issues... thus must load the library\n        if cpu:\n            device = torch.device('cpu')\n            joint_model = torch.load(load + '/BEST.pth.model',\n                                     map_location=device)\n        else:\n            joint_model = torch.load(load + '/BEST.pth.model')\n        joint_model.eval()  # DO NOT FORGET THIS!!!\n    else:\n        print(\"No snapshots there, exit.\")\n        exit()\n\n    if os.path.exists(load + '/tokenizer.pkl'):\n        with open(load + '/tokenizer.pkl', 'rb') as f:\n            tokenizer = pickle.load(f)\n    else:\n        tokenizer = default_tokenizer\n\n    return joint_model.lang_model, tokenizer\n"
  },
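  {
    "path": "examples/split_almost_equal_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): the almost-equal-length\nsplitting scheme described in Vokenizer.vokenize_ids. Instead of chopping a long\nsentence into UPPER_BOUND-sized chunks plus a tiny remainder, the sentence is\nsplit into ceil(len / UPPER_BOUND) segments of near-equal length. The function\nname here is hypothetical.\n\"\"\"\nimport math\n\n\ndef split_almost_equal(token_ids, upper_bound=8):\n    if len(token_ids) <= upper_bound:\n        return [token_ids]\n    num_segments = math.ceil(len(token_ids) / upper_bound)\n    tokens_per_seg = int(len(token_ids) / num_segments)\n    segments = []\n    while len(token_ids) > 0:\n        segments.append(token_ids[:tokens_per_seg])\n        token_ids = token_ids[tokens_per_seg:]\n    return segments\n\n\n# An 18-token sentence with upper_bound=8 becomes three 6-token segments,\n# matching the \"<----->----->----->\" picture in the code comments.\nassert split_almost_equal(list(range(18))) == [\n    list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]\n"
  },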
  {
    "path": "vokenization/vokenize_corpus_mp.py",
    "content": "# coding=utf-8\n# Copyleft 2020 project COL.\n\nimport argparse\nimport copy\nfrom multiprocessing import Queue, Process\nimport os\nimport queue\nimport sys\nimport time\n\nimport h5py\nimport torch\nimport tqdm\nfrom spacy.lang.en import English\n\n# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nfrom vokenization import load_model_and_tokenizer, Vokenizer\n\n# Handle the GPU issue in multi-processing.\nfrom multiprocessing import set_start_method\ntry:\n    set_start_method('spawn')\nexcept RuntimeError:\n    pass\n\n\ndef processer(args, input_queue, output_queue):\n    print(f\"Setup workers on gpu {args.gpus}\")\n    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])\n\n    print(\"Build models and tokenizer\")\n    # We will assign the GPU to model latter, thus load to cpu first!\n    model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)\n    keys_dir = args.load + '/keys'  # Save the keys with the model dict\n\n    print(\"Build Retriever from %s with image sets\" % keys_dir, img_sets)\n    vokenizer = Vokenizer(model, tokenizer, keys_dir,\n                          img_sets=img_sets, max_img_num=args.max_img_num,\n                          gpus=args.gpus, sent_level=('sent' in args.load))\n    print(f\"GPU: {args.gpus}, build vokenizer with {vokenizer.img_num} images.\")\n\n    # Before vokenization, save the image ids\n    dset_name = os.path.split(args.corpus)[-1]\n    modifier = f\".{vokenizer.img_num}\" if vokenizer.img_num != 50000 else \"\"\n    vokens_img_ids_path = os.path.join(\n        args.output,\n        f\"{dset_name}.{'_'.join(img_sets)}{modifier}.ids\"\n    )\n    if args.gpus[0] == 0:\n        if os.path.exists(vokens_img_ids_path):\n            # If the img_ids file exists, assert that they are the same.\n            saved_img_ids = open(vokens_img_ids_path).readlines()\n            img_ids = vokenizer.img_ids\n            assert len(saved_img_ids) == len(img_ids)\n            for saved_img_id, img_id in zip(saved_img_ids, img_ids):\n                assert saved_img_id.strip() == img_id\n        else:\n            vokenizer.dump_img_ids(vokens_img_ids_path)\n\n    while True:\n        page_id, sents = input_queue.get()\n        # Print the first few sents for debugging\n        if args.gpus[0] == 0:\n            if page_id < 12 and sents is not None:\n                print('page_id:', page_id)\n                print('batch_size:', len(sents))\n                print('ids of sent[0]:', sents[0])\n                print('tokens of sent[0]:', tokenizer.convert_ids_to_tokens(sents[0]))\n                print()\n        # print(f\"Processer {args.gpus}: Get Page Id {page_id}\")\n        if sents is not None:\n            output_str = ''\n            results = vokenizer.vokenize_ids(sents)\n            idxs = results[1]\n            for j, idx in enumerate(idxs):\n                assert len(idx[1:-1]) == len(sents[j])\n                dump_idx = map(lambda x: str(x.item()), idx[1:-1])\n                output_str += ' '.join(dump_idx) + '\\n'\n\n            output_queue.put((page_id, output_str))\n        else:\n            break\n\n\ndef reducer(output_fname, output_queue, total_tokens):\n    next_page_id = 0\n    heap = queue.PriorityQueue()\n    output = open(output_fname, 'a')\n    cache = \"\"\n    start_time = None\n    processed_tokens = 0\n\n    while True:\n        page_id, result = output_queue.get()\n        if start_time is None:      # The clock starts to tick when receiving 
the first package.\n            start_time = time.time()\n        # print(\"Reducer: Get Page Id %d\" % page_id)\n        if result is not None:\n            # Put it into the heap\n            heap.put((page_id, result))\n\n            # Check for ready-to-be-dumped data in the queue\n            while heap.qsize() > 0:\n                smallest_page_id, result = heap.get()\n                if smallest_page_id == next_page_id:\n                    # This page is the next expected page, thus dump it.\n                    # print(\"Reducer: Commit Page Id %d\" % next_page_id)\n                    processed_tokens += len(result.split(' '))\n                    cache += result\n                    next_page_id += 1\n                else:\n                    heap.put((smallest_page_id, result))\n                    break\n            # print(\"Reducer: Length of Cache Now\", len(cache))\n            if len(cache) > 1000000:\n                # Dump for every 1000000 characters to reduce IO calls\n                output.write(cache)\n                output.flush()\n                cache = ''\n                used_time = int(time.time() - start_time)\n                print(\"Processed %d tokens, %d to go, at a speed of %0.2f tokens/second; \"\n                      \"will finish in %0.2f hours\" % (\n                    processed_tokens,\n                    total_tokens - processed_tokens,\n                    processed_tokens / used_time,\n                    (total_tokens - processed_tokens) / (processed_tokens / used_time) / 3600\n                ))\n        else:\n            if len(cache) > 0:\n                output.write(cache)\n                output.flush()\n                cache = ''\n            break\n\n    output.close()\n\n\ndef setup_mp(args, tokens, sent_ranges, vokens_path):\n    QUEUE_SIZE = 10000\n    input_queue = Queue(maxsize=QUEUE_SIZE)\n    output_queue = Queue(maxsize=QUEUE_SIZE)\n\n    workers = []\n    num_gpu = torch.cuda.device_count()\n    for worker_id in range(args.num_workers):\n        gpu_id = worker_id % num_gpu\n        curr_args = copy.copy(args)\n        curr_args.gpus = (gpu_id,)\n        worker = Process(target=processer,\n                         args=(curr_args, input_queue, output_queue))\n        worker.daemon = True\n        worker.start()\n        workers.append(worker)\n\n    total_tokens = len(tokens) - sent_ranges[0][0] if len(sent_ranges) > 0 else 0\n    reduce = Process(target=reducer,\n                     args=(vokens_path, output_queue, total_tokens))\n    reduce.start()\n\n    for i, start_id in enumerate(range(0, len(sent_ranges), args.batch_size)):\n        sents = []\n        for left, right in sent_ranges[start_id: start_id + args.batch_size]:\n            sents.append(tokens[left: right])\n        input_queue.put((i, sents))\n\n    # Notify the workers of the end of the input\n    for _ in workers:\n        input_queue.put((-1, None))\n\n    # Wait for the workers to terminate\n    for w in workers:\n        w.join()\n\n    # Notify the reducer of the end of the output\n    output_queue.put((-1, None))\n\n    # Wait for the reducer to terminate\n    reduce.join()\n\n\ndef segment_sent(\n        tokens,\n        tokenizer,\n        tokens_line_info_path,\n        tokens_sent_info_path\n    ):\n    \"\"\"\n    Single-processed segmentation of sentences. We might need to parallelize this as well.
\n    \"\"\"\n    with open(tokens_line_info_path) as f:\n        line_starts = list(map(int, f.readlines()))\n\n    nlp = English()\n    sentencizer = nlp.create_pipe(\"sentencizer\")\n    nlp.add_pipe(sentencizer)\n\n    sent_starts = [0]\n    now = 0\n    print(\"Now, split lines into sentences with spaCy:\")\n    for i in tqdm.tqdm(range(len(line_starts) - 1)):\n        start_token_idx = line_starts[i]\n        end_token_idx = line_starts[i + 1]\n        line_tokens = tokens[start_token_idx: end_token_idx]\n        line = ' '.join(tokenizer.convert_ids_to_tokens(line_tokens))\n        line = line.replace(\"[UNK]\", \"UNK\")\n\n        doc = nlp(line)\n        sents_len = 0\n        sents = []\n        for sent in doc.sents:\n            if i < 2:\n                print(sent)\n            sent = str(sent)\n            sents.append(sent)\n            words = sent.split(' ')\n            sent_len = len(words)\n            now += sent_len\n            sent_starts.append(now)\n            sents_len += sent_len\n\n        if sents_len != len(line_tokens):\n            print(sents_len)\n            print(sents)\n            print(len(line_tokens))\n            print(line)\n            assert False\n        assert sent_starts[-1] == end_token_idx\n\n    with open(tokens_sent_info_path, 'w') as f:\n        for sent_start in sent_starts:\n            f.write(str(sent_start) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Text\n    parser.add_argument('--corpus', type=str, default='/ssd-playpen/data/wiki103/wiki.train.raw')\n    # Models\n    parser.add_argument('--load', type=str,\n                        default='/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/coco_hinge05_dim64_resxt101_robertal4',\n                        help='The directory that saves the model (containing '\n                             'BEST.pth.model).')\n    parser.add_argument('--output', type=str, default=None,\n                        help='The directory to save the extracted vokens. '\n                             '\"None\" would save them in the \"load\" dir.')\n    parser.add_argument('--tokenizer-name', type=str, default='roberta-base')\n    # Vision: Define the vokens set\n    parser.add_argument('--image-sets', type=str, default='vg_nococo',\n                        help='The splits of images to be extracted')\n    parser.add_argument('--max-img-num', type=int, default=50000,\n                        help='number of images used. -1 means all images.')
\n    # Speed Up Options:\n    parser.add_argument('--num-workers', type=int, default=-1,\n                        help='-1 will use all GPUs.')\n    parser.add_argument('--batch-size', type=int, default=16,\n                        help='The # of sentences in a batch.')\n    args = parser.parse_args()\n\n    if args.num_workers == -1:\n        args.num_workers = torch.cuda.device_count()\n\n    if args.output is None:\n        args.output = os.path.join(args.load, 'vokens')\n    os.makedirs(args.output, exist_ok=True)\n\n    dset_name = os.path.split(args.corpus)[-1]\n    img_sets = sorted([img_set.strip() for img_set in args.image_sets.split(',')])\n    print()\n    print(\"Main Thread: Build a virtual vokenizer to check the number of images.\")\n    keys_dir = args.load + '/keys'  # Save the keys with the model dict\n    virtual_vokenizer = Vokenizer(\n        None, None, keys_dir,\n        img_sets=img_sets, max_img_num=args.max_img_num,\n        gpus=(-1,), sent_level=('sent' in args.load))\n    modifier = f\".{virtual_vokenizer.img_num}\" if virtual_vokenizer.img_num != 50000 else \"\"\n    vokens_path = os.path.join(\n        args.output,\n        f\"{dset_name}.{'_'.join(img_sets)}{modifier}\"\n    )\n    tokens_hdf5_path = f'{args.corpus}.{args.tokenizer_name}.hdf5'\n    tokens_sent_info_path = f'{args.corpus}.{args.tokenizer_name}.sent'\n\n    # \"Load\" tokens from hdf5\n    tokens_hdf5 = h5py.File(tokens_hdf5_path, 'r')\n    tokens = tokens_hdf5['tokens']\n\n    # Calibrate the start line if the vokens have already been (partly) processed.\n    if not os.path.exists(tokens_sent_info_path):\n        tokens_line_info_path = f'{args.corpus}.{args.tokenizer_name}.line'\n        model, tokenizer = load_model_and_tokenizer(args.load, cpu=True)\n        segment_sent(\n            tokens,\n            tokenizer,\n            tokens_line_info_path,\n            tokens_sent_info_path\n        )\n\n    # Load sent info and find the start sentence\n    with open(tokens_sent_info_path) as f:\n        sent_starts = list(map(int, f.readlines()))\n\n    # Skip the sentences which have already been vokenized.\n    extracted_tokens = 0\n    if os.path.isfile(vokens_path):\n        with open(vokens_path, 'r') as g:\n            for g_line in tqdm.tqdm(g):\n                extracted_tokens += len(g_line.strip().split(' '))\n    try:\n        start_sent_idx = sent_starts.index(extracted_tokens)\n    except ValueError as e:\n        print(\"The number of extracted tokens does not match any sentence boundary.\")\n        print(e)\n        sys.exit(1)\n\n    # Start to vokenize\n    print(\"Main Thread: Dump visual tokens to %s\" % vokens_path)\n    print(\"Main Thread: Start vokenization from the %d-th token\" % sent_starts[start_sent_idx])\n\n    sent_ranges = []\n    for i in range(start_sent_idx, len(sent_starts) - 1):\n        left_token_idx = sent_starts[i]\n        right_token_idx = sent_starts[i + 1]\n        sent_ranges.append((left_token_idx, right_token_idx))\n\n    setup_mp(args, tokens, sent_ranges, vokens_path)\n\n    # Save into an hdf5 file\n    if os.path.exists(vokens_path + '.hdf5'):\n        print(\"The hdf5 file %s already exists, so it is not converted again.\"\n              % (vokens_path + '.hdf5'))\n    else:\n        with open(args.corpus + '.' 
+ args.tokenizer_name + \".sent\") as f:\n            for i, line in enumerate(f):\n                pass\n            num_tokens = int(line)\n            num_sents = i\n\n        h5_file = h5py.File(vokens_path + '.hdf5', 'w')\n        dset = h5_file.create_dataset(\"vokens\", (num_tokens,), dtype='int32')\n\n        dump_interval = 100000\n        dump_iter = 0\n        lines = 0\n\n        with open(vokens_path) as f:\n            tokens = []\n            for line in tqdm.tqdm(f, total=num_sents):\n                for token in map(int, line.split(' ')):\n                    tokens.append(token)\n                if len(tokens) >= dump_interval:\n                    dset[dump_iter: dump_iter + len(tokens)] = tokens\n                    dump_iter += len(tokens)\n                    tokens = []\n                lines += 1\n            dset[dump_iter: dump_iter + len(tokens)] = tokens\n            dump_iter += len(tokens)\n            assert num_tokens == dump_iter\n            print(lines, num_sents)\n            assert lines == num_sents\n        h5_file.close()\n\n\n"
  },
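  {
    "path": "examples/resume_vokenization_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): how\nvokenize_corpus_mp.py resumes an interrupted run. The vokens file holds one\nspace-separated line per sentence, and the '.sent' file holds cumulative token\noffsets, so counting the tokens already written pinpoints the next sentence to\nprocess. The file contents below are made up for the demo.\n\"\"\"\n# Cumulative sentence boundaries, as stored in the '.sent' file:\n# sentence i covers tokens [sent_starts[i], sent_starts[i + 1]).\nsent_starts = [0, 4, 9, 15, 22]\n\n# Pretend a previous run wrote vokens for the first two sentences.\nvoken_lines = ['101 102 103 104', '7 8 9 10 11']\nextracted_tokens = sum(len(line.split(' ')) for line in voken_lines)\n\n# Resume at the sentence whose start offset equals the token count; a mismatch\n# would mean the vokens file was cut off in the middle of a sentence.\nstart_sent_idx = sent_starts.index(extracted_tokens)\nassert start_sent_idx == 2\nprint('Resume from sentence', start_sent_idx, 'token', sent_starts[start_sent_idx])\n"
  },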
  {
    "path": "xmatching/__init__.py",
    "content": ""
  },
  {
    "path": "xmatching/data.py",
    "content": "# coding=utf-8\nimport json\nfrom pathlib import Path\nimport random\n\nfrom torch.utils.data import Dataset\nfrom torchvision.datasets.folder import default_loader\n\nfrom PIL import Image\nImage.MAX_IMAGE_PIXELS = None\n\nTINY_IMG_NUM = 1000\nFAST_IMG_NUM = 10000\n\nlxrt_imgsplits = {\n    'mscoco_train',\n    'mscoco_nominival',\n    'vgnococo',\n    'mscoco_minival',\n}\nlxrt_langsplits = {\n    'mscoco', 'vg', 'vqa', 'gqa', 'visual7w'\n}\ncc_imgsplits = {\n    'cc_train': 'training.tsv',\n    'cc_valid': 'validation.tsv',\n}\ncc_langsplits = {\n    'cc',\n}\n\nCC_ROOT = 'data/cc'\nCOCO_ROOT = 'data/mscoco'\nVG_ROOT = '/ssd-playpen/data/vg'\nLXRT_ROOT = 'data/lxmert'\n\n\ndef make_uid(img_id, source, sent_id):\n    \"\"\"\n    see the descriptions in function 'make_datum'\n    \"\"\"\n    return \"%s:%s:%s\" % (img_id, source, sent_id)\n\n\ndef get_img_path(source, img_id):\n    if source == 'cc':\n        split_tag, _ = img_id.split('_')\n        return \"%s/images/%s/%s\" % (CC_ROOT, split_tag, img_id)\n    elif 'COCO' in img_id:\n        _, split_tag, _ = img_id.split('_')\n        return \"%s/images/%s/%s\" % (COCO_ROOT, split_tag, img_id + '.jpg')\n    else:   # VG images\n        return \"%s/images/%s.jpg\" % (VG_ROOT, img_id)\n\n\ndef make_datum(source: str, img_id: str, sent_id: int, sent: str):\n    \"\"\"\n    Create a datum from the provided infos.\n    :param source: the dataset of the particular sentence.\n    :param img_id: id of the image\n    :param sent_id: id of the sentence (of the image)\n    :param sent: the sentence\n    :return: a dict of datum\n    \"\"\"\n    uid = make_uid(img_id, source, sent_id)\n    img_path = get_img_path(source, img_id)\n    return {\n        'uid': uid,\n        'img_id': img_id,\n        'img_path': img_path,\n        'sent': sent,\n    }\n\n\nclass ImgSentDataset:\n    def __init__(self, img_splits: str, lang_splits: str, tiny=False, fast=False):\n        \"\"\"\n        :param split: train, valid, test\n        :param sources: The data sources to be loaded, separated by comma.\n                       from: mscoco, cc, vg, vqa, gqa, visual7w\n                             'vg' stands for visual genome captions\n                             'cc' stands for conceptual captions.\n                       example: 'mscoco, vg'\n        \"\"\"\n        self.img_splits = [img_split.lower().strip() for img_split in img_splits.split(',')]\n        self.lang_splits = [lang_split.lower().strip() for lang_split in lang_splits.split(',')]\n        self.data = []\n\n        debug_imgs = -1\n        if tiny:\n            debug_imgs = TINY_IMG_NUM\n        elif fast:\n            debug_imgs = FAST_IMG_NUM\n\n        # Loading LXRT data (i.e., COCO Cap, VQA, GQA, VG Cap, VG QA (visual7w))\n        lxrt_data = []\n        lxrt_path = Path(LXRT_ROOT)\n        for img_split in self.img_splits:\n            if img_split in lxrt_imgsplits:\n                fname = img_split + \".json\"\n                if debug_imgs > 0 and fname != 'mscoco_nominival.json' \\\n                        and fname != 'mscoco_minival.json':  # Only load nominival when debugging\n                    continue\n                lxrt_data.extend(json.load((lxrt_path / fname).open()))\n\n        for i, lxrt_datum in enumerate(lxrt_data):\n            img_id = lxrt_datum['img_id']\n            for lang_split in self.lang_splits:\n                if lang_split in lxrt_datum['sentf']:\n                    sents = lxrt_datum['sentf'][lang_split]\n                    for j, 
sent in enumerate(sents):\n                        self.data.append(make_datum(lang_split, img_id, j, sent))\n                        if debug_imgs > 0:  # Only load one sentence if debugging\n                            break\n            if i+1 == debug_imgs:             # Load top #debug_imgs images\n                break\n\n        # Loading Conceptual Caption (CC) data\n        for img_split in self.img_splits:\n            if img_split in cc_imgsplits:\n                cc_path = Path(CC_ROOT)\n                # cc_imgsplits maps a split to a single tsv file name; iterating\n                # over the string would loop over its characters.\n                fname = cc_imgsplits[img_split]\n                for i, line in enumerate((cc_path / fname).open()):\n                    sent, img_id = line.split('\\t')\n                    self.data.append(make_datum('cc', img_id.strip(), 0, sent))\n                    if i+1 == debug_imgs:\n                        break\n\n    def __len__(self):\n        return len(self.data)\n\n    def __getitem__(self, item):\n        return self.data[item]\n\n    def shuffle(self):\n        random.seed(9595)\n        random.shuffle(self.data)\n\n\nclass ImgSentTorchDataset(Dataset):\n    def __init__(self,\n                 dataset: ImgSentDataset,\n                 img_transform,\n                 tokenizer,\n                 sent_len: int):\n        super().__init__()\n        self.raw_dataset = dataset\n        self.img_transform = img_transform\n        self.tokenizer = tokenizer\n        self.sent_len = sent_len\n\n    def __len__(self):\n        return len(self.raw_dataset)\n\n    def __getitem__(self, item: int):\n        datum = self.raw_dataset[item]\n\n        uid = datum['uid']\n        img_id = datum['img_id']\n        img_path = datum['img_path']\n        sent = datum['sent']\n\n        # Step 1: Load and pre-process the image\n        try:\n            pil_img = default_loader(img_path)\n        except Exception as e:\n            print(e)\n            print(img_path)\n            # Fall back to a nearby datum if this image is unreadable.\n            return self.__getitem__((item + 95) % self.__len__())\n        tensor_img = self.img_transform(pil_img)\n\n        # Step 2: Tokenization (to integers) and Padding\n        encoded_sent = self.tokenizer.encode_plus(\n            sent,\n            add_special_tokens=True,\n            max_length=self.sent_len,\n            truncation=True,\n            # pad_to_max_length=True,\n            padding='max_length',\n            return_tensors='pt'     # Return PyTorch (pt) tensors\n        )\n        input_ids = encoded_sent['input_ids'].squeeze()\n        attention_mask = encoded_sent['attention_mask'].squeeze()\n        # print('sent', sent)\n        # print('input_ids', input_ids)\n        # print('attention_mask', attention_mask)\n\n        return uid, (input_ids, attention_mask, ), (tensor_img, )\n"
  },
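  {
    "path": "examples/uid_and_img_path_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): the uid and image-path\nconventions used by xmatching/data.py, re-implemented standalone so the formats\nare easy to see. The helper names and the example id are made up for the demo.\n\"\"\"\nCOCO_ROOT = 'data/mscoco'\n\n\ndef make_uid(img_id, source, sent_id):\n    # 'img_id:source:sent_id', e.g. 'COCO_train2014_000000000009:mscoco:0'\n    return '%s:%s:%s' % (img_id, source, sent_id)\n\n\ndef coco_img_path(img_id):\n    # COCO file names embed their split: COCO_<split>_<number>.jpg\n    _, split_tag, _ = img_id.split('_')\n    return '%s/images/%s/%s.jpg' % (COCO_ROOT, split_tag, img_id)\n\n\nimg_id = 'COCO_train2014_000000000009'\nassert make_uid(img_id, 'mscoco', 0) == 'COCO_train2014_000000000009:mscoco:0'\nassert coco_img_path(img_id) == 'data/mscoco/images/train2014/COCO_train2014_000000000009.jpg'\n"
  },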
  {
    "path": "xmatching/frozen_batch_norm.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\n# Note: This file is copied from https://github.com/facebookresearch/detectron2/blob/master/detectron2/layers/batch_norm.py\n# to avoid any future change from that project.\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\n\nclass FrozenBatchNorm2d(nn.Module):\n    \"\"\"\n    BatchNorm2d where the batch statistics and the affine parameters are fixed.\n    It contains non-trainable buffers called\n    \"weight\" and \"bias\", \"running_mean\", \"running_var\",\n    initialized to perform identity transformation.\n    The pre-trained backbone models from Caffe2 only contain \"weight\" and \"bias\",\n    which are computed from the original four parameters of BN.\n    The affine transform `x * weight + bias` will perform the equivalent\n    computation of `(x - running_mean) / sqrt(running_var) * weight + bias`.\n    When loading a backbone model from Caffe2, \"running_mean\" and \"running_var\"\n    will be left unchanged as identity transformation.\n    Other pre-trained backbone models may contain all 4 parameters.\n    The forward is implemented by `F.batch_norm(..., training=False)`.\n    \"\"\"\n\n    _version = 3\n\n    def __init__(self, num_features, eps=1e-5):\n        super().__init__()\n        self.num_features = num_features\n        self.eps = eps\n        self.register_buffer(\"weight\", torch.ones(num_features))\n        self.register_buffer(\"bias\", torch.zeros(num_features))\n        self.register_buffer(\"running_mean\", torch.zeros(num_features))\n        self.register_buffer(\"running_var\", torch.ones(num_features) - eps)\n\n    def forward(self, x):\n        if x.requires_grad:\n            # When gradients are needed, F.batch_norm will use extra memory\n            # because its backward op computes gradients for weight/bias as well.\n            scale = self.weight * (self.running_var + self.eps).rsqrt()\n            bias = self.bias - self.running_mean * scale\n            scale = scale.reshape(1, -1, 1, 1)\n            bias = bias.reshape(1, -1, 1, 1)\n            return x * scale + bias\n        else:\n            # When gradients are not needed, F.batch_norm is a single fused op\n            # and provide more optimization opportunities.\n            return F.batch_norm(\n                x,\n                self.running_mean,\n                self.running_var,\n                self.weight,\n                self.bias,\n                training=False,\n                eps=self.eps,\n            )\n\n    def __repr__(self):\n        return \"FrozenBatchNorm2d(num_features={}, eps={})\".format(self.num_features, self.eps)\n\n    @classmethod\n    def convert_frozen_batchnorm(cls, module):\n        \"\"\"\n        Convert BatchNorm/SyncBatchNorm in module into FrozenBatchNorm.\n        Args:\n            module (torch.nn.Module):\n        Returns:\n            If module is BatchNorm/SyncBatchNorm, returns a new module.\n            Otherwise, in-place convert module and return it.\n        Similar to convert_sync_batchnorm in\n        https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py\n        \"\"\"\n        bn_module = nn.modules.batchnorm\n        bn_module = (bn_module.BatchNorm2d, bn_module.SyncBatchNorm)\n        res = module\n        if isinstance(module, bn_module):\n            res = cls(module.num_features)\n            if module.affine:\n                res.weight.data = module.weight.data.clone().detach()\n      
          res.bias.data = module.bias.data.clone().detach()\n            res.running_mean.data = module.running_mean.data\n            res.running_var.data = module.running_var.data\n            res.eps = module.eps\n        else:\n            for name, child in module.named_children():\n                new_child = cls.convert_frozen_batchnorm(child)\n                if new_child is not child:\n                    res.add_module(name, new_child)\n        return res"
  },
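  {
    "path": "examples/frozen_bn_equivalence_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): a numerical check that\nthe folded affine transform in xmatching/frozen_batch_norm.py matches\nF.batch_norm(..., training=False), as the class docstring claims. All tensors\nhere are random stand-ins.\n\"\"\"\nimport torch\nfrom torch.nn import functional as F\n\nnum_features, eps = 4, 1e-5\nx = torch.randn(2, num_features, 3, 3)\nweight = torch.randn(num_features)\nbias = torch.randn(num_features)\nrunning_mean = torch.randn(num_features)\nrunning_var = torch.rand(num_features) + 0.5\n\n# Folded form: x * scale + shift, with the BN statistics baked into scale/shift.\nscale = weight * (running_var + eps).rsqrt()\nshift = bias - running_mean * scale\nfolded = x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)\n\n# Reference: the fused inference-mode batch norm.\nreference = F.batch_norm(x, running_mean, running_var, weight, bias,\n                         training=False, eps=eps)\n\nassert torch.allclose(folded, reference, atol=1e-5)\nprint('Folded affine matches F.batch_norm(training=False).')\n"
  },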
  {
    "path": "xmatching/loss.py",
    "content": "import torch\n\n\ndef hinge(x):\n    return torch.clamp(x, min=0.)\n\n\ndef paired_hinge_rank_loss(\n        lang_output: torch.Tensor,\n        visn_output: torch.Tensor,\n        lang_mask: torch.Tensor,\n        margin: float,\n):\n    \"\"\"\n    Consider the first half as positive and the second half as negative.\n    :param lang_output: [batch_size, max_len, hid_dim]\n    :param visn_output: [batch_size, hid_dim]\n    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.\n    :param margin: margin in the ranking loss\n    :return: a scalar loss\n    \"\"\"\n    batch_size, lang_len, dim = lang_output.shape\n    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]\n    assert margin > 0.\n\n    # Expand the visn_output to match each word\n    visn_output = visn_output.unsqueeze(1)      # [b, 1, hid_dim]\n\n    # Split to positive and negative sets.\n    half_batch_size = batch_size // 2\n    pos_lang, neg_lang = torch.split(lang_output, half_batch_size, dim=0)\n    pos_visn, neg_visn = torch.split(visn_output, half_batch_size, dim=0)\n\n    # Calculate positive and negative scores.\n    true_pos_score = (pos_lang * pos_visn).sum(-1)           # [batch_size / 2, max_len]\n    true_neg_score = (neg_lang * neg_visn).sum(-1)           # [batch_size / 2, max_len]\n    false_pos_score = (pos_lang * neg_visn).sum(-1)          # [batch_size / 2, max_len]\n    false_neg_score = (neg_lang * pos_visn).sum(-1)          # [batch_size / 2, max_len]\n\n    # Hinge Loss\n    float_lang_mask = lang_mask.type(lang_output.dtype)      # Either fp16 or fp32\n    pos_lang_mask, neg_lang_mask = torch.split(float_lang_mask, half_batch_size, dim=0)\n    pos_loss = hinge(margin - true_pos_score + false_pos_score) * pos_lang_mask\n    neg_loss = hinge(margin - true_neg_score + false_neg_score) * neg_lang_mask\n\n    # Averaging\n    cnt = float_lang_mask.sum()    # Number of words.\n    loss = (pos_loss.sum() + neg_loss.sum()) / cnt\n\n    return loss\n\n\ndef batchwise_hinge_rank_loss(\n        lang_output: torch.Tensor,\n        visn_output: torch.Tensor,\n        lang_mask: torch.Tensor,\n        margin: float,\n):\n    \"\"\"\n    Consider all un-matched pairs in the batch as negative samples.\n    :param lang_output: [batch_size, max_len, hid_dim]\n    :param visn_output: [batch_size, hid_dim]\n    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.\n    :param margin: margin in the ranking loss\n    :return: a scalar loss\n    \"\"\"\n    batch_size, lang_len, dim = lang_output.shape\n    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]\n    assert margin > 0.\n\n    # Expand the visn_output to match each word\n    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]\n\n    # The score of positive pairs\n    positive_score = (lang_output * visn_output.unsqueeze(1)).sum(-1)    # [b, max_len]\n\n    # The score of negative pairs. 
Note that the diagonal is actually the positive score,\n    # but it would be zero-graded in calculating the loss below.\n    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *\n                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]\n    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)\n\n    # Calculate of the hinge rank loss, let me explain why it works:\n    # For the diagonal, the scores are for positive, we thus create a positive_mask to neglect these scores.\n    #   max(0., margin - x^T x + (x^T x - 2 margin) )\n    # = max(0., -margin)\n    # = 0.      , since we have made sure that margin > 0\n    # During backwards, the operator max(0., -margin) would raise a grad of 0 to the operand \"-margin\",\n    #   thus it is just what we want.\n    float_lang_mask = lang_mask.type(lang_output.dtype)       # Either fp16 or fp32\n    positive_mask = torch.eye(batch_size)\n    negative_scores = negative_scores - positive_mask.unsqueeze(-1) * margin * 2\n    lang_loss = hinge(margin - positive_score.unsqueeze(1) + negative_scores) * float_lang_mask.unsqueeze(1)\n    visn_loss = hinge(margin - positive_score.unsqueeze(0) + negative_scores) * float_lang_mask.unsqueeze(1)\n\n    # Averaging\n    # Each sentence is duplicated by batch_size thus the total length is also multiplied by this term.\n    cnt = max(float_lang_mask.sum() * batch_size, 1.)    # Number of words.\n    lang_loss = lang_loss.sum() / cnt\n    visn_loss = visn_loss.sum() / cnt\n\n    return lang_loss + visn_loss\n\n\n"
  },
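  {
    "path": "examples/hinge_diagonal_trick_sketch.py",
    "content": "\"\"\"\nIllustrative sketch (not part of the original codebase): a tiny numeric check\nof the diagonal trick in batchwise_hinge_rank_loss. Subtracting 2 * margin from\nthe diagonal (positive-pair) scores turns each diagonal hinge term into\nmax(0, -margin) = 0, so matched pairs add nothing to the loss. A single vector\nper sentence (max_len = 1) is used to keep the demo small.\n\"\"\"\nimport torch\n\nmargin = 0.5\nbatch_size, dim = 3, 4\nlang = torch.randn(batch_size, dim)     # One vector per sentence.\nvisn = torch.randn(batch_size, dim)\n\npositive = (lang * visn).sum(-1)        # [b], matched-pair scores\nnegative = lang @ visn.t()              # [b, b], all pairs; the diagonal equals `positive`\nnegative = negative - torch.eye(batch_size) * margin * 2\n\nloss_terms = torch.clamp(margin - positive.unsqueeze(1) + negative, min=0.)\n# On the diagonal: margin - s + (s - 2 * margin) = -margin < 0, clamped to 0.\nassert torch.allclose(loss_terms.diagonal(), torch.zeros(batch_size))\nprint('Diagonal hinge terms are exactly zero:', loss_terms.diagonal())\n"
  },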
  {
    "path": "xmatching/main.py",
    "content": "import collections\nimport os\nimport pickle\nimport sys\n\nimport torch\nimport torch.multiprocessing as mp\nimport torchvision.transforms as transforms\nimport torch.nn as nn\nimport torch.distributed as dist\nimport tqdm\nfrom transformers import BertTokenizer\n\nsys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nfrom xmatching.data import ImgSentDataset, ImgSentTorchDataset\nfrom xmatching.loss import paired_hinge_rank_loss\nfrom xmatching.metric import batchwise_accuracy, batchwise_recall\nfrom xmatching.model import LangModel, VisnModel, JointModel, LANG_MODELS\nfrom xmatching.param import parse_args\n\n\ndef is_port_in_use(port):\n    import socket\n    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:\n        return s.connect_ex(('localhost', port)) == 0\n\n\ndef main():\n    os.environ['MASTER_ADDR'] = '127.0.0.1'\n    port = 9595\n    while is_port_in_use(port):\n        port += 1\n    print(\"Use port\", port)\n    os.environ['MASTER_PORT'] = str(port)\n\n    # Using all available gpus for multi-processing distributed\n    args = parse_args()\n    args.gpus = torch.cuda.device_count()\n    print(\"Use gpus \", list(range(args.gpus)))\n    args.world_size = args.gpus * args.nodes\n    # mp.spawn(setup, nprocs=args.gpus, args=(args,))\n    # args.world_size = args.gpus * args.nodes\n    mp.spawn(train, nprocs=args.gpus, args=(args,))\n\n\ndef train(gpu, args):\n    device = torch.device('cuda', gpu)\n    rank = args.nr * args.gpus + gpu\n    dist.init_process_group(\n        backend='nccl',\n        init_method='env://',\n        world_size=args.world_size,\n        rank=rank\n    )\n\n    # Models\n    lang_layers = list(map(lambda x: -int(x), args.lang_layers.split(',')))     # The layers concated as the output.\n    lang_model = LangModel(args.dim, arch=args.lang, layers=lang_layers,\n                           pretrained=args.lang_pretrained, finetuning=args.lang_finetune)\n    visn_model = VisnModel(args.dim, arch=args.visn,\n                           pretrained=args.visn_pretrained, finetuning=args.visn_finetune)\n    # The use of joint model would help synchronization in distributed learning.\n    model = JointModel(lang_model, visn_model)\n\n    # Since we will disallow the broadcast of buffers in DDP\n    # we want make sure that there are no buffers besides batch normalization and position id.\n    for name, buffer in model.named_buffers():\n        assert 'bn' in name or 'downsample' in name or \"position_ids\" in name\n\n    if args.load is not None:\n        state_dict = torch.load(args.load, map_location=device)\n        new_state_dict = {}\n        for key, value in state_dict.items():        # If the ddp state_dict is saved\n            if 'num_batches_tracked' not in key:\n                if key.startswith(\"module.\"):\n                    new_state_dict[key[len(\"module.\"):]] = state_dict[key]\n                else:\n                    new_state_dict[key] = state_dict[key]\n        model_keys = set(model.state_dict().keys())\n        load_keys = set(new_state_dict.keys())\n        print(\"Keys in model but not in load:\")\n        for key in sorted(model_keys - load_keys):\n            print(key)\n        print(\"Keys in load but not in model:\")\n        for key in sorted(load_keys - model_keys):\n            print(key)\n        model.load_state_dict(new_state_dict)\n\n    # Pre-processing Hyper-Params\n    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n                                     
    # Pre-processing Hyper-Params\n    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n                                     std=[0.229, 0.224, 0.225])\n    train_transform = transforms.Compose([\n        transforms.RandomResizedCrop(224),\n        transforms.RandomHorizontalFlip(),\n        transforms.ToTensor(),\n        normalize\n    ])\n    valid_transform = transforms.Compose([\n        transforms.Resize(256),\n        transforms.CenterCrop(224),\n        transforms.ToTensor(),\n        normalize\n    ])\n    Model, Tokenizer, weight = LANG_MODELS[args.lang]\n    tokenizer = Tokenizer.from_pretrained(weight)\n    max_len = args.max_len\n\n    # Dump the pre-processing objects for later feature extraction.\n    if gpu == 0:\n        pickle.dump(tokenizer, open(\n            os.path.join(args.output, 'tokenizer.pkl'), 'wb'))\n        pickle.dump(valid_transform, open(\n            os.path.join(args.output, 'img_transform.pkl'), 'wb'))\n\n    # Data Sets\n    train_set = ImgSentDataset(args.train_imgs, args.train_langs, tiny=args.tiny, fast=args.fast)\n    train_tset = ImgSentTorchDataset(\n        train_set, train_transform, tokenizer, max_len\n    )\n    print(\"GPU %d: loaded %d training examples.\" % (gpu, len(train_set)))\n    valid_set = ImgSentDataset(args.valid_imgs, args.valid_langs, tiny=args.tiny, fast=args.fast)\n    valid_set.shuffle()         # The valid set is shuffled only once!\n    print(\"GPU %d: loaded %d validation examples.\" % (gpu, len(valid_set)))\n    valid_tset = ImgSentTorchDataset(\n        valid_set, valid_transform, tokenizer, max_len\n    )\n    print()\n\n    # Data Loader\n    train_sampler = torch.utils.data.distributed.DistributedSampler(\n        train_tset,\n        num_replicas=args.world_size,\n        rank=rank,\n        shuffle=True,\n    )\n    train_loader = torch.utils.data.DataLoader(\n        dataset=train_tset,\n        batch_size=(args.batch_size // args.world_size),\n        shuffle=False,          # Shuffling is handled by the sampler.\n        num_workers=max(args.num_workers // args.world_size, 1),\n        pin_memory=True,\n        sampler=train_sampler,\n        drop_last=True\n    )\n\n    valid_loader = torch.utils.data.DataLoader(\n        dataset=valid_tset,\n        batch_size=256,             # Fix the batch size to keep batchwise evaluations stable.\n        shuffle=False,\n        num_workers=args.num_workers,\n        pin_memory=True,\n        drop_last=True\n    )\n\n    if args.optim == 'bert':\n        from transformers import AdamW, get_linear_schedule_with_warmup\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n        optimizer_grouped_parameters = [\n            {\n                \"params\": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.01,\n            },\n            {\n                \"params\": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.0,\n            },\n        ]\n        optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr, eps=1e-8)\n        t_total = len(train_loader) * args.epochs\n        warmup_steps = int(t_total * args.warmup_ratio)\n        print(\"Train for %d steps and warm up for %d steps\" % (t_total, warmup_steps))\n        scheduler = get_linear_schedule_with_warmup(\n            optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total\n        )\n    else:\n        params = [param for param in model.parameters() if param.requires_grad]\n        if args.optim == 'sgd':\n            optimizer = args.optimizer(params, args.lr, momentum=0.9)\n        else:\n            optimizer = args.optimizer(params, args.lr)\n
\n    # Loss and optimizer\n    criterion = paired_hinge_rank_loss\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n\n    if args.fp16:\n        try:\n            from apex import amp\n            from apex.parallel import DistributedDataParallel as DDP\n            model, optimizer = amp.initialize(model, optimizer,\n                                              opt_level='O2')\n            # By default, the current apex DDP does not broadcast buffers.\n            model = DDP(model)\n        except Exception as e:\n            print(e)\n            print(\"Please install the apex library to use fp16.\")\n            return\n    else:\n        # Note that we disallow broadcasting buffers here to reduce communication cost.\n        model = nn.parallel.DistributedDataParallel(\n            model,\n            device_ids=[gpu],\n            find_unused_parameters=True,\n            broadcast_buffers=False,\n        )\n\n    if args.test_only or args.load:     # Test the performance of the loaded model.\n        if gpu == 0:\n            print(\"Test: GPU %d will test %d data in %d iterations.\" %\n                  (gpu, len(valid_loader) * 256, len(valid_loader)))\n            results = valid(args, model, criterion, valid_loader)\n            print(\"Initial test results:\")\n            for key, value in results.items():\n                print('\\t%s: %0.4f' % (key, value))\n        if args.test_only:\n            exit()\n\n    best_valid_loss = 9595.\n    for epoch in range(args.epochs):\n        if gpu == 0:\n            print(\"Training of Epoch %d: GPU %d will process %d data in %d iterations.\" %\n                  (epoch, gpu, len(train_loader) * args.batch_size // args.world_size, len(train_loader)))\n        total_loss = 0.\n        for uid, lang_input, visn_input in tqdm.tqdm(train_loader, disable=(gpu != 0)):\n            # lang_input is the tuple (input_ids, attention_mask); visn_input is (img_tensor,).\n            lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)\n            visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)\n\n            # Forward pass\n            model.zero_grad()\n            lang_output, visn_output = model(lang_input, visn_input)\n            loss = criterion(lang_output, visn_output, lang_input[1], args.margin)\n            total_loss += loss.item()\n\n            # Backward\n            if args.fp16:\n                with amp.scale_loss(loss, optimizer) as scaled_loss:\n                    scaled_loss.backward()\n            else:\n                loss.backward()\n\n            # Step\n            if args.fp16:\n                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 5.)\n            else:\n                torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)\n            optimizer.step()\n            if args.optim == 'bert':\n                scheduler.step()\n        if gpu == 0:\n
print(\"GPU %d Epoch %d: Total Training Loss %0.4f\" % (gpu, epoch, total_loss / len(train_loader)))\n            print()\n            print(\"Validation: GPU %d will process %d data in %d iterations.\" %\n                  (gpu, len(valid_loader) * 256, len(valid_loader)))\n            results = valid(args, model, criterion, valid_loader, use_tqdm=True)\n            for key, value in results.items():\n                print('\\t%s: %0.4f' % (key, value))\n            if results['loss'] < best_valid_loss:\n                best_valid_loss = results['loss']\n                snap_path = os.path.join(args.output, 'BEST.pth')\n                print(\"GPU 0: Save snapshot to \", snap_path)\n                torch.save(model.module.state_dict(), snap_path)\n                torch.save(model.module, snap_path + '.model')\n            print(\"BEST valid loss %0.4f\" % best_valid_loss)\n            print()\n\n\ndef valid(args, model, criterion, valid_loader, use_tqdm=True):\n    model.eval()\n    results = collections.defaultdict(lambda: 0)\n    iterator = tqdm.tqdm(valid_loader) if use_tqdm else valid_loader\n    for i, (uid, lang_input, visn_input) in enumerate(iterator):\n        # Currently, lang_input is the (input_ids, attention_mask)\n        # visn_input is (tensor_img)\n        lang_input = tuple(x.cuda(non_blocking=True) for x in lang_input)\n        visn_input = tuple(x.cuda(non_blocking=True) for x in visn_input)\n\n        with torch.no_grad():\n            # Forward pass\n            lang_output, visn_output = model(lang_input, visn_input)\n\n            # Evaluation\n            results['loss'] += criterion(lang_output, visn_output, lang_input[1], args.margin).item()\n            recall_results = batchwise_recall(lang_output, visn_output, lang_input[1], recalls=(1, 5, 10))\n            for key, value in recall_results.items():\n                results['R%d' % key] += value\n\n    for key in results:\n        results[key] = results[key] / len(valid_loader)\n    model.train()\n\n    return results\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "xmatching/metric.py",
    "content": "import torch\n\n\ndef batchwise_accuracy(lang_output, visn_output, lang_mask):\n    \"\"\"\n    Calculate the accuracy of contextual word retrieval, average by batch.\n    :param lang_output: [batch_size, max_len, hid_dim]\n    :param visn_output: [batch_size, hid_dim]\n    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.\n    :return:\n    \"\"\"\n    batch_size, lang_len, dim = lang_output.shape\n    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]\n\n    # Expand the visn_output to match each word\n    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]\n\n    # The score of negative pairs. Note that the diagonal is actually the positive score,\n    # but it would be zero-graded in calculating the loss below.\n    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *\n                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]\n    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)\n\n    max_neg_score, max_neg_idx = negative_scores.max(1)        # [batch, max_len], the batch_idx of max-aligned img\n    pos_idx = torch.arange(0, batch_size, dtype=torch.int64).to(lang_output.device)\n\n    correct = (pos_idx.unsqueeze(1) == max_neg_idx)\n    bool_lang_mask = lang_mask.type(correct.dtype)\n    correct = correct * bool_lang_mask\n    correct_num = correct.sum()\n\n    accuracy = correct_num * 1. / bool_lang_mask.sum()\n\n    return accuracy\n\n\ndef batchwise_recall(lang_output, visn_output, lang_mask, recalls=(1,)):\n    \"\"\"\n    Calculate the accuracy of contextual word retrieval, average by batch.\n    :param lang_output: [batch_size, max_len, hid_dim]\n    :param visn_output: [batch_size, hid_dim]\n    :param lang_mask: Int Tensor [batch_size, max_len], 1 for tokens, 0 for paddings.\n    :param recall: a list, which are the number of recalls to be evaluated.\n    :return:\n    \"\"\"\n    batch_size, lang_len, dim = lang_output.shape\n    assert batch_size % 2 == 0 and batch_size == visn_output.shape[0]\n\n    # Expand the visn_output to match each word\n    visn_output = visn_output.unsqueeze(1)                  # [b, 1, dim]\n\n    # The score of positive pairs\n    positive_score = (lang_output * visn_output).sum(-1)    # [b, max_len]\n\n    # The score of negative pairs. Note that the diagonal is actually the positive score,\n    # but it would be zero-graded in calculating the loss below.\n    negative_scores = (lang_output.reshape(batch_size, 1, lang_len, dim) *\n                       visn_output.reshape(1, batch_size, 1, dim)).sum(-1)    # [b(lang), b(visn), max_len]\n    # negative_scores = torch.einsum('ikd,jd->ijk', lang_output, visn_output)\n\n    result = {}\n    for recall in recalls:\n        kthscore, kthidx = torch.kthvalue(negative_scores, batch_size - recall, dim=1)     # [b, max_len]\n        # print(kthscore.shape) print(positive_score.shape)\n        correct = (positive_score >= kthscore)                                # [b, max_len]\n        bool_lang_mask = lang_mask.type(correct.dtype)\n        correct = correct * bool_lang_mask\n        correct_num = correct.sum()\n        # print(correct_num)\n        # print(bool_lang_mask.sum())\n        result[recall] = (correct_num * 1. / bool_lang_mask.sum()).item()\n\n    return result\n"
  },
  {
    "path": "xmatching/model.py",
    "content": "import torch\nfrom torch import nn\nimport torchvision.models as models\nfrom transformers import *\n\nfrom .frozen_batch_norm import FrozenBatchNorm2d\n\n\nLANG_MODELS = {\n          'bert':    (BertModel,       BertTokenizer,       'bert-base-uncased'),\n          'bert-large':  (BertModel,       BertTokenizer,       'bert-large-uncased'),\n          'gpt':     (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),\n          'gpt2':    (GPT2Model,       GPT2Tokenizer,       'gpt2'),\n          'ctrl':    (CTRLModel,       CTRLTokenizer,       'ctrl'),\n          'xl':      (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),\n          'xlnet':   (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),\n          'xlm':     (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),\n          'distil':  (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),\n          'roberta': (RobertaModel,    RobertaTokenizer,    'roberta-base'),\n          'xlm-roberta': (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),\n}\n\n\ndef get_visn_arch(arch):\n    try:\n        return getattr(models, arch)\n    except AttributeError as e:\n        print(e)\n        print(\"There is no arch %s in torchvision.\" % arch)\n\n\nclass VisnModel(nn.Module):\n    def __init__(self, dim, arch='resnet50', pretrained=True, finetuning=False):\n        \"\"\"\n        :param dim: dimension of the output\n        :param arch: backbone architecture,\n        :param pretrained: load feature with pre-trained vector\n        :param finetuning: finetune the model\n        \"\"\"\n        super().__init__()\n        self.finetuning = finetuning\n\n        # Setup Backbone\n        resnet = get_visn_arch(arch)(pretrained=pretrained)\n        backbone_dim = resnet.fc.in_features\n        if not self.finetuning:\n            for param in resnet.parameters():\n                param.requires_grad = False\n        resnet.fc = nn.Identity()\n        self.backbone = resnet\n\n        # Surgery on the Networks\n        # 1. Frozen Batch Norm\n        #    Note that BatchNorm modules have been in-place replaced!\n        #    This piece of code is copied from Detectron2, and it was copied from mask-rcnn?\n        self.backbone = FrozenBatchNorm2d.convert_frozen_batchnorm(\n            self.backbone)\n        # print(self.backbone)\n        # 2. 
        # 2. Freeze the first conv layer and the first residual block.\n        for module in [self.backbone.conv1,\n                       self.backbone.layer1]:\n            for param in module.parameters():\n                param.requires_grad = False\n\n        print(f\"Visn Model: {arch}, Finetune: {finetuning}, Pre-trained: {pretrained}\")\n        print(f\"Visn Model: backbone dim {backbone_dim} --> output dim {dim}\")\n\n        # Setup follow-up layers\n        self.mlp = nn.Sequential(\n            nn.Linear(backbone_dim, 256),\n            nn.ReLU(),\n            nn.Dropout(0.3),\n            nn.Linear(256, dim),\n        )\n\n    def forward(self, img):\n        \"\"\"\n        :param img: a tensor of shape [batch_size, C, H, W]\n        :return: an L2-normalized tensor of shape [batch_size, dim]\n        \"\"\"\n        if not self.finetuning:\n            with torch.no_grad():\n                x = self.backbone(img)\n                x = x.detach()\n        else:\n            x = self.backbone(img)\n        x = self.mlp(x)         # [b, dim]\n        x = x / x.norm(2, dim=-1, keepdim=True)\n        return x\n\n\nclass LangModel(nn.Module):\n    def __init__(self, dim, arch='bert', layers=(-1,), pretrained=True, finetuning=False):\n        \"\"\"\n        :param dim: dimension of the output embedding\n        :param arch: key of the language backbone in LANG_MODELS\n        :param layers: indices of the hidden layers whose outputs are concatenated\n        :param pretrained: initialize the backbone with pre-trained weights\n        :param finetuning: fine-tune the backbone (otherwise its parameters are frozen)\n        \"\"\"\n        super().__init__()\n        self.finetuning = finetuning\n\n        # Setup Backbone\n        Model, Tokenizer, weight = LANG_MODELS[arch]\n        bert = Model.from_pretrained(\n            weight,\n            output_hidden_states=True\n        )\n        if not pretrained:\n            bert.init_weights()\n\n        if not self.finetuning:\n            for param in bert.parameters():\n                param.requires_grad = False\n        backbone_dim = bert.config.hidden_size\n        self.backbone = bert\n        self.layers = sorted(layers)\n\n        print(f\"Language Model: {arch} with weight {weight}; Fine-tuning: {finetuning}, Pre-trained: {pretrained}.\")\n        print(f\"Language Model: using layers {self.layers}, resulting in backbone dim {backbone_dim * len(self.layers)} \"\n              f\"--> output dim {dim}.\")\n\n        # Setup follow-up layers\n        self.mlp = nn.Sequential(\n            nn.Linear(backbone_dim * len(self.layers), 256),\n            nn.ReLU(),\n            nn.Dropout(0.3),\n            nn.Linear(256, dim),\n        )\n\n    def forward(self, input_ids, attention_mask, token_type_ids=None):\n        \"\"\"\n        :param input_ids: [batch_size, max_len]\n        :param attention_mask: [batch_size, max_len]\n        :param token_type_ids: [batch_size, max_len]\n        :return: [batch_size, max_len, dim]\n        \"\"\"\n        if not self.finetuning:\n            with torch.no_grad():\n                x = self.backbone(\n                    input_ids,\n                    attention_mask=attention_mask,\n                    token_type_ids=token_type_ids,\n                )\n        else:\n            x = self.backbone(\n                input_ids,\n                attention_mask=attention_mask,\n                token_type_ids=token_type_ids,\n            )\n\n        # The backbone returns (sequence_output, pooled_output, hidden_states, ...);\n        # XLNet returns no pooled output. We only need the hidden states.\n        if type(self.backbone) is XLNetModel:\n            output, hidden_states = x[:2]\n        else:\n            output, pooled_output, hidden_states = x[:3]\n
\n        # Gather and concatenate the selected hidden layers\n        # (XLNet hidden states are [max_len, batch_size, dim], hence the permute).\n        if type(self.backbone) is XLNetModel:\n            x = torch.cat(list(hidden_states[layer].permute(1, 0, 2) for layer in self.layers), -1)\n        else:\n            x = torch.cat(list(hidden_states[layer] for layer in self.layers), -1)\n\n        if not self.finetuning:\n            x = x.detach()\n\n        # [batch_size, max_len, backbone_dim] -->\n        # [batch_size, max_len, output_dim]\n        x = self.mlp(x)\n        x = x / x.norm(2, dim=-1, keepdim=True)\n        return x\n\n\nclass JointModel(nn.Module):\n    def __init__(self, lang_model, visn_model):\n        super().__init__()\n        self.lang_model = lang_model\n        self.visn_model = visn_model\n\n    def forward(self, lang_input, visn_input):\n        lang_output = self.lang_model(*lang_input)\n        visn_output = self.visn_model(*visn_input)\n        return lang_output, visn_output\n
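\n\nif __name__ == \"__main__\":\n    # Minimal end-to-end shape check (an illustrative sketch, not part of training;\n    # from_pretrained downloads the 'bert-base-uncased' weights on first use).\n    lang_model = LangModel(dim=64, arch='bert', layers=(-1,), pretrained=True, finetuning=False)\n    visn_model = VisnModel(dim=64, arch='resnet18', pretrained=False, finetuning=False)\n    model = JointModel(lang_model, visn_model)\n    input_ids = torch.randint(0, 1000, (2, 20))\n    attention_mask = torch.ones(2, 20, dtype=torch.int64)\n    img = torch.randn(2, 3, 224, 224)\n    lang_output, visn_output = model((input_ids, attention_mask), (img,))\n    print(lang_output.shape, visn_output.shape)     # [2, 20, 64] and [2, 64]\n"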
  },
  {
    "path": "xmatching/param.py",
    "content": "# coding=utf-8\n# Copyleft 2020 project COL.\n# Copyleft 2019 project LXRT.\n\nimport argparse\nimport random\n\nimport numpy as np\nimport torch\n\n\ndef get_optimizer(optim):\n    # Bind the optimizer\n    if optim == 'rms':\n        # print(\"Optimizer: Using RMSProp\")\n        optimizer = torch.optim.RMSprop\n    elif optim == 'adam':\n        # print(\"Optimizer: Using Adam\")\n        optimizer = torch.optim.Adam\n    elif optim == 'adamax':\n        # print(\"Optimizer: Using Adamax\")\n        optimizer = torch.optim.Adamax\n    elif optim == 'sgd':\n        # print(\"Optimizer: sgd\")\n        optimizer = torch.optim.SGD\n    elif 'bert' in optim:\n        optimizer = 'bert'      # The bert optimizer will be bind later.\n    else:\n        assert False, \"Please add your optimizer %s in the list.\" % optim\n\n    return optimizer\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser()\n\n    # Data Splits\n    parser.add_argument(\"--sources\", default='mscoco', help=\"mscoco, cc, vg, vqa, gqa, visual7w\")\n    parser.add_argument(\"--train-imgs\", default='mscoco_train,mscoco_nominival,vg_nococo')\n    parser.add_argument(\"--valid-imgs\", default='mscoco_minival')\n    parser.add_argument(\"--train-langs\", default='mscoco',\n                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w.'\n                             'split by comma')\n    parser.add_argument(\"--valid-langs\", default='mscoco',\n                        help='Some of mscoco, cc, vg, vqa, gqa, visual7w.'\n                             'split by comma')\n    parser.add_argument(\"--test\", default=None)\n    parser.add_argument(\"--test-only\", action='store_true')\n\n    # Datasets Configuration\n    parser.add_argument(\"--fast\", action='store_true')\n    parser.add_argument(\"--tiny\", action='store_true')\n    parser.add_argument(\"--max-len\", default=20, type=int)\n\n    # Training Hyper-parameters\n    parser.add_argument('--batchSize', dest='batch_size', type=int, default=256)\n    parser.add_argument('--optim', default='bert')\n    parser.add_argument('--lr', type=float, default=1e-4)\n    parser.add_argument('--warmup-ratio', type=float, default=0.05)\n    parser.add_argument('--epochs', type=int, default=10)\n    parser.add_argument('--dropout', type=float, default=0.1)\n    parser.add_argument('--seed', type=int, default=9595, help='random seed')\n    parser.add_argument(\"--fp16\", action='store_true')\n\n    # Model Hyper-parameters\n    parser.add_argument('--visn', type=str, default='resnext101_32x8d', help='The vision backbone model.')\n    parser.add_argument('--lang', type=str, default='bert', help='The language backbone model.')\n    parser.add_argument('--lang-layers', type=str, default='-1', help='The language backbone model.')\n    parser.add_argument('--dim', type=int, default=64, help='The output dim of the joint emb.')\n\n    # Model Loading\n    parser.add_argument('--load', type=str, default=None,\n                        help='Load the model (usually the fine-tuned model).')\n    parser.add_argument('--lang-finetune', action='store_true', help='finetune the language encoder.')\n    parser.add_argument('--visn-finetune', action='store_true', help='finetune the visual encoder.')\n    parser.add_argument('--lang-pretrained', action='store_true', help='Use the pre-trained language encoder.')\n    parser.add_argument('--visn-pretrained', action='store_true', help='Use the pre-trained visual encoder.')\n\n    # Optimization\n    
parser.add_argument(\"--margin\", default=0.5, type=float, help='The margin in the hinge losses.')\n    parser.add_argument(\"--loss\", dest='loss', default='paired_hinge',\n                        type=str)\n\n    # Training configuration\n    parser.add_argument(\"--num-workers\", default=0, type=int)\n    parser.add_argument('--output', type=str, default='snap/test')\n\n    # Distributed Training Configuration\n    parser.add_argument('-n', '--nodes', default=1,\n                        type=int, metavar='N')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type=int,\n                        help='ranking within the nodes')\n\n    # Parse the arguments.\n    args = parser.parse_args()\n\n    # Bind optimizer class.\n    args.optimizer = get_optimizer(args.optim)\n\n    # Set seeds\n    torch.manual_seed(args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    return args\n\n\n# args = parse_args()\n"
  }
]