[
  {
    "path": "MultiModal-CVPR2021.md",
    "content": "## Multi-modal learning paper in CVPR2021\n\nthe Navigation of [CVPR 2021 papers](https://blog.kitware.com/demos/cvpr-2021-papers/)\n\n\n### Text-to-Image Generation\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|XMC-GAN| Cross-Modal Contrastive Learning for Text-to-Image Generation | [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_Cross-Modal_Contrastive_Learning_for_Text-to-Image_Generation_CVPR_2021_paper.html)| CVPR 2021 | Google Research|\n\n### Autonomous Driving \n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|MVDNet|Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals   |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Qian_Robust_Multimodal_Vehicle_Detection_in_Foggy_Weather_Using_Complementary_Lidar_CVPR_2021_paper.pdf) [code](https://github.com/qiank10/MVDNet)| CVPR 2021 | University of California SanDiego |\n|1|-| Multi-Modal Fusion Transformer for End-to-End Autonomous Driving | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Prakash_Multi-Modal_Fusion_Transformer_for_End-to-End_Autonomous_Driving_CVPR_2021_paper.pdf) | CVPR 2021 | Max Planck Institute for Intelligent Systems| \n\n### Navigation\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|VLN|Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals   |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Qian_Robust_Multimodal_Vehicle_Detection_in_Foggy_Weather_Using_Complementary_Lidar_CVPR_2021_paper.pdf) [code](https://github.com/qiank10/MVDNet)| CVPR 2021 | University of California SanDiego |\n|1|SSM | Structured Scene Memory for Vision-Language Navigation| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Structured_Scene_Memory_for_Vision-Language_Navigation_CVPR_2021_paper.pdf) | CVPR 2021 | Beijing Institute of Technology |\n\n\n\n### OCR\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|-|Semantic-Aware Video Text Detection |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Feng_Semantic-Aware_Video_Text_Detection_CVPR_2021_paper.pdf) | CVPR 2021 | National Laboratory of Pattern Recognition |\n|1| TRBA | What If We Only Use Real Datasets for Scene Text Recognition?Toward Scene Text Recognition With Fewer Labels | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Baek_What_if_We_Only_Use_Real_Datasets_for_Scene_Text_CVPR_2021_paper.pdf) [code](https://github.com/ku21fan/STR-Fewer-Labels) | CVPR 2021 | The University of Tokyo|\n|2| Multiplexed TextSpotter | A Multiplexed Network for End-to-End, Multilingual OCR| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_A_Multiplexed_Network_for_End-to-End_Multilingual_OCR_CVPR_2021_paper.pdf)| CVPR 2021 | Facebook AI|\n|3|STKM | Self-attention based Text Knowledge Mining for Text Detection | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wan_Self-Attention_Based_Text_Knowledge_Mining_for_Text_Detection_CVPR_2021_paper.pdf) | CVPR 2021 | Shenzhen University |\n|4| TextOCR | TextOCR: Towards large-scale end-to-end reasoningfor arbitrary-shaped scene text | CVPR 2021 | Facebook AI Research|\n\n### Video Moment Retreival\n|No.  |Model Name |Title |Links |Pub. 
| Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|-| Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zeng_Multi-Modal_Relational_Graph_for_Cross-Modal_Video_Moment_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Hunan University|\n\n### Video-Audio-Text \n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|| How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Duarte_How2Sign_A_Large-Scale_Multimodal_Dataset_for_Continuous_American_Sign_Language_CVPR_2021_paper.pdf) [dataset](http://how2sign.github.io/) | CVPR 2021 | Universitat Politècnica de Catalunya|\n\n### Image&Language\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|| Image Change Captioning by Learning from an Auxiliary Task|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf)  | CVPR 2021 |University of Manitoba|\n|1| UC^2 | UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.pdf) | CVPR 2021 | University of California, Davis|\n|2|-| How Transferable are Reasoning Patterns in VQA?|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Kervadec_How_Transferable_Are_Reasoning_Patterns_in_VQA_CVPR_2021_paper.pdf)  [code](https://reasoningpatterns.github.io) | CVPR 2021 |INSA Lyon|\n|3|M3P | M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training |  [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Ni_M3P_Learning_Universal_Representations_via_Multitask_Multilingual_Multimodal_Pre-Training_CVPR_2021_paper.pdf) | CVPR 2021 | HiT |\n|4| CC12M | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Changpinyo_Conceptual_12M_Pushing_Web-Scale_Image-Text_Pre-Training_To_Recognize_Long-Tail_Visual_CVPR_2021_paper.pdf)  | CVPR 2021 | Google Research|\n|5| - | Separating Skills and Concepts for Novel Visual Question Answering| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Whitehead_Separating_Skills_and_Concepts_for_Novel_Visual_Question_Answering_CVPR_2021_paper.pdf) | CVPR 2021 |UIUC | \n|6| VinVL | VinVL: Revisiting Visual Representations in Vision-Language Models | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.pdf) [code](https://github.com/pzzhang/VinVL) | CVPR 2021 | Microsoft |\n|7| -| Domain-Robust VQA with diverse datasets and methods but no target labels | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Domain-Robust_VQA_With_Diverse_Datasets_and_Methods_but_No_Target_CVPR_2021_paper.pdf) | CVPR 2021 | University of Pittsburgh |\n|8| PCME | Probabilistic Embeddings for Cross-Modal Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Chun_Probabilistic_Embeddings_for_Cross-Modal_Retrieval_CVPR_2021_paper.pdf) [code](https://github.com/naver-ai/pcme) | CVPR 2021 | NAVER AI Lab|\n|9| -| Thinking Fast and Slow: Efficient Text-to-Visual 
Retrieval with Transformers |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Miech_Thinking_Fast_and_Slow_Efficient_Text-to-Visual_Retrieval_With_Transformers_CVPR_2021_paper.pdf)| CVPR 2021 | DeepMind|\n|10|TAP| TAP: Text-Aware Pre-training for Text-VQA and Text-Caption| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Yang_TAP_Text-Aware_Pre-Training_for_Text-VQA_and_Text-Caption_CVPR_2021_paper.pdf)| CVPR 2021 | University of Rochester|\n|11| Causal Attention| Causal Attention for Vision-Language Tasks| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Yang_Causal_Attention_for_Vision-Language_Tasks_CVPR_2021_paper.pdf) [code](https://github.com/yangxuntu/lxmertcatt) | CVPR 2021 | Nanyang Technological University, Singapore|\n|12| VirTex | VirTex: Learning Visual Representations from Textual Annotations | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.pdf) | CVPR 2021 | University of Michigan |\n|13| -| Predicting Human Scanpaths in Visual Question Answering | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Predicting_Human_Scanpaths_in_Visual_Question_Answering_CVPR_2021_paper.pdf) | CVPR 2021 | University of Minnesota | \n|14| Kaleido-BERT| Kaleido-BERT: Vision-Language Pre-training on Fashion Domain | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhuge_Kaleido-BERT_Vision-Language_Pre-Training_on_Fashion_Domain_CVPR_2021_paper.pdf) [code](http://dpfan.net/Kaleido-BERT) | CVPR 2021 | Alibaba Group |\n|15| -| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Seeing_Out_of_the_Box_End-to-End_Pre-Training_for_Vision-Language_Representation_CVPR_2021_paper.pdf) | CVPR 2021 | University of Science and Technology Beijing|\n|16| -| Learning by Planning: Language-Guided Global Image Editing|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Shi_Learning_by_Planning_Language-Guided_Global_Image_Editing_CVPR_2021_paper.pdf) [code](https://github.com/jshi31/T2ONet) | CVPR 2021 | University of Rochester|\n|17| KRISP| KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Marino_KRISP_Integrating_Implicit_and_Symbolic_Knowledge_for_Open-Domain_Knowledge-Based_VQA_CVPR_2021_paper.pdf) [code](https://github.com/facebookresearch/krisp) | CVPR 2021 | Facebook AI Research |\n|18| -| Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Liu_Adaptive_Cross-Modal_Prototypes_for_Cross-Domain_Visual-Language_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Peking University|\n\n\n\n\n### Video&Text\n|No.  |Model Name |Title |Links |Pub. 
| Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0| ClipBERT | Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling  | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lei_Less_Is_More_ClipBERT_for_Video-and-Language_Learning_via_Sparse_Sampling_CVPR_2021_paper.pdf) [code](https://github.com/jayleicn/ClipBERT) | CVPR 2021 | UNC|\n|1| -| SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xu_SUTD-TrafficQA_A_Question_Answering_Benchmark_and_an_Efficient_Network_for_CVPR_2021_paper.pdf) [code](https://github.com/SUTDCV/SUTD-TrafficQA) | CVPR 2021 | Singapore University of Technology and Design |\n|2| -| Open-book Video Captioning with Retrieve-Copy-Generate Network| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Open-Book_Video_Captioning_With_Retrieve-Copy-Generate_Network_CVPR_2021_paper.pdf) | CVPR 2021 | Institute of Automation, Chinese Academy of Sciences|\n|3| NExT-QA| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xiao_NExT-QA_Next_Phase_of_Question-Answering_to_Explaining_Temporal_Actions_CVPR_2021_paper.pdf) [code](https://github.com/doc-doc/NExT-QA.git) | CVPR 2021 | National University of Singapore |\n|4|AGQA| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Grunde-McLaughlin_AGQA_A_Benchmark_for_Compositional_Spatio-Temporal_Reasoning_CVPR_2021_paper.pdf) | CVPR 2021 | Stanford University|\n|5| -| Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Park_Bridge_To_Answer_Structure-Aware_Graph_Interaction_Network_for_Video_Question_CVPR_2021_paper.pdf) | CVPR 2021 |Yonsei University, South Korea|\n|6| -| Look Before you Speak: Visually Contextualized Utterances | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) | CVPR 2021 | Google Research|\n\n### 3D Cross-Modal Retrieval\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0| -| Cross-Modal Center Loss for 3D Cross-Modal Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Jing_Cross-Modal_Center_Loss_for_3D_Cross-Modal_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | The City University of New York|\n\n\n### Video-to-Text Generation\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0| Vx2Text |VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lin_Vx2Text_End-to-End_Learning_of_Video-Based_Text_Generation_From_Multimodal_Inputs_CVPR_2021_paper.pdf) | CVPR 2021 | Columbia University |\n\n\n### Image-to-Video Synthesis\n|No.  |Model Name |Title |Links |Pub. 
| Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|cINNs| Stochastic Image-to-Video Synthesis using cINNs|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Dorkenwald_Stochastic_Image-to-Video_Synthesis_Using_cINNs_CVPR_2021_paper.pdf)  | CVPR 2021 |Heidelberg University|\n|1| |Understanding Object Dynamics for Interactive Image-to-Video Synthesis|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Blattmann_Understanding_Object_Dynamics_for_Interactive_Image-to-Video_Synthesis_CVPR_2021_paper.pdf) [code](https://bit.ly/3cxfA2L) | CVPR 2021 | Heidelberg University|\n\n\n### Audio&Visual\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|-| Can audio-visual integration strengthen robustness under multimodal attacks?|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Tian_Can_Audio-Visual_Integration_Strengthen_Robustness_Under_Multimodal_Attacks_CVPR_2021_paper.pdf)  | CVPR 2021 |University of Rochester|\n|1| -| Audio-Visual Instance Discrimination with Cross-Modal Agreement| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Morgado_Audio-Visual_Instance_Discrimination_with_Cross-Modal_Agreement_CVPR_2021_paper.pdf) | CVPR 2021 | UC San Diego|\n|2| -|VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Gao_VisualVoice_Audio-Visual_Speech_Separation_With_Cross-Modal_Consistency_CVPR_2021_paper.pdf) [code](http://vision.cs.utexas.edu/projects/VisualVoice/) | CVPR 2021 | The University of Texas at Austin|\n\n### Language-Guided Video Actor Segmentation\n|No.  |Model Name |Title |Links |Pub. | Organization| \n|-----|:-----:|:-----:|:-----:|:--------:|:---:|\n|0|-| Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Hui_Collaborative_Spatial-Temporal_Modeling_for_Language-Queried_Video_Actor_Segmentation_CVPR_2021_paper.pdf) | CVPR 2021 |Chinese Academy of Sciences|\n\n\n"
  },
  {
    "path": "NLP-transformer.md",
    "content": "# Natural Language Processing Transformer\r\n\r\n\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|BERT|BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |[paper](https://arxiv.org/abs/1810.04805) [code](https://github.com/google-research/bert) |__NAACL 2019__|Google|Oct 2018|\r\n|2|GPT3|Language Models are Few-Shot Learners|[paper](https://arxiv.org/abs/2005.14165) | __NeuRIPS 2020__ | OpenAI | May 2020|\r\n|3|GPT2|Language Models are Unsupervised Multitask Learners |[paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) [code](https://github.com/openai/gpt-2)|__arXiv__ | OpenAI | Feb 2019|\r\n|4| RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Approach | [paper](https://arxiv.org/abs/1907.11692) | __arXiv__ |Facebook AI | Jul 2019|\r\n|5| XLNet |XLNet: Generalized Autoregressive Pretraining for Language Understanding|  [paper](https://arxiv.org/abs/1906.08237) [code](https://github.com/zihangdai/xlnet) |__NeuRIPS 2019__| Google | Jun 2019|\r\n"
  },
  {
    "path": "README.md",
    "content": "# Reading list in Transformer\n \n\nThis repo is aimed to collect all the recent popular Transformer paper, codes and learning resources with respect to the domains of **Vision Transformer**, **NLP** and **multi-modal**, etc. \n\n\n\n\n### Topics (paper and code)\n- [Image Transformer](image-transformer.md) \n\n\n- [Video Transformer](video-transformer.md)\n\n\n- [Video & Language & other modality Transformer](video-language-transformer.md)\n\n\n- [Image & language & other modlity Trasformer](image-language-transformer.md)\n\n\n- [Natural Language Processing Transformer](NLP-transformer.md)\n\n\n- [Efficient Transformer](efficiency-transformer.md)\n\n- [model compression](vision_model_compression.md)\n\n- [Self Supverpervised Learning in Vision](Self-supervised_learning.md)\n\n<!-- - [MLP for Image Classification](MLP-mixer.md) -->\n\n- [other interested papers in related domains](other_interesting_paper.md)\n\n\nReview Paper in multi-modal  \n- [Video-language](paper-review.md)\n\n\n### Tutorials and workshop\n- [Cross-View and Cross-Modal Visual Geo-Localization: IEEE CVPR 2021 Tutorial](https://youtube.com/playlist?list=PLUgbVHjDharjTo9tk3xcPJHEkmi33ap-u)\n\n- [From VQA to VLN: Recent Advances in Vision-and-Language Research: IEEE CVPR 2021 Tutorial](https://youtube.com/playlist?list=PLUgbVHjDhari645g1zmpo-MtOVap1FKxh)\n\n- [Tutorial on MultiModal Machine Learning: IEEE CVPR 2022 Tutorial](https://cmu-multicomp-lab.github.io/mmml-tutorial/cvpr2022/)\n\n\n\n### Datasets\n- [Multi-modal Datasets](datasets.md)\n\n\n### Blogs\n- [Lil's blogs](https://lilianweng.github.io/lil-log/)\n- \n\n### Tools\n- [PyTorchVideo](https://pytorchvideo.org/) a deep learning library for video understanding research\n\n- [horovod](https://github.com/horovod/horovod) a tool for multi-gpu parallel processing\n\n- [accelerate](https://huggingface.co/docs/accelerate/) an easy API for mixed precision and any kind of distributed computing\n\n- [hyperparameter search: optuna](https://optuna.org/)\n\n- [AI Conference Deadlines](https://aideadlin.es/)\n\n"
  },
  {
    "path": "Self-supervised_learning.md",
    "content": "# this will collect many papers that relates to self-supervied learning in vision domains.\n\n\nSelf-supervised learning for Images\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1|iGPT |\tGenerative Pretraining from Pixels |[paper](http://proceedings.mlr.press/v119/chen20s/chen20s.pdf) [code](https://github.com/openai/image-gpt) |__ICML 2021__|OpenAI|17 June 2020|\n|2| MST | MST: Masked Self-Supervised Transformer for Visual Representation | [paper](https://arxiv.org/pdf/2106.05656.pdf) | __NeurIPS 2021__|Chinese Academy of Sciences| 10 June 2021|\n|3|BEiT| BEiT: BERT Pre-Training of Image Transformers| [paper](https://arxiv.org/abs/2106.08254) [code](https://github.com/microsoft/unilm/tree/master/beit) | __ICLR 2022__|Microsoft Research| 15 June 2021|\n|4| MAE | Masked Autoencoders Are Scalable Vision Learners| [paper](https://arxiv.org/pdf/2111.06377.pdf) [code](https://github.com/facebookresearch/mae)| CVPR 2022| Meta | 19 Dec 2021|\n|5| iBoT | iBOT: Image BERT Pre-Training with Online Tokenizer| [paper](https://arxiv.org/pdf/2111.07832.pdf) [code](https://github.com/bytedance/ibot) | ICLR 2022 | ByteDance |15 Nov 2021| \n|6| SimMIM| SimMIM: A Simple Framework for Masked Image Modeling | [paper](https://arxiv.org/pdf/2111.09886.pdf) [code](https://github.com/microsoft/SimMIM) | arXiv| MSRA| 18 Nov 2021| \n|7| PeCo | \tPeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | [paper](https://arxiv.org/pdf/2111.12710.pdf) |arXiv|  Univeristy of Science and Technology of China | 24 Nov 2021|\n|8| MaskFeat | \tMasked Feature Prediction for Self-Supervised Visual Pre-Training | [paper](https://arxiv.org/pdf/2112.09133.pdf) | arXiv | Meta | 16 Dec 2021|\n|9| SplitMask | Are Large-scale Datasets Necessary for Self-Supervised Pre-training? 
| [paper](https://arxiv.org/pdf/2112.10740.pdf) | arXiv | Meta | 20 Dec 2021 | \n|10| ADIOS | Adversarial Masking for Self-Supervised Learning| [paper](https://arxiv.org/pdf/2201.13100.pdf) | ICML 2022 | University of Oxford | 31 Jan 2022|\n|11| CAE | Context Autoencoder for Self-Supervised Representation Learning | [paper](https://arxiv.org/pdf/2202.03026.pdf) | arXiv | Peking University | 7 Feb 2022 |\n|12| CIM| Corrupted Image Modeling for Self-Supervised Visual Pre-Training| [paper](https://arxiv.org/pdf/2202.03382.pdf) [code](https://github.com/microsoft/unilm) | arXiv | Microsoft | 7 Feb 2022|\n|13| ConvMAE | ConvMAE: Masked Convolution Meets Masked Autoencoders |[paper](https://arxiv.org/pdf/2205.03892.pdf) [code](https://github.com/Alpha-VL/ConvMAE) | arXiv | Shanghai AI Laboratory |  19 May 2022 |\n|14 | uniform masking | Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality | [paper](https://arxiv.org/pdf/2205.10063.pdf)  [code](https://github.com/implus/UM-MAE) | arXiv | Nanjing University of Science and Technology | 20 May 2022|\n|15| LoMaR | Efficient self-supervised learning with local masked reconstruction | [paper](https://arxiv.org/pdf/2206.00790.pdf) [code](https://github.com/junchen14/LoMaR) | arXiv| KAUST | 1 Jun 2022 |\n|16| M3AE | Multimodal Masked Autoencoders Learn Transferable Representations | [paper](https://arxiv.org/pdf/2205.14204.pdf) | arXiv | UCB | 31 May 2022|\n|17| HiViT| HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling | [paper](https://arxiv.org/pdf/2205.14949.pdf) | arXiv | University of Chinese Academy of Sciences | 30 May 2022 |\n|18 | GreenMiM |  Green Hierarchical Vision Transformer for Masked Image Modeling | [paper](https://arxiv.org/pdf/2205.13515v1.pdf) [code](https://github.com/LayneH/GreenMIM) | arXiv | The University of Tokyo | 26 May 2022 | \n|19| A^2MIM |Architecture-Agnostic Masked Image Modeling – From ViT back to CNN   | [paper](https://arxiv.org/pdf/2205.13943.pdf) | arXiv | AI Lab, Westlake University | 1 Jun 2022|\n|20 | MixMIM | MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning | [paper](https://arxiv.org/pdf/2205.13137.pdf) [code](https://github.com/Sense-X/MixMIM) | arXiv | SenseTime Research |  28 May 2022 |\n|21 | SemMAE |SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders | [paper](https://arxiv.org/pdf/2206.10207.pdf) | arXiv | Chinese Academy of Sciences| 21 Jun 2022|\n|22 | Voxel-MAE | Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds | [paper](https://arxiv.org/pdf/2206.09900.pdf) [code](https://github.com/chaytonmin/Voxel-MAE) | arXiv | Peking University | 20 Jun 2022|\n|23 | BootMAE |Bootstrapped Masked Autoencoders for Vision BERT Pretraining| [paper](https://arxiv.org/pdf/2207.07116.pdf) [code](https://github.com/LightDXY/BootMAE) | ECCV 2022 | University of Science and Technology of China | 14 Jul 2022|\n|24 | OmniMAE | OmniMAE: Single Model Masked Pretraining on Images and Videos| [paper](https://arxiv.org/pdf/2206.08356.pdf) [code](https://github.com/facebookresearch/omnivore) | arXiv | Meta AI | 16 Jun 2022|\n|25 | SatMAE| SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery| [paper](https://arxiv.org/pdf/2207.08051.pdf) | arXiv | Stanford University | 17 Jul 2022 |\n|26 | CMAE | Contrastive Masked Autoencoders are Stronger Vision Learners | [paper](https://arxiv.org/abs/2207.13532) | arXiv | University of Science and Technology | 27 Jul 2022 |\n|27| BEiT v2 
| BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | [paper](https://arxiv.org/pdf/2208.06366.pdf) | arXiv| University of Chinese Academy of Sciences | 12 Aug 2022|\n|28| BEiT v3| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | [paper](https://arxiv.org/abs/2208.10442) | arXiv | Microsoft Corporation | 22 Aug 2022 |\n\n\nSelf-supervised Learning for Videos\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1| VideoMAE| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | [paper](https://arxiv.org/abs/2203.12602) [code](https://github.com/MCG-NJU/VideoMAE) | arXiv |  Tencent AI Lab | 23 Mar 2022 |\n|2|MAE in Video| Masked Autoencoders As Spatiotemporal Learners | [paper](https://arxiv.org/pdf/2205.09113.pdf) | arXiv | Meta | 18 May 2022 |\n\n\nSelf-supervised Learning for Audios\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1| AudioMAE| Masked Autoencoders that Listen | [paper](https://arxiv.org/pdf/2207.06405v1.pdf)  [code](https://github.com/facebookresearch/AudioMAE) | arXiv |  Meta AI | 13 Jul 2022|\n\n\nSurvey in self-supervised learning\n|No.  |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:--------:|:---:|:-------:|\n|1|A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond | [paper](https://arxiv.org/pdf/2208.00173.pdf) |arXiv| KAIST| 30 Jul 2022|\n\n"
  },
  {
    "path": "datasets.md",
    "content": "# Common multimodal datasets\n\n## Image Datasets\n[COCO](https://cocodataset.org/#home)\\\n[conceptual 3M](https://ai.google.com/research/ConceptualCaptions/)\\\n[coenceptual 12M](https://github.com/google-research-datasets/conceptual-12m)\n\n## Video&language  Dataset\n|Dataset |paper| Clips |Captions |Videos |Duration | Source| Year |  Tasks| collection method|\n|-----|:-----:|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|:-------:|:-------:|\n|[Chalades](https://prior.allenai.org/projects/charades) | [paper](https://openreview.net/forum?id=rJW3ItWubH)|10K | 16K |10,000 | 82h|daily household videos|2016| action recoginition & captioning| AMT|\\\n|[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |[paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf) |  10k| 200k| 7,180| 40h| web-crawed videos with 257 queries |2016| retreival and captioning | AMT|\\\n|[Didemo](https://github.com/LisaAnne/LocalizingMoments)| [paper](https://arxiv.org/pdf/1708.01641.pdf) | 27k| 41k| 10,464| 87h| randomly select over 14,000 videos from YFCC100M| 2017| Moment localization| crowdsoucing|\\\n|[M-VAD](https://github.com/aimagelab/mvad-names-dataset) | [paper](https://arxiv.org/pdf/1503.01070.pdf) |49k| 56k| 92| 84h| DVD movies| 2015| retreival |crowdsourcing| \\\n| [MPII-MD](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/mpii-movie-description-dataset) | [paper](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Rohrbach_A_Dataset_for_2015_CVPR_paper.pdf)| 69k| 68k| 94| 41h|Web Movies| 2015| captioning| crowdsourcing |\\\n|[ActivityNet](http://activity-net.org/)| [paper](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Heilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf)|  100k | 100k | 20,000 | 849h| online human activities videos| 2017|captioning & retrieval| AMT |\\\n| [TGIF](http://raingo.github.io/TGIF-Release/) | [paper](https://arxiv.org/pdf/1604.02748.pdf)| 69k| 68k| 94| 41h| a year’s worth of GIF posts from Tumblr| 2015| captioning| CrowdFlower|\\\n[YouCook2](http://youcook2.eecs.umich.edu/download) |[paper](http://youcook2.eecs.umich.edu/static/YouCookII/youcookii_readme.pdf) |14k| 14k| 2,000| 176h| online cooking videos| 2018| retreival & captioning| well-trained native English speakers |\\\n|[LSMDC](https://sites.google.com/site/describingmovies/download) |[paper](https://arxiv.org/pdf/1605.03705.pdf) | 128k| 128k| 200| 150h| comination of M-VAD and MPII-MD datasets |2017 | captioning| /|\\\n[HowTo100M](https://github.com/antoine77340/howto100m) | [paper](https://arxiv.org/pdf/1906.03327.pdf)| 136M| 136M| 1.221M| 134,472h| large-scaled online videos| 2019| action step localization & retreival | ASR|\n[Kinetics-700](https://deepmind.com/research/open-source/kinetics) |[paper](https://arxiv.org/abs/1907.06987)| 650K| /| 650K| /| an extension of kinetics-700 dataset |2019| action recoginition| /|\\\n[AVA-Kinetics](https://deepmind.com/research/open-source/kinetics) |[paper](https://arxiv.org/abs/2005.00214) | 230K| /| 230K| /| combines the annotation style of AVA and kinetics dataset| 2020| action recoginition|/ |\\\n[HACS]( http://hacs.csail.mit.edu/) |[paper]( https://arxiv.org/abs/1712.09374) | 1.5M| /| 504K| /| large scale human action localization dataset| 2019| action recoginition&captioning| crowdsourcing|\\\n[Tiny-Virat]( 
https://github.com/UgurDemir/Tiny-VIRAT) |[paper]( https://arxiv.org/abs/2007.07355) |  13K| /| 13K| /| low-resolution action recognition dataset (surveillance videos) |2020| action recognition| /|\n|Action Genome |[paper]( https://arxiv.org/abs/1912.06992) | 234K| /| 234K| /| video scene graph| 2020| action recognition & representations encoding event partonomies| crowdsourcing|\n|[SoccerNet]( https://silviogiancola.github.io/SoccerNet) |[paper]( https://arxiv.org/pdf/1804.04527.pdf) | 650K| 764h| 650K| /| European Football League video| 2018| event classification in football game video| transformed from the data from league websites|\n|[ActivityNet Entities]( http://t.cn/EfePohM) |[paper]( https://arxiv.org/abs/1812.06587) | 650K| /| 650K| /| ground the visual entity with the activitynet video objects| 2018| video understanding & action recognition| crowdsourcing|\n|[VidSitu]( https://vidsitu.org/) |[paper]( https://arxiv.org/abs/2104.00990) | 136K| /| 29K| /| the events and related roles in the movies | 2021| semantic role and co-referencing prediction| AMT|\n|[VATEX]( https://eric-xw.github.io/vatex-website/) | [paper](https://arxiv.org/abs/1904.03493)| 41.3k| 826k| 41.3k| 114h38m| human behavior video from YouTube| 2019| action recognition & captioning| /|\n|[MSVD]( https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) | [paper](https://aclanthology.org/P11-1020.pdf)| 2k| 70k| 2k| 4h55m| web videos| 2011| video captioning| AMT |\n|[MovieNet](http://movienet.site/) | [paper]( https://arxiv.org/abs/2007.10937)| 420k| 25k| 420k| /| Web Movies| 2020|  Genre classification & cinematic style analysis & character recognition &  scene analysis & story understanding| crowdsourcing| \n|[MovieGraphs](http://moviegraphs.cs.toronto.edu/) | [paper]( http://moviegraphs.cs.toronto.edu/)| 7.6k| 70k| 51| 150h| scene graph representation of movie| 2018| description retrieval & dialog retrieval & Movie Clip Retrieval | crowdsourcing|\n|[QVHIGHLIGHTS](https://github.com/jayleicn/moment_detr) | [paper](https://arxiv.org/pdf/2107.09609.pdf) | 10.3k| 10.2k| 10.3k| / | daily or travel vlog and news| 2021| moment retrieval & highlight detection| AMT|\n|[UCF101](https://www.crcv.ucf.edu/research/data-sets/ucf101/) | [paper]( https://www.crcv.ucf.edu/wp-content/uploads/2019/03/UCF101_CRCV-TR-12-01.pdf) | 13.3k| 1600m| 13.3k| / | user-uploaded videos| 2012| action recognition| crowdsourcing |\n|[HMDB51]( https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset) | [paper]( http://serre-lab.clps.brown.edu/wp-content/uploads/2012/08/Kuehne_etal_ICCV2011.pdf) | 7K| /| 7K| /| action videos from YouTube/Google| 2011| action recognition & captioning| crowdsourcing|\n|[Moments-in-Time]( http://moments.csail.mit.edu/) | [paper]( https://arxiv.org/abs/1801.03150) | 1M| /| 1M| /| edited videos from YouTube, Flickr, Vine, Metacafe and other sources| 2017| action & event recognition| AMT|\n|[AVA]( https://github.com/cvdfoundation/ava-dataset) | [paper](https://arxiv.org/abs/1705.08421) | 57.6K| 300k| 57.6K| / | Web Movies with human bounding boxes| 2017| atomic visual action recognition| crowdsourcing|\n|[HVU]( https://holistic-video-understanding.github.io/) | [paper](https://arxiv.org/abs/1904.11451) | 57.2K| 9M| 57.2K| / | YouTube| 2020| multi-label and multi-task video understanding| semi-automatic crowdsourcing strategy |\n|[Oops!]( https://github.com/DmZhukov/CrossTask) | [paper]( https://arxiv.org/abs/1911.11206) | 20K| / | 20K| / | in-the-wild videos of unintentional 
action| 2019| unintentional action recognition| AMT|\n|[CrossTask]( https://github.com/DmZhukov/CrossTask) | [paper]( https://arxiv.org/pdf/1903.08225.pdf) | 4.7K| / | 4.7K| /| weakly supervised learning from instructional videos| 2019| video classification| crowdsourcing|\n|[COIN]( https://coin-dataset.github.io/) | [paper]( https://arxiv.org/pdf/1903.02874.pdf) | 11.8K | /| 11.8K| /| Comprehensive instructional video analysis | 2019| step localization & action recognition| crowdsourcing|\n|[Sports-1M]( https://cs.stanford.edu/people/karpathy/deepvideo/) | [paper]( http://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf) | 1.1M| /| 1.1M| /| sports videos from YouTube | 2014| video classification| crowdsourcing labeled with taxonomy|\n|[20BN-SOMETHING-SOMETHING]( https://20bn.com/datasets/something-something) | [paper]( https://arxiv.org/abs/1706.04261) | 220K| 318K| 220K| /| show humans performing pre-defined basic actions with everyday objects| 2017| action recognition| AMT|\n|[DALY]( http://thoth.inrialpes.fr/daly/) | [paper]( https://arxiv.org/pdf/1605.05197.pdf) | 8.1K| / | 8.1K| /| Daily Action Localization in YouTube| 2016| video classification| crowdsourcing|\n|[FineGym]( https://sdolivia.github.io/FineGym/) | [paper]( https://arxiv.org/abs/2004.06704) | 8.1K| / | 8.1K| /|  gymnastic videos with temporal actions and sub-actions| 2020| video action recognition & detection & generation| crowdsourcing|\n|[MultiSports]( https://deeperaction.github.io/multisports/) | [paper]( https://arxiv.org/abs/2105.07404) | 3.2K| / | 3.2K| /| competition videos with high resolution held in recent years| 2021| spatio-temporal action detection| /|\n|[“Wildlife Action”]() | [paper]( https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Weining_L_Report.pdf) | 10.6K| / | 10.6K| /| downloaded from YouTube| 2020| animal action recognition| YouTube’s Data API|\n|[“Action Recognition of Large Animals”]() | [paper]( https://ieeexplore.ieee.org/document/8634672) | /| / | /| /| downloaded from YouTube| 2018| animal action recognition| YouTube’s Data API|\n|[“First-Person Animal Action”]() | [paper]( http://robotics.ait.kyushu-u.ac.jp/~yumi/db/papers/2014_ICPR_Final.pdf) | /| / | /| /| collected by a dog wearing a GoPro size camera| 2014| first-person animal activity recognition| /|\n|[AnimalWeb]( https://vcla.stat.ucla.edu/people/zhangzhang-si/HiT/exp5.html) | [paper]( https://arxiv.org/pdf/1909.04951.pdf) | /| / | /| /| collected by a dog wearing a GoPro size camera| 2014| first-person animal activity recognition| /|\n\n\n## Video Datasets\n\n|Dataset  |Videos |Duration | Source| Year | \n|-----|:-----:|:--------:|:---:|:-------:|\n|[YouTube-8M](https://research.google.com/youtube8m/index.html) | 6M|350,000|YouTube| 2018|\n|[FineAction](https://deeperaction.github.io/fineaction/) |16,732 | -| YouTube |  24 May 2021|\n|[VideoLT](https://videolt.github.io/) | 256,218 | 819,898 | YouTube|  6 May 2021| \n\n\n## Dataset collection tools\n[Voxel51](https://voxel51.com/)  \n[Amazon Mechanical Turk](https://www.mturk.com/)  \n[shaip](https://www.shaip.com/)\n"
  },
  {
    "path": "efficiency-transformer.md",
    "content": "# Efficient Transformer\n\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1|HiBERT |HIBERT: Document Level Pre-training of Hierarchical BidirectionalTransformers for Document Summarization |[paper](https://arxiv.org/pdf/1905.06566.pdf) |__ACL 2019__|Microsoft Research Asia|16 May 2019|\n|2|star transformer|Star Transformer |[paper](https://www.aclweb.org/anthology/N19-1133.pdf) |__NAACL 2019__|Shanghai Key Laboratory of Intelligent Information Processing, Fudan University|25 Feb 2019|\n|3|ETC |ETC: Encoding Long and Structured Inputs in Transformers |[paper](https://www.aclweb.org/anthology/2020.emnlp-main.19.pdf) |__EMNLP 2020__|Google AI|16 November 2020|\n|4|BP-Transformer |BP-Transformer: Modelling Long-Range Context via Binary Partitioning |[paper](https://arxiv.org/pdf/1911.04070.pdf) [code](https://github.com/yzh119/BPT)|__arXiv__|AWS Shanghai AI Lab|11 November 2019|\n|5|Routing Transformer |Efficient Content-Based Sparse Attention with Routing Transformers |[paper](https://openreview.net/forum?id=B1gjs6EtDr) [code](https://github.com/lucidrains/routing-transformer)|__ICLR 2020__|Google AI|1 Februray 2021|\n|7|Compressive Transformer |Compressive Transformers for Long-Range Sequence Modelling |[paper](https://openreview.net/pdf?id=SylKikSYDH) [code](https://github.com/lucidrains/compressive-transformer-pytorch)|__ICLR 2020__|Deep Mind|25 Sep 2019|\n|8|Transformer-XL |Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context |[paper](https://arxiv.org/abs/1901.02860) [code](https://github.com/kimiyoung/transformer-xl)|__ACL 2019__|CMU|9 Jan 2019|\n|9|Big Bird |Big Bird: Transformers for Longer Sequences |[paper](https://arxiv.org/abs/2007.14062) [code](https://github.com/google-research/bigbird)|__NeurIPS 2020__|Google Research|8 Jan 2021|\n|10|Adaptive-Span |Adaptive Attention Span in Transformers |[paper](https://arxiv.org/pdf/1905.07799.pdf) [code](https://github.com/facebookresearch/adaptive-span)|__ACL 2019__|Facebook AI|19 May 2019|\n|11|reformer |reformer: the efficient transformer |[paper](https://arxiv.org/abs/2001.04451) [code](https://github.com/lucidrains/reformer-pytorch)|__ICLR 2020__|Google AI|13 Jan 2020|\n|12|Longformer |Longformer: The Long-Document Transformer |[paper](https://arxiv.org/abs/2004.05150) [code](https://github.com/allenai/longformer)|__ICLR 2020__|Allen Insitute for Artificial Intelligence|2 Dec 2020|\n|13| - | parameter efficient multimodal transformers for video representation learning | [paper](https://openreview.net/forum?id=6UdQLhqJyFD) [code](https://github.com/sangho-vision/avbert) | __ICLR 2021__| Seoul National University | 8 Dec 2020|\n|14| Albert| Albert: A lite BERT for self-supervised learning of language prepresentations | [paper](https://openreview.net/pdf?id=H1eA7AEtvS) [code](https://github.com/google-research/ALBERT) | __ICLR 2020__| Google Research | 26 Sep 2019|\n|15| DEQ | Deep Equilibrium Models |[paper](https://proceedings.neurips.cc/paper/2019/file/01386bd6d8e091c2ab4c7c7de644d37b-Paper.pdf) [code](https://github.com/locuslab/deq) |  __NeurIPS 2019__| CMU |3 Sep 2019|\n|16| Universal Transformer | Universal Transformers | [paper](https://arxiv.org/pdf/1807.03819.pdf) [code](https://github.com/andreamad8/Universal-Transformer-Pytorch) | __ICLR 2019__| University of Amsterdam | 5 May 2019|\n|17| Linear Transformer | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | 
[paper](https://arxiv.org/pdf/2006.16236.pdf) [code](https://linear-transformers.com/) | __ICML 2020__ | Idiap Research Institute | 31 Aug 2020|\n|18| ∞-former | ∞-former: Infinite Memory Transformer | [paper](https://arxiv.org/pdf/2109.00301.pdf) | __arXiv__ |Pedro Henrique Martins | 1 Sep 2021|\n|19| ATS| ATS: Adaptive Token Sampling For Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2111.15667.pdf) | arXiv | Microsoft|30 Nov 2021|\n|20 |TerViT |TerViT: An Efficient Ternary Vision Transformer |[paper](https://arxiv.org/pdf/2201.08050.pdf) | arXiv| Beihang University| 20 Jan 2022|\n|21| Lite Transformer| Lite Transformer with Long-Short Range Attention| [paper](https://arxiv.org/pdf/2004.11886.pdf) [code](https://github.com/mit-han-lab/lite-transformer) | __ICLR 2020__ | MIT | 24 Apr 2020|\n|22|UVC| Unified Visual Transformer Compression | [paper](https://github.com/VITA-Group/UVC) [code](https://github.com/VITA-Group/UVC) |__ICLR 2022__ |University of Texas at Austin| 29 Sept 2021|\n|23|MobileViT| MobileViT: light-weight, general-purpose, and mobile-friendly vision transformers | - | __ICLR 2022__| Apple | 5 Oct 2021|\n\n"
  },
  {
    "path": "image-language-transformer.md",
    "content": "# Image & Language (Retrieval & captioning & image generation )\r\n\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|ViusalGPT |VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning |[paper]( https://arxiv.org/abs/2102.10407) [code]( https://github.com/Vision-CAIR/VisualGPT) |__arXiv 2021__|KAUST|20 Feb 2021|\r\n|2|Kaleido-BERT |Kaleido-BERT: Vision-Language Pre-training on Fashion Domain |[paper](https://arxiv.org/pdf/2103.16110.pdf) [code]( https://github.com/mczhuge/Kaleido-BERT/) |__CVPR 2021__|Alibaba Group|15 April 2021|\r\n|3|CLIPBERT |Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | [paper](https://arxiv.org/pdf/2102.06183.pdf) [code](https://github.com/jayleicn/ClipBERT) |__CVPR 2021__| UNC | 11 Feb 2021|\r\n|4| -|Probabilistic Embeddings for Cross-Modal Retrieval| [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chun_Probabilistic_Embeddings_for_Cross-Modal_Retrieval_CVPR_2021_paper.html) [github](https://github.com/naver-ai/pcme) | __CVPR 2021__ | NAVER Lab|  14 June 2021|\r\n|5| -| Scaling Up Vision-Language Representation Learning With Noisy Text Supervision | [paper](https://arxiv.org/pdf/2102.05918.pdf) | ICML 2021| Google | 11 June 2021|\r\n|6|-|Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training| [paper](https://arxiv.org/pdf/2106.13488.pdf) | arXiv| MSRA| 28 June 2021|\r\n|7| CogView| CogView: Mastering Text-to-Image Generation via Transformers | [paper](https://arxiv.org/pdf/2105.13290.pdf) [code](https://github.com/THUDM/CogView) | arXiv | TsingHua University | 28 May 2021| \r\n|8|ViLT| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision| [paper](https://arxiv.org/pdf/2102.03334.pdf) [code](https://github.com/dandelin/vilt) | ICML 2021 | NAVER AI lab|  10 Jun 2021| \r\n|9| - |Unifying Vision-and-Language Tasks via Text Generation | [paper](https://arxiv.org/pdf/2102.02779.pdf) [code](https://github.com/j-min/VL-T5) | ICML 2021 | UNC | 23 May 2021|\r\n|10| Pixel-BERT | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | [paper](https://arxiv.org/pdf/2004.00849.pdf) | arXiv | Univesity of Science and Technology Beijing |  22 Jun 2020 |\r\n|11| -| How Much Can CLIP Benefit Vision-and-Language Tasks?| [paper](https://arxiv.org/pdf/2107.06383.pdf)| arXiv| UCB | 13 Jul 2021|\r\n|12| LXMERT |LXMERT: Learning Cross-Modality Encoder Representations from Transformers| [paper](https://arxiv.org/abs/1908.07490) [code](https://github.com/airsplay/lxmert)| EMNLP 2019| UNC Chapel Hill | 3 Dec 2019|\r\n|13| ViLBERT | VilBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks| [paper](https://arxiv.org/abs/1908.02265) [code](https://github.com/jiasenlu/vilbert_beta)| NeurIPS 2019| Georgia Institute of Technology | 6 Aug 2019|\r\n|14| ImageBERT | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | [paper](https://arxiv.org/abs/2001.07966) | arXiv | Bing, Microsoft|23 Jan 2020|\r\n|15| Unicoder-VL | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | [paper](https://arxiv.org/pdf/1908.06066v3.pdf) | AAAI 2020| MSRA | 2 Dec 2019|\r\n|16| VLP | Unified Vision-Language Pre-Training for Image Captioning and VQA | [paper](https://arxiv.org/pdf/1909.11059.pdf) [code](https://github.com/LuoweiZhou/VLP) | AAAI 2020| 
University of Michigan | 4 Dec 2019|\r\n|17| XGPT |XGPT: Cross-modal Generative Pre-Training for Image Captioning |[paper](https://arxiv.org/pdf/2003.01473.pdf) | arXiv| Peking University | 4 Mar 2020|\r\n|18| 12-IN-1 | 12-in-1: Multi-Task Vision and Language Representation Learning | [paper](https://arxiv.org/pdf/1912.02315.pdf) [code](https://github.com/facebookresearch/vilbert-multi-task) | CVPR 2020 | Facebook | 5 Dec 2019|\r\n|19| FashionBERT | FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval | [paper](https://arxiv.org/abs/2005.09801)| SIGIR | Alibaba | 20 May 2020|\r\n|20| UNITER | UNITER: UNiversal Image-TExt Representation Learning | [paper](https://arxiv.org/abs/1909.11740) [code](https://github.com/ChenRocks/UNITER) | ECCV 2020 | Microsoft Dynamics 365 AI Research| 25 Sep 2019|\r\n|21| VisDial-BERT | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | [paper](https://arxiv.org/abs/1912.02379) [code](https://github.com/vmurahari3/visdial-bert) | ECCV 2020 | Georgia Institute of Technology | 31 Mar 2020 |\r\n|22| OSCAR | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | [paper](https://arxiv.org/abs/2004.06165) [code](https://github.com/microsoft/Oscar) | ECCV 2020| Microsoft |13 Apr 2020|\r\n|23| KD-VLP| KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation |[paper](https://arxiv.org/abs/2109.10504) | EMNLP 2021 | ShanghaiTech | 22 Sep 2021|\r\n|24| Fast & Slow| Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers|[paper](https://arxiv.org/abs/2103.16553)| CVPR 2021 |DeepMind | 30 Mar 2021|\r\n|25| - |Unifying Multimodal Transformer for Bi-directional Image and Text Generation | [paper](https://arxiv.org/abs/2110.09753) | arXiv | Sun Yat-sen University| 19 Oct 2021 |\r\n|26|SOHO| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | [paper](https://arxiv.org/pdf/2104.03135.pdf) | CVPR 2021 | University of Science and Technology Beijing | 8 Apr 2021|\r\n|27| E2E-VLP | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning| [paper](https://aclanthology.org/2021.acl-long.42.pdf) | ACL 2021| Alibaba Group | 3 June 2021|\r\n|28| L-Verse| L-Verse: Bidirectional Generation Between Image and Text| [paper](https://arxiv.org/pdf/2111.11133.pdf) | arXiv | LG AI Research | 22 Nov 2021|\r\n|29| NUWA| NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion| [paper](https://arxiv.org/pdf/2111.12417.pdf) | arXiv | MSRA| 24 Nov 2021|\r\n|30| Florence| Florence: A New Foundation Model for Computer Vision | [paper](https://arxiv.org/abs/2111.11432) | arXiv | Microsoft | 22 Nov 2021|\r\n|31| -|Distilled Dual-Encoder Model for Vision-Language Understanding| [paper](https://arxiv.org/pdf/2112.08723v1.pdf) | arXiv | Microsoft | 16 Dec 2021|\r\n|32| FLAVA| FLAVA: A Foundational Language And Vision Alignment Model| [paper](https://arxiv.org/pdf/2112.04482.pdf) | arXiv | FAIR | 8 Dec 2021|\r\n\r\n\r\n\r\n# Object Detection\r\n\r\n|No.  |Model Name |Title |Links |Pub. 
| Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|MDETR|  MDETR - Modulated Detection for End-to-End Multi-Modal Understanding   | [paper](https://arxiv.org/pdf/2104.12763.pdf)  [code](https://github.com/ashkamath/mdetr)  | __ICCV 2021__|NYU |26 April 2021|\r\n|2| pix2seq| Pix2seq: A Language Modeling Framework for Object Detection| [paper](https://arxiv.org/abs/2109.10852) | arXiv | Google Research |  22 Sep 2021 |\r\n\r\n\r\n\r\n\r\n"
  },
  {
    "path": "image-transformer.md",
    "content": "## Image Classification\r\n\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|ViT |An image is worth 16 * 16 words: transformers for image recognition at scale |[paper]( https://arxiv.org/pdf/2010.11929.pdf) [code]( https://github.com/rwightman/pytorch-image-models) |__ICLR 2021__|Google Brain|22 Oct 2020|\r\n|2|LeViT |LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference |[paper](https://arxiv.org/abs/2104.01136)  |__arXiv__|/|2 Apr 2021|\r\n|3|Swin Transformer |Swin Transformer: Hierarchical Vision Transformer using Shifted Windows |[paper](https://arxiv.org/pdf/2103.14030.pdf) [code](https://github.com/microsoft/Swin-Transformer)  |__arXiv__|MSRA|25 Mar 2021|\r\n|4|DeiT Transformer |Training data-efficient image transformers& distillation through attention |[paper](https://arxiv.org/pdf/2012.12877.pdf) [code](https://github.com/facebookresearch/deit)  |__arXiv__|Facebook AI|15 Jan 2021|\r\n|5|Pyramid Vision Transformer |Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions|[paper](https://arxiv.org/abs/2102.12122) [code](https://github.com/whai362/PVT)  |__arXiv__|Nanjing University of Science and Technology|24 Feb 2021|\r\n|6|TNT |Transformer in Transformer|[paper](https://arxiv.org/pdf/2103.00112.pdf) [code](https://github.com/huawei-noah/noah-research/tree/master/TNT)  |__arXiv__|Noah's Ark Lab|27 Feb 2021|\r\n|7|PiT |Rethinking Spatial Dimensions of Vision Transformers|[paper](https://arxiv.org/pdf/2103.16302.pdf) [code](https://github.com/naver-ai/pit)  |__arXiv__|NAVER AI Lab|30 Mar 2021|\r\n|8|T2T-ViT |Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet|[paper](https://arxiv.org/pdf/2101.11986.pdf) [code](https://github.com/yitu-opensource/T2T-ViT)  |__arXiv__| NUS|22 Mar 2021|\r\n|9|CPVT |Conditional Positional Encodings for Vision Transformers|[paper](https://arxiv.org/pdf/2102.10882.pdf) [code](https://github.com/Meituan-AutoML/CPVT)  |__arXiv__| Meituan Inc|18 Mar 2021|\r\n|10|ViL |Multi-Scale Vision Longformer:A New Vision Transformer for High-Resolution Image Encoding|[paper](https://arxiv.org/pdf/2103.15358.pdf)   |__arXiv__| Microsoft Corporation|29 Mar 2021|\r\n|11|CoaT |Co-Scale Conv-Attentional Image Transformer|[paper](https://arxiv.org/abs/2104.06399) [code](https://github.com/mlpc-ucsd/CoaT)  |__arXiv__| University of California San Diego|13 April 2021|\r\n|12|CoaT |Co-Scale Conv-Attentional Image Transformer|[paper](https://arxiv.org/abs/2104.06399) [code](https://github.com/mlpc-ucsd/CoaT)  |__arXiv__| University of California San Diego|13 April 2021|\r\n|14|pruning |Visual Transforemr Pruning | [paper](https://arxiv.org/pdf/2104.08500.pdf) |__arXiv__|Zhejiang University| 17 April 2021 |\r\n|15|ViL| Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding| [paper](https://arxiv.org/pdf/2103.15358.pdf) |__arXiv__|Microsoft Corporation |29 Mar 2021|\r\n|16|M2TR| M2TR: Multi-modal Multi-scale Transformersfor Deepfake Detection | [paper](https://arxiv.org/pdf/2104.09770.pdf) | __arXiv__ | Fudan Univeristy | 21 Apr|\r\n|17|VisTransformer | Visformer: The Vision-friendly Transformer |[paper](https://arxiv.org/pdf/2104.12533.pdf) [code](https://github.com/danczs/Visformer) | __arXiv__ | Beihang University | 26 April 2021|\r\n|18| ConTNet | ConTNet: Why not use convolution and transformer at the same time?| [paper](https://arxiv.org/pdf/2104.13497.pdf) 
[code](https://github.com/yan-hao-tian/ConTNet)|__arXiv__| ByteDance AI Lab | 27 Apr 2021 |\r\n|19| Twins-SVT | Twins: Revisiting the Design of Spatial Attention in Vision Transformers  | [paper](https://arxiv.org/pdf/2104.13840.pdf) [code](https://github.com/Meituan-AutoML/Twins) |__arXiv__ | Meituan Inc | 28 Apr 2021|\r\n|20|LeViT| LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference| [paper](https://arxiv.org/pdf/2104.01136.pdf) [code](https://github.com/facebookresearch/LeViT) | __arXiv__ | Facebook | 6 May 2021 |\r\n|21|CoAtNet| CoAtNet: Marrying Convolution and Attention for All Data Sizes | [paper](https://arxiv.org/pdf/2106.04803.pdf) | __arXiv__ | Google Brain| 9 June 2021|\r\n|22|Focal Transformer |Focal Self-attention for Local-Global Interactions in Vision Transformers  |[paper](https://arxiv.org/pdf/2107.00641.pdf) | arXiv | Microsoft Research at Redmond | 1 Jul 2021|\r\n|23|BEIT| BEIT: BERT Pre-Training of Image Transformers| [paper](https://arxiv.org/pdf/2106.08254.pdf)|arXiv| Microsoft|15 Jun 2021|\r\n|24| ViT-G| Scaling Vision Transformers| [paper](https://arxiv.org/pdf/2106.04560.pdf) | arXiv | Google Brain | 8 Jun 2021| \r\n|25| -| Efficient Training of Visual Transformers with Small-Size Datasets | [paper](https://arxiv.org/pdf/2106.03746.pdf) | arXiv |TFBK| 7 Jun 2021|\r\n|26|PS-ViT | Vision Transformer with Progressive Sampling | [paper](https://arxiv.org/pdf/2108.01684.pdf) [code](https://github.com/yuexy/PS-ViT) | arXiv | Centre for Perceptual and Interactive Intelligence| 3 Aug 2021|\r\n|27|MAE| Masked Autoencoders Are Scalable Vision Learners| [paper](https://arxiv.org/pdf/2111.06377.pdf)  | arXiv | Facebook FAIR| 11 Nov 2021|\r\n|28| Evo-ViT | Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer |[paper](https://arxiv.org/pdf/2108.01390.pdf) | AAAI 2022 | Chinese Academy of Sciences| 6 Dec 2021|\r\n|29| ATS| ATS: Adaptive Token Sampling For Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2111.15667.pdf) | arXiv | Microsoft|30 Nov 2021|\r\n|30| AdaViT | AdaViT: Adaptive Vision Transformers for Efficient Image Recognition| [paper](https://arxiv.org/pdf/2111.15668.pdf) | arXiv | Fudan University| 30 Nov 2021|\r\n|31| PeCo| PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers| [paper](https://arxiv.org/pdf/2111.12710.pdf) [code](https://github.com/microsoft/PeCo) | arXiv | University of Science and Technology of China| 24 Nov 2021|\r\n|32| DAT| Vision Transformer with Deformable Attention | [paper](https://arxiv.org/pdf/2201.00520.pdf) [code](https://github.com/LeapLabTHU/DAT) |arXiv | Tsinghua University | 3 Jan 2022|\r\n\r\n\r\n\r\n# Visual Relationship Detection\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1| RelTransformer | RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory| [paper](https://arxiv.org/pdf/2104.11934.pdf) [code](https://github.com/Vision-CAIR/RelTransformer) | __arXiv__| KAUST| 24 April 2021|\r\n\r\n\r\n\r\n# Object Tracking\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time | \r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1| MOTR | MOTR: End-to-End Multiple-Object Tracking with TRansformer| [paper](https://arxiv.org/pdf/2105.03247.pdf) [code](https://github.com/megvii-model/MOTR) | __arXiv__| MEGVII Technology| 7 May 2021|\r\n\r\n\r\n\r\n"
  },
  {
    "path": "other_interesting_paper.md",
    "content": "# Different interesting attention designs\n\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1|Invertible Attention |Invertible Attention |[paper](https://arxiv.org/pdf/2106.09003.pdf) |arxiv|Australian National University|27 Jun 2021|\n|2|AutoSampling| AutoSampling : Search for Effective Data Sampling Schedules| [paper](https://arxiv.org/pdf/2105.13695.pdf) | ICML 2021|SenseTime Research  | 28 May 2021|\n|3|AdaFocus-TSM| Adaptive Focus for Efficient Video Recognition | [paper](https://arxiv.org/pdf/2105.03245.pdf) [code](https://github.com/blackfeather-wang/AdaFocus) | arXiv | Tsinghua University | 7 May 2021|\n|4| SMART| SMART Frame Selection for Action Recognition| [paper](https://arxiv.org/pdf/2012.10671.pdf) |AAAI 2021| University of Edinburgh | 19 Dec 2020|\n|5|-| A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Feichtenhofer_A_Large-Scale_Study_on_Unsupervised_Spatiotemporal_Representation_Learning_CVPR_2021_paper.pdf) [code](https://github.com/facebookresearch/SlowFast) | CVPR 2021 | Facebook | 29 Apr 2021|\n|6| PixelTransformer| PixelTransformer: Sample Conditioned Signal Generation  |[paper](https://arxiv.org/abs/2103.15813) [code](https://shubhtuls.github.io/PixelTransformer/)| ICML 2021 | Facebook | 29 Mar 2021|\n|7| Perceiver | Perceiver: General Perception with Iterative Attention | [paper](https://arxiv.org/pdf/2103.03206.pdf) [code](https://github.com/lucidrains/perceiver-pytorch) | ICML 2021 | DeepMind | 23 Jun 2021|\n|8 | DOVE| DOVE: Learning Deformable 3D Objects by Watching Videos|[paper](https://arxiv.org/pdf/2107.10844.pdf) [code](https://dove3d.github.io/) | arXiv | Oxford | 22 Jul 2021 |\n|9| MGSampler| MGSampler: An Explainable Sampling Strategy for Video Action Recognition| [paper](https://arxiv.org/abs/2104.09952)| ICCV 2021 | Nanjing university | 20 Apr 2021|\n|10| Expire Span| Not All Memories are Created Equal: Learning to Forget by Expiring | [paper](https://arxiv.org/pdf/2105.06548.pdf)| ICML 2021 | Facebook | 13 Jun 2021|\n|11| -| STEP-UNROLLED DENOISING AUTOENCODERS FOR TEXT GENERATION| [paper](https://arxiv.org/pdf/2112.06749.pdf) | arXiv | Deepmind|  13 Dec 2021| \n|12| -| dataset meta-learning from kernel ridge regression| [paper](https://openreview.net/pdf?id=l-PrrQrK0QR) | arXiv | Google Brain| 22 Mar 2021 |\n|13| -| Dataset Distillation with Infinitely Wide Convolutional Networks| [paper](https://openreview.net/pdf?id=hXWPpJedrVP) | arXiv| Google Brain| 27 Oct 2021|\n|14 | bert2BERT | bert2BERT: Towards Reusable Pretrained Language Models | [paper](https://arxiv.org/abs/2110.07143) | ACL 2022 | Huawei Noah’s Ark Lab| 14 Oct 2021 |\n"
  },
  {
    "path": "paper-review.md",
    "content": "# Multi-Modal Survey Paper\n\n|No.  |topic |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1|video-language |Bridging Vision and Language from the Video-to-TextPerspective: A Comprehensive Review |[paper](https://arxiv.org/pdf/2103.14785v1.pdf) |__arXiv__|University of Chile|27 Mar 2021|\n|2| video-language pretraining| Survey: Transformer based Video-Language Pre-training | [paper](https://arxiv.org/pdf/2109.09920.pdf) | __arXiv__ |  Renmin University of China | 21 Sep 2021|\n\n"
  },
  {
    "path": "video-language-transformer.md",
    "content": "# Video & Language Transformer\r\n\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|COOT |COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning |[paper](https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf) [code](https://github.com/gingsi/coot-videotext) |__Neurips 2020__|University of Freiburg|1 Nov 2020|\r\n|2|MMT |Multi-modal Transformer for Video Retrieval |[paper](https://arxiv.org/abs/2007.10639) [code](https://github.com/gabeur/mmt) |__ECCV 2020__|Inria & Google|21 Jul 2020|\r\n|3|HiT |HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval |[paper](https://arxiv.org/abs/2103.15049) |__arXiv__|Peking University|28 Mar 2021|\r\n|4|CLIPBERT |Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling |[paper](https://arxiv.org/pdf/2102.06183.pdf) [code](https://github.com/jayleicn/ClipBERT) |__CVPR 2021__|UNC Chapel Hill|11 Feb 2020|\r\n|5|SVRTN |Self-supervised Video Retrieval Transformer Network |[paper](https://arxiv.org/pdf/2104.07993.pdf) |__arXiv__|Alibaba DAMO Academy|16 Apr 2021|\r\n|6| VATT| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | [paper](https://arxiv.org/pdf/2104.11178.pdf) | __arXiv__| Google | 22 April 2021|\r\n|7|Forzen in Time | Forzen in Time: A Joint Video and Image Encoder for End-to-End Retrieval| [paper](https://arxiv.org/pdf/2104.00650.pdf) [code](https://github.com/m-bain/frozen-in-time) | __arXiv__ | University of Oxford| 1 April 2021|\r\n|8|CLIP4CLIP| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | [paper](https://arxiv.org/pdf/2104.08860.pdf) [code](https://github.com/ArrowLuo/CLIP4Clip)  |   __arXiv__|  Southwest Jiaotong University | 18 April 2021 |\r\n|9|CLIP2Video| CLIP2Video: Mastering Video-Text Retrieval via Image CLIP |  [paper](https://arxiv.org/pdf/2106.11097.pdf) [code](https://github.com/CryhanFang/CLIP2Video) | __arXiv__| PCG, Tencent | 21 June, 2021 |\r\n|10| T2VLAD| T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval | [paper](https://arxiv.org/abs/2104.10054)  | CVPR 2021 | Baidu | 20 April 2021 | \r\n|11|-| On Semantic Similarity in Video Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wray_On_Semantic_Similarity_in_Video_Retrieval_CVPR_2021_paper.pdf)  [code](https://github.com/mwray/Semantic-Video-Retrieval) | CVPR 2021 |Univesity of Bristol | 21 June, 2021|\r\n|12| VLM|VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding| [paper](https://arxiv.org/pdf/2105.09996.pdf) | arXiv | Facebook AI | 20 May 2021|\r\n|13| VideoBERT| VideoBERT: A Joint Model for Video and Language Representation Learning |[paper](https://arxiv.org/abs/1904.01766) | CVPR 2019 | Google Research | 11 Sep 2019 |\r\n|14| CBT | learning video representations using contrastive bidirectional transformer |[paper](https://arxiv.org/pdf/1906.05743.pdf)| arXiv | Google Research |  27 Sep 2019|\r\n|15 | ActBERT | ActBERT: Learning Global-Local Video-Text Representations |[paper](https://arxiv.org/abs/2011.07231) |  Baidu Research | CVPR 2020 | 14 Nov 2020| \r\n|16 | HERO |  HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training  |[paper](https://aclanthology.org/2020.emnlp-main.161.pdf) [code](https://github.com/linjieli222/HERO) | EMNLP 2020 | Microsoft Dynamics 365 AI Research | 29 Sep 2020 |\r\n| 17 | UniVL | 
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | [paper](https://arxiv.org/pdf/2002.06353.pdf) [code](https://github.com/microsoft/UniVL) |arXiv| MSRA| 15 Feb 2020|\r\n|18 |BSP| Boundary-sensitive Pre-training for Temporal Localization in Videos |[paper](https://arxiv.org/pdf/2011.10830.pdf) | ICCV 2021 | Samsung AI Centre Cambridge, UK | 26 Mar 2021 |\r\n|19| MM-ViT| MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition| [paper](https://arxiv.org/pdf/2108.09322.pdf) | arXiv | Oppo| 20 Aug 2021|\r\n|20| TPT |Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | [paper](https://arxiv.org/pdf/2109.04735v1.pdf) | arXiv | Chinese Academy of Sciences | 10 Sep 2021|\r\n|21| ActionCLIP| ActionCLIP: A New Paradigm for Video Action Recognition | [paper](https://arxiv.org/pdf/2109.08472.pdf) [code](https://github.com/sallymmx/ActionCLIP.git) | arXiv | Zhejiang University  |17 Sep 2021|\r\n|22| justAsk | Just Ask: Learning to Answer Questions from Millions of Narrated Videos|[paper](https://arxiv.org/pdf/2012.00451.pdf) [code](https://github.com/antoyang/just-ask) | ICCV 2021 | Inria Paris| 12 Aug 2021| \r\n|23| - | A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer| [paper](https://arxiv.org/pdf/2112.04888.pdf) [code](github.com/weijiawu/BOVText)| arXiv |Zhejiang University| 9 Dec 2021|\r\n|24| SWINBERT| SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning | [paper](https://arxiv.org/pdf/2111.13196.pdf) | arXiv | Microsoft | 25 Nov 2021|\r\n|25 | VIOLET | VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | [paper](https://arxiv.org/pdf/2111.12681.pdf) [code](https://github.com/tsujuifu/pytorch_violet) | arXiv | UC Santa Barbara|  24 Nov 2021|\r\n|26| FashionViL | FashionViL: Fashion-Focused Vision-and-Language Representation Learning | [paper](https://arxiv.org/pdf/2207.08150.pdf) [code](https://github.com/BrandonHanx/mmf) | ECCV 2022 | University of Surrey | 17 Jul 2022 |\r\n\r\n\r\n# cross-domain video-retrieval \r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1| -| Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Liu_Adaptive_Cross-Modal_Prototypes_for_Cross-Domain_Visual-Language_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Zhejiang University| 20 April 2021| \r\n\r\n\r\n# vision & language navigation\r\n|No.  |Model Name |Title |Links |Pub. 
| Organization| Release Time |\r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|Episodic Transformer | Episodic Transformer for Vision-and-Language Navigation| [paper](https://arxiv.org/pdf/2105.06453.pdf) | arXiv | Inria |  13 May 2021| \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n"
  },
  {
    "path": "video-transformer.md",
    "content": "# Video Transformer\r\n\r\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\r\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\r\n|1|TimeSformer |Is Space-Time Attention All You Need for Video Understanding? |[paper](https://arxiv.org/abs/2102.05095) [code](https://github.com/facebookresearch/TimeSformer) |__arXiv__|Facebook AI|24 Feb 2021|\r\n|2|Video Transformer |Video Transformer Network |[paper](https://arxiv.org/abs/2102.00719) |__arXiv__|Theator|1 Feb 2021|\r\n|3|ViViT |ViViT: A Video Vision Transformer |[paper](https://arxiv.org/pdf/2103.15691.pdf) |__arXiv__|Google AI|29 Mar 2021|\r\n|4|VideoGPT |  VideoGPT: Video Generation using VQ-VAE and Transformers |  [paper](https://arxiv.org/pdf/2104.10157.pdf) [code](https://wilson1yan.github.io/videogpt/index.html)  | __arXiv__ | UC Berkeley | 20 Apr 2021|\r\n|5|VIMPAC|VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning| [paper](https://arxiv.org/pdf/2106.11250.pdf) [code](https://github.com/airsplay/vimpac) | __arXiv__ | UNC| 21 June 2021|\r\n|6|-| Self-supervised Video Representation Learning by Context and Motion Decoupling | [paper](https://arxiv.org/pdf/2104.00862.pdf)| CVPR 2021 | Alibaba | 2 April 2021|\r\n|7|VideoLightFormer| VideoLightFormer: Lightweight Action Recognition using Transformers| [paper](https://arxiv.org/pdf/2107.00451v1.pdf) | arXiv| the university of shefield| 1 Jul 2021|\r\n|8|Video Swin Transformer| Video Swin Transformer| [paper](https://arxiv.org/pdf/2106.13230.pdf) [code](https://github.com/SwinTransformer/Video-Swin-Transformer) | arXiv | MSRA | 24 Jun 2021|\r\n|9| ST Swin| Long-Short Temporal Contrastive Learning of Video Transformers| [paper](https://arxiv.org/pdf/2106.09212.pdf) |arXiv|Facebook AI|  17 Jun 2021|\r\n|10|X-ViT|Space-time Mixing Attention for Video Transformer| [paper](https://arxiv.org/pdf/2106.05968.pdf) | arXiv|  Samsung AI Cambridge |11 Jun 2021| \r\n|11| OCVT | Generative Video Transformer: Can Objects be the Words? 
| [paper](https://arxiv.org/abs/2107.09240) | ICML 2021 |Rutgers University | 20 Jul 2021|\r\n|12|-|An Image is Worth 16x16 Words, What is a Video Worth?| [paper](https://arxiv.org/pdf/2103.13915.pdf) [code](https://github.com/Alibaba-MIIL/STAM) | arXiv | Alibaba |27 May 2021|\r\n|13| SCT| Shifted Chunk Transformer for Spatio-Temporal Representational Learning | [paper](https://arxiv.org/pdf/2108.11575.pdf) | arXiv | Kuaishou Technology | 26 Aug 2021|\r\n|14| -| Evaluating Transformers for Lightweight Action Recognition | [paper](https://arxiv.org/pdf/2111.09641.pdf) | arXiv | University of Sheffield | 18 Nov 2021|\r\n|15| DualFormer| DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition | [paper](https://arxiv.org/pdf/2112.04674v1.pdf) | arXiv |Sea AI Lab | 9 Dec 2021|\r\n|16| BEVT| BEVT: BERT Pretraining of Video Transformers | [paper](https://arxiv.org/pdf/2112.01529.pdf) | arXiv | Shanghai Key Lab of Intelligent Information Processing | 2 Dec 2021|\r\n|17|-| Efficient Video Transformers with Spatial-Temporal Token Selection|[paper](https://arxiv.org/pdf/2111.11591.pdf)| arXiv | Shanghai Key Lab of Intelligent Information Processing | 23 Nov 2021|\r\n|18| -| Lite Vision Transformer with Enhanced Self-Attention| [paper](https://arxiv.org/pdf/2112.10809.pdf) [code](https://github.com/Chenglin-Yang/LVT) | arXiv | Johns Hopkins University | 20 Dec 2021|\r\n|19|MViT| Multiscale Vision Transformers| [paper](https://arxiv.org/pdf/2104.11227.pdf) [code](https://github.com/facebookresearch/SlowFast)| ICCV 2021 | Facebook| 22 Apr 2021|\r\n|20| UniFormer| UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning| [paper](https://openreview.net/pdf?id=nBU_u6DLvoK) [code](https://github.com/sense-x/uniformer) | arXiv | Chinese Academy of Sciences|12 Jan 2022|\r\n|21|MaskFeat| Masked Feature Prediction for Self-Supervised Visual Pre-Training| [paper](https://arxiv.org/pdf/2112.09133v1.pdf)| arXiv | Facebook AI |16 Dec 2021|\r\n|22|MTV| Multiview Transformers for Video Recognition| [paper](https://arxiv.org/pdf/2201.04288.pdf) |arXiv| Google | 20 Jan 2022|\r\n|23| MeMViT | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition| [paper](https://arxiv.org/pdf/2201.08383.pdf) |arXiv | Facebook AI Research | 20 Jan 2022|\r\n\r\n"
  },
  {
    "path": "vision_model_compression.md",
    "content": "# Compressed Transformer\n\n|No.  |Model Name |Title |Links |Pub. | Organization| Release Time |\n|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|\n|1| VTP |Vision Transformer Pruning |[paper](https://arxiv.org/pdf/2104.08500.pdf) |__KDD 2021 workshop__|Westlake University|14 Aug 2021|\n|2| IA-RED2 | IA-RED2 : Interpretability-Aware Redundancy Reduction for Vision Transformers | [paper](https://proceedings.neurips.cc/paper/2021/hash/d072677d210ac4c03ba046120f0802ec-Abstract.html) [code](http://people.csail.mit.edu/bpan/ia-red/) | __NeurIPS 2021__ | MIT| 23 Jun 2021|\n|3| DynamicViT| DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | [paper](https://arxiv.org/pdf/2106.02034.pdf) [code](https://github.com/raoyongming/DynamicViT) | | __NeurIPS 2021__| Tsinghua University| 26 Oct 2021|\n|4|  Evo-ViT| Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer| [paper](https://arxiv.org/pdf/2108.01390.pdf) [code](https://github.com/YifanXu74/Evo-ViT)|__arXiv__|Chinese Academy of Sciences |6 Dec 2021|\n|5| - |Patch Slimming for Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2106.02852.pdf) |__arXiv__| Peking University|5 Jun 2021|\n|6|-| Chasing Sparsity in Vision Transformers: An End-to-End Exploration| [paper](https://arxiv.org/pdf/2106.04533.pdf) [code](https://github.com/VITA-Group/SViTE) | __arXiv__| University of Texas at Austin| 22 Oct 2021|\n|7|DeIT| Training data-efficient image transformers & distillation through attention | [paper](https://arxiv.org/pdf/2012.12877.pdf) | __ICML 2021__|Facebook | 15 Jan 2021|\n|8| -|Post-Training Quantization for Vision Transformer| [paper](https://arxiv.org/abs/2106.14156) | __NeurIPS 2021__| Peking University| 27 Jun 2021|\n|9| -| Multi-Dimensional Model Compression of Vision Transformer | [paper](https://arxiv.org/pdf/2201.00043.pdf) | __arXiv__| Princeton University |31 Dec 2021|\n|10|-| Patch Slimming for Efficient Vision Transformers|[paper](https://arxiv.org/pdf/2106.02852.pdf) | __arXiv__ |Peking University|5 Jun 2021|\n|11|-| Chasing Sparsity in Vision Transformers: An End-to-End Exploration| [paper](https://arxiv.org/pdf/2106.04533.pdf) [code](https://github.com/VITA-Group/SViTE)| NeurIPS 2021 | University of Texas at Austin|22 Oct 2021|\n"
  }
]