Repository: junchen14/Multi-Modal-Transformer
Branch: main
Commit: abae8295070c
Files: 13
Total size: 69.2 KB

Directory structure:

Multi-Modal-Transformer/
├── MultiModal-CVPR2021.md
├── NLP-transformer.md
├── README.md
├── Self-supervised_learning.md
├── datasets.md
├── efficiency-transformer.md
├── image-language-transformer.md
├── image-transformer.md
├── other_interesting_paper.md
├── paper-review.md
├── video-language-transformer.md
├── video-transformer.md
└── vision_model_compression.md

================================================
FILE CONTENTS
================================================

================================================
FILE: MultiModal-CVPR2021.md
================================================

## Multi-modal learning papers in CVPR 2021

Navigation of [CVPR 2021 papers](https://blog.kitware.com/demos/cvpr-2021-papers/)

### Text-to-Image Generation

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|XMC-GAN| Cross-Modal Contrastive Learning for Text-to-Image Generation | [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_Cross-Modal_Contrastive_Learning_for_Text-to-Image_Generation_CVPR_2021_paper.html)| CVPR 2021 | Google Research|

### Autonomous Driving

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|MVDNet|Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Qian_Robust_Multimodal_Vehicle_Detection_in_Foggy_Weather_Using_Complementary_Lidar_CVPR_2021_paper.pdf) [code](https://github.com/qiank10/MVDNet)| CVPR 2021 | University of California San Diego |
|1|-| Multi-Modal Fusion Transformer for End-to-End Autonomous Driving | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Prakash_Multi-Modal_Fusion_Transformer_for_End-to-End_Autonomous_Driving_CVPR_2021_paper.pdf) | CVPR 2021 | Max Planck Institute for Intelligent Systems|

### Navigation

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|SSM | Structured Scene Memory for Vision-Language Navigation| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Structured_Scene_Memory_for_Vision-Language_Navigation_CVPR_2021_paper.pdf) | CVPR 2021 | Beijing Institute of Technology |

### OCR

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|-|Semantic-Aware Video Text Detection |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Feng_Semantic-Aware_Video_Text_Detection_CVPR_2021_paper.pdf) | CVPR 2021 | National Laboratory of Pattern Recognition |
|1| TRBA | What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Baek_What_if_We_Only_Use_Real_Datasets_for_Scene_Text_CVPR_2021_paper.pdf) [code](https://github.com/ku21fan/STR-Fewer-Labels) | CVPR 2021 | The University of Tokyo|
|2| Multiplexed TextSpotter | A Multiplexed Network for End-to-End, Multilingual OCR| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_A_Multiplexed_Network_for_End-to-End_Multilingual_OCR_CVPR_2021_paper.pdf)| CVPR 2021 | Facebook AI|
|3|STKM | Self-attention based Text Knowledge Mining for Text Detection | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wan_Self-Attention_Based_Text_Knowledge_Mining_for_Text_Detection_CVPR_2021_paper.pdf) | CVPR 2021 | Shenzhen University |
|4| TextOCR | TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text | - | CVPR 2021 | Facebook AI Research|

### Video Moment Retrieval

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|-| Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zeng_Multi-Modal_Relational_Graph_for_Cross-Modal_Video_Moment_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Hunan University|

### Video-Audio-Text

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|| How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Duarte_How2Sign_A_Large-Scale_Multimodal_Dataset_for_Continuous_American_Sign_Language_CVPR_2021_paper.pdf) [dataset](http://how2sign.github.io/) | CVPR 2021 | Universitat Politècnica de Catalunya|
### Image&Language

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|| Image Change Captioning by Learning from an Auxiliary Task|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf) | CVPR 2021 |University of Manitoba|
|1| UC^2 | UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhou_UC2_Universal_Cross-Lingual_Cross-Modal_Vision-and-Language_Pre-Training_CVPR_2021_paper.pdf) | CVPR 2021 | University of California, Davis|
|2|-| How Transferable are Reasoning Patterns in VQA?|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Kervadec_How_Transferable_Are_Reasoning_Patterns_in_VQA_CVPR_2021_paper.pdf) [code](https://reasoningpatterns.github.io) | CVPR 2021 |INSA Lyon|
|3|M3P | M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Ni_M3P_Learning_Universal_Representations_via_Multitask_Multilingual_Multimodal_Pre-Training_CVPR_2021_paper.pdf) | CVPR 2021 | HIT |
|4| CC12M | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Changpinyo_Conceptual_12M_Pushing_Web-Scale_Image-Text_Pre-Training_To_Recognize_Long-Tail_Visual_CVPR_2021_paper.pdf) | CVPR 2021 | Google Research|
|5| - | Separating Skills and Concepts for Novel Visual Question Answering| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Whitehead_Separating_Skills_and_Concepts_for_Novel_Visual_Question_Answering_CVPR_2021_paper.pdf) | CVPR 2021 |UIUC |
|6| VinVL | VinVL: Revisiting Visual Representations in Vision-Language Models | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.pdf) [code](https://github.com/pzzhang/VinVL) | CVPR 2021 | Microsoft |
|7| -| Domain-Robust VQA with Diverse Datasets and Methods but No Target Labels | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Domain-Robust_VQA_With_Diverse_Datasets_and_Methods_but_No_Target_CVPR_2021_paper.pdf) | CVPR 2021 | University of Pittsburgh |
|8| PCME | Probabilistic Embeddings for Cross-Modal Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Chun_Probabilistic_Embeddings_for_Cross-Modal_Retrieval_CVPR_2021_paper.pdf) [code](https://github.com/naver-ai/pcme) | CVPR 2021 | NAVER AI Lab|
|9| -| Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers |[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Miech_Thinking_Fast_and_Slow_Efficient_Text-to-Visual_Retrieval_With_Transformers_CVPR_2021_paper.pdf)| CVPR 2021 | DeepMind|
|10|TAP| TAP: Text-Aware Pre-training for Text-VQA and Text-Caption| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Yang_TAP_Text-Aware_Pre-Training_for_Text-VQA_and_Text-Caption_CVPR_2021_paper.pdf)| CVPR 2021 | University of Rochester|
|11| Causal Attention| Causal Attention for Vision-Language Tasks| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Yang_Causal_Attention_for_Vision-Language_Tasks_CVPR_2021_paper.pdf) [code](https://github.com/yangxuntu/lxmertcatt) | CVPR 2021 | Nanyang Technological University, Singapore|
|12| VirTex | VirTex: Learning Visual Representations from Textual Annotations | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.pdf) | CVPR 2021 | University of Michigan |
|13| -| Predicting Human Scanpaths in Visual Question Answering | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Predicting_Human_Scanpaths_in_Visual_Question_Answering_CVPR_2021_paper.pdf) | CVPR 2021 | University of Minnesota |
|14| Kaleido-BERT| Kaleido-BERT: Vision-Language Pre-training on Fashion Domain | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhuge_Kaleido-BERT_Vision-Language_Pre-Training_on_Fashion_Domain_CVPR_2021_paper.pdf) [code](http://dpfan.net/Kaleido-BERT) | CVPR 2021 | Alibaba Group |
|15| -| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Seeing_Out_of_the_Box_End-to-End_Pre-Training_for_Vision-Language_Representation_CVPR_2021_paper.pdf) | CVPR 2021 | University of Science and Technology Beijing|
|16| -| Learning by Planning: Language-Guided Global Image Editing|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Shi_Learning_by_Planning_Language-Guided_Global_Image_Editing_CVPR_2021_paper.pdf) [code](https://github.com/jshi31/T2ONet) | CVPR 2021 | University of Rochester|
|17| KRISP| KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Marino_KRISP_Integrating_Implicit_and_Symbolic_Knowledge_for_Open-Domain_Knowledge-Based_VQA_CVPR_2021_paper.pdf) [code](https://github.com/facebookresearch/krisp) | CVPR 2021 | Facebook AI Research |
|18| -| Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Liu_Adaptive_Cross-Modal_Prototypes_for_Cross-Domain_Visual-Language_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Peking University|
### Video&Text

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0| ClipBERT | Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lei_Less_Is_More_ClipBERT_for_Video-and-Language_Learning_via_Sparse_Sampling_CVPR_2021_paper.pdf) [code](https://github.com/jayleicn/ClipBERT) | CVPR 2021 | UNC|
|1| -| SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xu_SUTD-TrafficQA_A_Question_Answering_Benchmark_and_an_Efficient_Network_for_CVPR_2021_paper.pdf) [code](https://github.com/SUTDCV/SUTD-TrafficQA) | CVPR 2021 | Singapore University of Technology and Design |
|2| -| Open-book Video Captioning with Retrieve-Copy-Generate Network| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Open-Book_Video_Captioning_With_Retrieve-Copy-Generate_Network_CVPR_2021_paper.pdf) | CVPR 2021 | Institute of Automation, Chinese Academy of Sciences|
|3| NExT-QA| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xiao_NExT-QA_Next_Phase_of_Question-Answering_to_Explaining_Temporal_Actions_CVPR_2021_paper.pdf) [code](https://github.com/doc-doc/NExT-QA.git) | CVPR 2021 | National University of Singapore |
|4|AGQA| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Grunde-McLaughlin_AGQA_A_Benchmark_for_Compositional_Spatio-Temporal_Reasoning_CVPR_2021_paper.pdf) | CVPR 2021 | Stanford University|
|5| -| Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Park_Bridge_To_Answer_Structure-Aware_Graph_Interaction_Network_for_Video_Question_CVPR_2021_paper.pdf) | CVPR 2021 |Yonsei University, South Korea|
|6| -| Look Before you Speak: Visually Contextualized Utterances | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) | CVPR 2021 | Google Research|

### 3D Cross-Modal Retrieval

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0| -| Cross-Modal Center Loss for 3D Cross-Modal Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Jing_Cross-Modal_Center_Loss_for_3D_Cross-Modal_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | The City University of New York|

### Video-to-Text Generation

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0| Vx2Text |VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lin_Vx2Text_End-to-End_Learning_of_Video-Based_Text_Generation_From_Multimodal_Inputs_CVPR_2021_paper.pdf) | CVPR 2021 | Columbia University |

### Image-to-Video Synthesis

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|cINNs| Stochastic Image-to-Video Synthesis using cINNs|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Dorkenwald_Stochastic_Image-to-Video_Synthesis_Using_cINNs_CVPR_2021_paper.pdf) | CVPR 2021 |Heidelberg University|
|1| |Understanding Object Dynamics for Interactive Image-to-Video Synthesis|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Blattmann_Understanding_Object_Dynamics_for_Interactive_Image-to-Video_Synthesis_CVPR_2021_paper.pdf) [code](https://bit.ly/3cxfA2L) | CVPR 2021 | Heidelberg University|

### Audio&Visual

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|-| Can audio-visual integration strengthen robustness under multimodal attacks?|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Tian_Can_Audio-Visual_Integration_Strengthen_Robustness_Under_Multimodal_Attacks_CVPR_2021_paper.pdf) | CVPR 2021 |University of Rochester|
|1| -| Audio-Visual Instance Discrimination with Cross-Modal Agreement| [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Morgado_Audio-Visual_Instance_Discrimination_with_Cross-Modal_Agreement_CVPR_2021_paper.pdf) | CVPR 2021 | UC San Diego|
|2| -|VISUALVOICE: Audio-Visual Speech Separation with Cross-Modal Consistency | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Gao_VisualVoice_Audio-Visual_Speech_Separation_With_Cross-Modal_Consistency_CVPR_2021_paper.pdf) [code](http://vision.cs.utexas.edu/projects/VisualVoice/) | CVPR 2021 | The University of Texas at Austin|

### Language-Guided Video Actor Segmentation

|No. |Model Name |Title |Links |Pub. | Organization|
|-----|:-----:|:-----:|:-----:|:--------:|:---:|
|0|-| Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Hui_Collaborative_Spatial-Temporal_Modeling_for_Language-Queried_Video_Actor_Segmentation_CVPR_2021_paper.pdf) | CVPR 2021 |Chinese Academy of Sciences|

================================================
FILE: NLP-transformer.md
================================================

# Natural Language Processing Transformer
|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|BERT|BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |[paper](https://arxiv.org/abs/1810.04805) [code](https://github.com/google-research/bert) |__NAACL 2019__|Google|Oct 2018|
|2|GPT3|Language Models are Few-Shot Learners|[paper](https://arxiv.org/abs/2005.14165) | __NeurIPS 2020__ | OpenAI | May 2020|
|3|GPT2|Language Models are Unsupervised Multitask Learners |[paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) [code](https://github.com/openai/gpt-2)|__arXiv__ | OpenAI | Feb 2019|
|4| RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Approach | [paper](https://arxiv.org/abs/1907.11692) | __arXiv__ |Facebook AI | Jul 2019|
|5| XLNet |XLNet: Generalized Autoregressive Pretraining for Language Understanding| [paper](https://arxiv.org/abs/1906.08237) [code](https://github.com/zihangdai/xlnet) |__NeurIPS 2019__| Google | Jun 2019|

================================================
FILE: README.md
================================================

# Reading list on Transformers

This repo aims to collect recent popular Transformer papers, code, and learning resources with respect to the domains of **Vision Transformer**, **NLP** and **multi-modal**, etc.

### Topics (paper and code)

- [Image Transformer](image-transformer.md)
- [Video Transformer](video-transformer.md)
- [Video & Language & other modality Transformer](video-language-transformer.md)
- [Image & Language & other modality Transformer](image-language-transformer.md)
- [Natural Language Processing Transformer](NLP-transformer.md)
- [Efficient Transformer](efficiency-transformer.md)
- [Model compression](vision_model_compression.md)
- [Self-Supervised Learning in Vision](Self-supervised_learning.md)
<!-- - [MLP for Image Classification](MLP-mixer.md) -->
- [Other interesting papers in related domains](other_interesting_paper.md)

Review papers in multi-modal

- [Video-language](paper-review.md)

### Tutorials and workshops

- [Cross-View and Cross-Modal Visual Geo-Localization: IEEE CVPR 2021 Tutorial](https://youtube.com/playlist?list=PLUgbVHjDharjTo9tk3xcPJHEkmi33ap-u)
- [From VQA to VLN: Recent Advances in Vision-and-Language Research: IEEE CVPR 2021 Tutorial](https://youtube.com/playlist?list=PLUgbVHjDhari645g1zmpo-MtOVap1FKxh)
- [Tutorial on MultiModal Machine Learning: IEEE CVPR 2022 Tutorial](https://cmu-multicomp-lab.github.io/mmml-tutorial/cvpr2022/)

### Datasets

- [Multi-modal Datasets](datasets.md)

### Blogs

- [Lil's blogs](https://lilianweng.github.io/lil-log/)

### Tools

- [PyTorchVideo](https://pytorchvideo.org/) a deep learning library for video understanding research
- [horovod](https://github.com/horovod/horovod) a tool for multi-gpu parallel processing
- [accelerate](https://huggingface.co/docs/accelerate/) an easy API for mixed precision and any kind of distributed computing (minimal usage sketch after this list)
- [hyperparameter search: optuna](https://optuna.org/) (minimal usage sketch after this list)
- [AI Conference Deadlines](https://aideadlin.es/)
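To make the last two tools concrete, below are two minimal sketches (not from this repo); the toy model, data, and search space are hypothetical placeholders, not recommendations.

First, a plain PyTorch loop ported to `accelerate`, assuming `pip install torch accelerate`; device placement, multi-GPU launching, and mixed precision come from the `accelerate config` / `accelerate launch` setup rather than the code itself:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # reads the config produced by `accelerate config`
model = torch.nn.Linear(16, 1)  # hypothetical toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=32)

# prepare() moves model/optimizer/dataloader to the right device(s)
# and wraps them for distributed runs
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); handles grad scaling under mixed precision
    optimizer.step()
```

Second, a toy `optuna` search, where the objective is a stand-in for a real validation metric:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # log-uniform search space
    return (lr - 1e-3) ** 2  # pretend validation loss to minimize

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```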
================================================
FILE: Self-supervised_learning.md
================================================

# Self-supervised learning in vision

This page collects papers related to self-supervised learning in vision domains.

Self-supervised Learning for Images

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|iGPT | Generative Pretraining from Pixels |[paper](http://proceedings.mlr.press/v119/chen20s/chen20s.pdf) [code](https://github.com/openai/image-gpt) |__ICML 2020__|OpenAI|17 June 2020|
|2| MST | MST: Masked Self-Supervised Transformer for Visual Representation | [paper](https://arxiv.org/pdf/2106.05656.pdf) | __NeurIPS 2021__|Chinese Academy of Sciences| 10 June 2021|
|3|BEiT| BEiT: BERT Pre-Training of Image Transformers| [paper](https://arxiv.org/abs/2106.08254) [code](https://github.com/microsoft/unilm/tree/master/beit) | __ICLR 2022__|Microsoft Research| 15 June 2021|
|4| MAE | Masked Autoencoders Are Scalable Vision Learners| [paper](https://arxiv.org/pdf/2111.06377.pdf) [code](https://github.com/facebookresearch/mae)| CVPR 2022| Meta | 19 Dec 2021|
|5| iBOT | iBOT: Image BERT Pre-Training with Online Tokenizer| [paper](https://arxiv.org/pdf/2111.07832.pdf) [code](https://github.com/bytedance/ibot) | ICLR 2022 | ByteDance |15 Nov 2021|
|6| SimMIM| SimMIM: A Simple Framework for Masked Image Modeling | [paper](https://arxiv.org/pdf/2111.09886.pdf) [code](https://github.com/microsoft/SimMIM) | arXiv| MSRA| 18 Nov 2021|
|7| PeCo | PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers | [paper](https://arxiv.org/pdf/2111.12710.pdf) |arXiv| University of Science and Technology of China | 24 Nov 2021|
|8| MaskFeat | Masked Feature Prediction for Self-Supervised Visual Pre-Training | [paper](https://arxiv.org/pdf/2112.09133.pdf) | arXiv | Meta | 16 Dec 2021|
|9| SplitMask | Are Large-scale Datasets Necessary for Self-Supervised Pre-training? | [paper](https://arxiv.org/pdf/2112.10740.pdf) | arXiv | Meta | 20 Dec 2021 |
|10| ADIOS | Adversarial Masking for Self-Supervised Learning| [paper](https://arxiv.org/pdf/2201.13100.pdf) | ICML 2022 | University of Oxford | 31 Jan 2022|
|11| CAE | Context Autoencoder for Self-Supervised Representation Learning | [paper](https://arxiv.org/pdf/2202.03026.pdf) | arXiv | Peking University | 7 Feb 2022 |
|12| CIM| Corrupted Image Modeling for Self-Supervised Visual Pre-Training| [paper](https://arxiv.org/pdf/2202.03382.pdf) [code](https://github.com/microsoft/unilm) | arXiv | Microsoft | 7 Feb 2022|
|13| ConvMAE | ConvMAE: Masked Convolution Meets Masked Autoencoders |[paper](https://arxiv.org/pdf/2205.03892.pdf) [code](https://github.com/Alpha-VL/ConvMAE) | arXiv | Shanghai AI Laboratory | 19 May 2022 |
|14| Uniform Masking | Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality | [paper](https://arxiv.org/pdf/2205.10063.pdf) [code](https://github.com/implus/UM-MAE) | arXiv | Nanjing University of Science and Technology | 20 May 2022|
|15| LoMaR | Efficient Self-supervised Learning with Local Masked Reconstruction | [paper](https://arxiv.org/pdf/2206.00790.pdf) [code](https://github.com/junchen14/LoMaR) | arXiv| KAUST | 1 Jun 2022 |
|16| M3AE | Multimodal Masked Autoencoders Learn Transferable Representations | [paper](https://arxiv.org/pdf/2205.14204.pdf) | arXiv | UCB | 31 May 2022|
|17| HiViT| HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling | [paper](https://arxiv.org/pdf/2205.14949.pdf) | arXiv | University of Chinese Academy of Sciences | 30 May 2022 |
|18| GreenMIM | Green Hierarchical Vision Transformer for Masked Image Modeling | [paper](https://arxiv.org/pdf/2205.13515v1.pdf) [code](https://github.com/LayneH/GreenMIM) | arXiv | The University of Tokyo | 26 May 2022 |
|19| A^2MIM |Architecture-Agnostic Masked Image Modeling – From ViT back to CNN | [paper](https://arxiv.org/pdf/2205.13943.pdf) | arXiv | AI Lab, Westlake University | 1 Jun 2022|
|20| MixMIM | MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning | [paper](https://arxiv.org/pdf/2205.13137.pdf) [code](https://github.com/Sense-X/MixMIM) | arXiv | SenseTime Research | 28 May 2022 |
|21| SemMAE |SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders | [paper](https://arxiv.org/pdf/2206.10207.pdf) | arXiv | Chinese Academy of Sciences| 21 Jun 2022|
|22| Voxel-MAE | Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds | [paper](https://arxiv.org/pdf/2206.09900.pdf) [code](https://github.com/chaytonmin/Voxel-MAE) | arXiv | Peking University | 20 Jun 2022|
|23| BootMAE |Bootstrapped Masked Autoencoders for Vision BERT Pretraining| [paper](https://arxiv.org/pdf/2207.07116.pdf) [code](https://github.com/LightDXY/BootMAE) | ECCV 2022 | University of Science and Technology of China | 14 Jul 2022|
|24| OmniMAE | OmniMAE: Single Model Masked Pretraining on Images and Videos| [paper](https://arxiv.org/pdf/2206.08356.pdf) [code](https://github.com/facebookresearch/omnivore) | arXiv | Meta AI | 16 Jun 2022|
|25| SatMAE| SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery| [paper](https://arxiv.org/pdf/2207.08051.pdf) | arXiv | Stanford University | 17 Jul 2022 |
|26| CMAE | Contrastive Masked Autoencoders are Stronger Vision Learners | [paper](https://arxiv.org/abs/2207.13532) | arXiv | University of Science and Technology | 27 Jul 2022 |
|27| BEiT v2 | BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | [paper](https://arxiv.org/pdf/2208.06366.pdf) | arXiv| University of Chinese Academy of Sciences | 12 Aug 2022|
|28| BEiT v3| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | [paper](https://arxiv.org/abs/2208.10442) | arXiv | Microsoft Corporation | 22 Aug 2022 |

Self-supervised Learning for Videos

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| VideoMAE| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | [paper](https://arxiv.org/abs/2203.12602) [code](https://github.com/MCG-NJU/VideoMAE) | arXiv | Tencent AI Lab | 23 Mar 2022 |
|2|MAE in Video| Masked Autoencoders As Spatiotemporal Learners | [paper](https://arxiv.org/pdf/2205.09113.pdf) | arXiv | Meta | 18 May 2022 |

Self-supervised Learning for Audios

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| AudioMAE| Masked Autoencoders that Listen | [paper](https://arxiv.org/pdf/2207.06405v1.pdf) [code](https://github.com/facebookresearch/AudioMAE) | arXiv | Meta AI | 13 Jul 2022|

Survey in self-supervised learning
|No. |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:--------:|:---:|:-------:|
|1|A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond | [paper](https://arxiv.org/pdf/2208.00173.pdf) |arXiv| KAIST| 30 Jul 2022|

================================================
FILE: datasets.md
================================================

# Common multimodal datasets

## Image Datasets

[COCO](https://cocodataset.org/#home)\
[Conceptual 3M](https://ai.google.com/research/ConceptualCaptions/)\
[Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m)

## Video & Language Datasets

|Dataset |paper| Clips |Captions |Videos |Duration | Source| Year | Tasks| collection method|
|-----|:-----:|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|:-------:|:-------:|
|[Charades](https://prior.allenai.org/projects/charades) | [paper](https://openreview.net/forum?id=rJW3ItWubH)|10K | 16K |10,000 | 82h|daily household videos|2016| action recognition & captioning| AMT|
|[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |[paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf) | 10k| 200k| 7,180| 40h| web-crawled videos with 257 queries |2016| retrieval and captioning | AMT|
|[Didemo](https://github.com/LisaAnne/LocalizingMoments)| [paper](https://arxiv.org/pdf/1708.01641.pdf) | 27k| 41k| 10,464| 87h| randomly selected over 14,000 videos from YFCC100M| 2017| Moment localization| crowdsourcing|
|[M-VAD](https://github.com/aimagelab/mvad-names-dataset) | [paper](https://arxiv.org/pdf/1503.01070.pdf) |49k| 56k| 92| 84h| DVD movies| 2015| retrieval |crowdsourcing|
|[MPII-MD](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/mpii-movie-description-dataset) | [paper](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Rohrbach_A_Dataset_for_2015_CVPR_paper.pdf)| 69k| 68k| 94| 41h|Web Movies| 2015| captioning| crowdsourcing |
|[ActivityNet](http://activity-net.org/)| [paper](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Heilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf)| 100k | 100k | 20,000 | 849h| online human activities videos| 2017|captioning & retrieval| AMT |
|[TGIF](http://raingo.github.io/TGIF-Release/) | [paper](https://arxiv.org/pdf/1604.02748.pdf)| 69k| 68k| 94| 41h| a year's worth of GIF posts from Tumblr| 2015| captioning| CrowdFlower|
|[YouCook2](http://youcook2.eecs.umich.edu/download) |[paper](http://youcook2.eecs.umich.edu/static/YouCookII/youcookii_readme.pdf) |14k| 14k| 2,000| 176h| online cooking videos| 2018| retrieval & captioning| well-trained native English speakers |
|[LSMDC](https://sites.google.com/site/describingmovies/download) |[paper](https://arxiv.org/pdf/1605.03705.pdf) | 128k| 128k| 200| 150h| combination of the M-VAD and MPII-MD datasets |2017 | captioning| /|
|[HowTo100M](https://github.com/antoine77340/howto100m) | [paper](https://arxiv.org/pdf/1906.03327.pdf)| 136M| 136M| 1.221M| 134,472h| large-scale online videos| 2019| action step localization & retrieval | ASR|
|[Kinetics-700](https://deepmind.com/research/open-source/kinetics) |[paper](https://arxiv.org/abs/1907.06987)| 650K| /| 650K| /| an extension of the Kinetics-600 dataset |2019| action recognition| /|
|[AVA-Kinetics](https://deepmind.com/research/open-source/kinetics) |[paper](https://arxiv.org/abs/2005.00214) | 230K| /| 230K| /| combines the annotation style of AVA and the Kinetics dataset| 2020| action recognition|/ |
|[HACS](http://hacs.csail.mit.edu/) |[paper](https://arxiv.org/abs/1712.09374) | 1.5M| /| 504K| /| large-scale human action localization dataset| 2019| action recognition & captioning| crowdsourcing|
|[Tiny-Virat](https://github.com/UgurDemir/Tiny-VIRAT) |[paper](https://arxiv.org/abs/2007.07355) | 13K| /| 13K| /| low-resolution action recognition dataset (surveillance videos) |2020| action recognition| /|
|Action Genome |[paper](https://arxiv.org/abs/1912.06992) | 234K| /| 234K| /| video scene graph| 2020| action recognition & representations encoding event partonomies| crowdsourcing|
|[SoccerNet](https://silviogiancola.github.io/SoccerNet) |[paper](https://arxiv.org/pdf/1804.04527.pdf) | 650K| /| 650K| 764h| European Football League video| 2018| event classification in football game video| transformed from data on league websites|
|[ActivityNet Entities](http://t.cn/EfePohM) |[paper](https://arxiv.org/abs/1812.06587) | 650K| /| 650K| /| grounds visual entities to objects in ActivityNet videos| 2018| video understanding & action recognition| crowdsourcing|
|[VidSitu](https://vidsitu.org/) |[paper](https://arxiv.org/abs/2104.00990) | 136K| /| 29K| /| the events and related roles in movies | 2021| semantic role and co-referencing prediction| AMT|
|[VATEX](https://eric-xw.github.io/vatex-website/) | [paper](https://arxiv.org/abs/1904.03493)| 41.3k| 826k| 41.3k| 114h38m| human behavior videos from YouTube| 2019| action recognition & captioning| /|
|[MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) | [paper](https://aclanthology.org/P11-1020.pdf)| 2k| 70k| 2k| 4h55m| web videos| 2011| video captioning| AMT |
|[MovieNet](http://movienet.site/) | [paper](https://arxiv.org/abs/2007.10937)| 420k| 25k| 420k| /| Web Movies| 2020| Genre classification & cinematic style analysis & character recognition & scene analysis & story understanding| crowdsourcing|
|[MovieGraphs](http://moviegraphs.cs.toronto.edu/) | [paper](http://moviegraphs.cs.toronto.edu/)| 7.6k| 70k| 51| 150h| scene graph representation of movies| 2018| description retrieval & dialog retrieval & Movie Clip Retrieval | crowdsourcing|
|[QVHIGHLIGHTS](https://github.com/jayleicn/moment_detr) | [paper](https://arxiv.org/pdf/2107.09609.pdf) | 10.3k| 10.2k| 10.3k| / | daily or travel vlog and news| 2021| moment retrieval & highlight detection| AMT|
|[UCF101](https://www.crcv.ucf.edu/research/data-sets/ucf101/) | [paper](https://www.crcv.ucf.edu/wp-content/uploads/2019/03/UCF101_CRCV-TR-12-01.pdf) | 13.3k| /| 13.3k| 1600m | user-uploaded videos| 2012| action recognition| crowdsourcing |
|[HMDB51](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset) | [paper](http://serre-lab.clps.brown.edu/wp-content/uploads/2012/08/Kuehne_etal_ICCV2011.pdf) | 7K| /| 7K| /| action videos from Youtube/Google| 2011| action recognition & captioning| crowdsourcing|
|[Moments-in-Time](http://moments.csail.mit.edu/) | [paper](https://arxiv.org/abs/1801.03150) | 1M| /| 1M| /| edited videos from YouTube, Flickr, Vine, Metacafe and other sources| 2017| action & event recognition| AMT|
|[AVA](https://github.com/cvdfoundation/ava-dataset) | [paper](https://arxiv.org/abs/1705.08421) | 57.6K| 300k| 57.6K| / | Web Movies with human bounding boxes| 2017| atomic visual action recognition| crowdsourcing|
|[HVU](https://holistic-video-understanding.github.io/) | [paper](https://arxiv.org/abs/1904.11451) | 57.2K| 9M| 57.2K| / | Youtube| 2020| multi-label and multi-task video understanding| semi-automatic crowdsourcing strategy |
|[Oops!](https://github.com/DmZhukov/CrossTask) | [paper](https://arxiv.org/abs/1911.11206) | 20K| / | 20K| / | in-the-wild videos of unintentional action| 2019| unintentional action recognition| AMT|
|[CrossTask](https://github.com/DmZhukov/CrossTask) | [paper](https://arxiv.org/pdf/1903.08225.pdf) | 4.7K| / | 4.7K| /| weakly supervised learning from instructional videos| 2019| video classification| crowdsourcing|
|[COIN](https://coin-dataset.github.io/) | [paper](https://arxiv.org/pdf/1903.02874.pdf) | 11.8K | /| 11.8K| /| comprehensive instructional video analysis | 2019| step localization & action recognition| crowdsourcing|
|[Sports-1M](https://cs.stanford.edu/people/karpathy/deepvideo/) | [paper](http://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf) | 1.1M| /| 1.1M| /| sports videos from Youtube | 2014| video classification| crowdsourcing labeled with a taxonomy|
|[20BN-SOMETHING-SOMETHING](https://20bn.com/datasets/something-something) | [paper](https://arxiv.org/abs/1706.04261) | 220K| 318K| 220K| /| shows humans performing pre-defined basic actions with everyday objects| 2017| action recognition| AMT|
|[DALY](http://thoth.inrialpes.fr/daly/) | [paper](https://arxiv.org/pdf/1605.05197.pdf) | 8.1K| / | 8.1K| /| Daily Action Localization in YouTube| 2016| video classification| crowdsourcing|
|[FineGym](https://sdolivia.github.io/FineGym/) | [paper](https://arxiv.org/abs/2004.06704) | 8.1K| / | 8.1K| /| gymnastic videos with temporal actions and sub-actions| 2020| video action recognition & detection & generation| crowdsourcing|
|[MultiSports](https://deeperaction.github.io/multisports/) | [paper](https://arxiv.org/abs/2105.07404) | 3.2K| / | 3.2K| /| high-resolution competition videos held in recent years| 2021| spatio-temporal action detection| /|
|“Wildlife Action” | [paper](https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Weining_L_Report.pdf) | 10.6K| / | 10.6K| /| downloaded from YouTube| 2020| animal action recognition| YouTube’s Data API|
|“Action Recognition of Large Animals” | [paper](https://ieeexplore.ieee.org/document/8634672) | /| / | /| /| downloaded from YouTube| 2018| animal action recognition| YouTube’s Data API|
|“First-Person Animal Action” | [paper](http://robotics.ait.kyushu-u.ac.jp/~yumi/db/papers/2014_ICPR_Final.pdf) | /| / | /| /| collected by a dog wearing a GoPro-size camera| 2014| first-person animal activity recognition| /|
|[AnimalWeb](https://vcla.stat.ucla.edu/people/zhangzhang-si/HiT/exp5.html) | [paper](https://arxiv.org/pdf/1909.04951.pdf) | /| / | /| /| collected by a dog wearing a GoPro-size camera| 2014| first-person animal activity recognition| /|

## Video Datasets

|Dataset |Videos |Duration | Source| Year |
|-----|:-----:|:--------:|:---:|:-------:|
|[Youtube8M](https://research.google.com/youtube8m/index.html) | 6M|350,000h|YouTube| 2018|
|[FineAction](https://deeperaction.github.io/fineaction/) |16,732 | -| YouTube | 24 May 2021|
|[VideoLT](https://videolt.github.io/) | 256,218 | 819,898 | YouTube| 6 May 2021|

## Dataset collection tools

[voxel](https://voxel51.com/)\
[amazon turkers](https://www.mturk.com/)\
[shaip](https://www.shaip.com/)
================================================
FILE: efficiency-transformer.md
================================================

# Efficient Transformer

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|HiBERT |HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization |[paper](https://arxiv.org/pdf/1905.06566.pdf) |__ACL 2019__|Microsoft Research Asia|16 May 2019|
|2|Star Transformer|Star Transformer |[paper](https://www.aclweb.org/anthology/N19-1133.pdf) |__NAACL 2019__|Shanghai Key Laboratory of Intelligent Information Processing, Fudan University|25 Feb 2019|
|3|ETC |ETC: Encoding Long and Structured Inputs in Transformers |[paper](https://www.aclweb.org/anthology/2020.emnlp-main.19.pdf) |__EMNLP 2020__|Google AI|16 November 2020|
|4|BP-Transformer |BP-Transformer: Modelling Long-Range Context via Binary Partitioning |[paper](https://arxiv.org/pdf/1911.04070.pdf) [code](https://github.com/yzh119/BPT)|__arXiv__|AWS Shanghai AI Lab|11 November 2019|
|5|Routing Transformer |Efficient Content-Based Sparse Attention with Routing Transformers |[paper](https://openreview.net/forum?id=B1gjs6EtDr) [code](https://github.com/lucidrains/routing-transformer)|__ICLR 2020__|Google AI|1 February 2021|
|7|Compressive Transformer |Compressive Transformers for Long-Range Sequence Modelling |[paper](https://openreview.net/pdf?id=SylKikSYDH) [code](https://github.com/lucidrains/compressive-transformer-pytorch)|__ICLR 2020__|DeepMind|25 Sep 2019|
|8|Transformer-XL |Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context |[paper](https://arxiv.org/abs/1901.02860) [code](https://github.com/kimiyoung/transformer-xl)|__ACL 2019__|CMU|9 Jan 2019|
|9|Big Bird |Big Bird: Transformers for Longer Sequences |[paper](https://arxiv.org/abs/2007.14062) [code](https://github.com/google-research/bigbird)|__NeurIPS 2020__|Google Research|8 Jan 2021|
|10|Adaptive-Span |Adaptive Attention Span in Transformers |[paper](https://arxiv.org/pdf/1905.07799.pdf) [code](https://github.com/facebookresearch/adaptive-span)|__ACL 2019__|Facebook AI|19 May 2019|
|11|Reformer |Reformer: The Efficient Transformer |[paper](https://arxiv.org/abs/2001.04451) [code](https://github.com/lucidrains/reformer-pytorch)|__ICLR 2020__|Google AI|13 Jan 2020|
|12|Longformer |Longformer: The Long-Document Transformer |[paper](https://arxiv.org/abs/2004.05150) [code](https://github.com/allenai/longformer)|__ICLR 2020__|Allen Institute for Artificial Intelligence|2 Dec 2020|
|13| - | Parameter Efficient Multimodal Transformers for Video Representation Learning | [paper](https://openreview.net/forum?id=6UdQLhqJyFD) [code](https://github.com/sangho-vision/avbert) | __ICLR 2021__| Seoul National University | 8 Dec 2020|
|14| ALBERT| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | [paper](https://openreview.net/pdf?id=H1eA7AEtvS) [code](https://github.com/google-research/ALBERT) | __ICLR 2020__| Google Research | 26 Sep 2019|
|15| DEQ | Deep Equilibrium Models |[paper](https://proceedings.neurips.cc/paper/2019/file/01386bd6d8e091c2ab4c7c7de644d37b-Paper.pdf) [code](https://github.com/locuslab/deq) | __NeurIPS 2019__| CMU |3 Sep 2019|
|16| Universal Transformer | Universal Transformers | [paper](https://arxiv.org/pdf/1807.03819.pdf) [code](https://github.com/andreamad8/Universal-Transformer-Pytorch) | __ICLR 2019__| University of Amsterdam | 5 May 2019|
|17| Linear Transformer | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | [paper](https://arxiv.org/pdf/2006.16236.pdf) [code](https://linear-transformers.com/) | __ICML 2020__ | Idiap Research Institute | 31 Aug 2020|
|18| ∞-former | ∞-former: Infinite Memory Transformer | [paper](https://arxiv.org/pdf/2109.00301.pdf) | __arXiv__ |Pedro Henrique Martins | 1 Sep 2021|
|19| ATS| ATS: Adaptive Token Sampling For Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2111.15667.pdf) | arXiv | Microsoft|30 Nov 2021|
|20|TerViT |TerViT: An Efficient Ternary Vision Transformer |[paper](https://arxiv.org/pdf/2201.08050.pdf) | arXiv| Beihang University| 20 Jan 2022|
|21| Lite Transformer| Lite Transformer with Long-Short Range Attention| [paper](https://arxiv.org/pdf/2004.11886.pdf) [code](https://github.com/mit-han-lab/lite-transformer) | __ICLR 2020__ | MIT | 24 Apr 2020|
|22|UVC| Unified Visual Transformer Compression |[paper](https://github.com/VITA-Group/UVC) [code](https://github.com/VITA-Group/UVC) |__ICLR 2022__ |University of Texas at Austin| 29 Sept 2021|
|23|MobileViT| MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformers | - | __ICLR 2022__| Apple | 5 Oct 2021|
================================================
FILE: image-language-transformer.md
================================================

# Image & Language (Retrieval & Captioning & Image Generation)

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|VisualGPT |VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning |[paper](https://arxiv.org/abs/2102.10407) [code](https://github.com/Vision-CAIR/VisualGPT) |__arXiv 2021__|KAUST|20 Feb 2021|
|2|Kaleido-BERT |Kaleido-BERT: Vision-Language Pre-training on Fashion Domain |[paper](https://arxiv.org/pdf/2103.16110.pdf) [code](https://github.com/mczhuge/Kaleido-BERT/) |__CVPR 2021__|Alibaba Group|15 April 2021|
|3|CLIPBERT |Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | [paper](https://arxiv.org/pdf/2102.06183.pdf) [code](https://github.com/jayleicn/ClipBERT) |__CVPR 2021__| UNC | 11 Feb 2021|
|4| -|Probabilistic Embeddings for Cross-Modal Retrieval| [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chun_Probabilistic_Embeddings_for_Cross-Modal_Retrieval_CVPR_2021_paper.html) [github](https://github.com/naver-ai/pcme) | __CVPR 2021__ | NAVER Lab| 14 June 2021|
|5| -| Scaling Up Vision-Language Representation Learning With Noisy Text Supervision | [paper](https://arxiv.org/pdf/2102.05918.pdf) | ICML 2021| Google | 11 June 2021|
|6|-|Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training| [paper](https://arxiv.org/pdf/2106.13488.pdf) | arXiv| MSRA| 28 June 2021|
|7| CogView| CogView: Mastering Text-to-Image Generation via Transformers | [paper](https://arxiv.org/pdf/2105.13290.pdf) [code](https://github.com/THUDM/CogView) | arXiv | Tsinghua University | 28 May 2021|
|8|ViLT| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision| [paper](https://arxiv.org/pdf/2102.03334.pdf) [code](https://github.com/dandelin/vilt) | ICML 2021 | NAVER AI Lab| 10 Jun 2021|
|9| - |Unifying Vision-and-Language Tasks via Text Generation | [paper](https://arxiv.org/pdf/2102.02779.pdf) [code](https://github.com/j-min/VL-T5) | ICML 2021 | UNC | 23 May 2021|
|10| Pixel-BERT | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | [paper](https://arxiv.org/pdf/2004.00849.pdf) | arXiv | University of Science and Technology Beijing | 22 Jun 2020 |
|11| -| How Much Can CLIP Benefit Vision-and-Language Tasks?| [paper](https://arxiv.org/pdf/2107.06383.pdf)| arXiv| UCB | 13 Jul 2021|
|12| LXMERT |LXMERT: Learning Cross-Modality Encoder Representations from Transformers| [paper](https://arxiv.org/abs/1908.07490) [code](https://github.com/airsplay/lxmert)| EMNLP 2019| UNC Chapel Hill | 3 Dec 2019|
|13| ViLBERT | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks| [paper](https://arxiv.org/abs/1908.02265) [code](https://github.com/jiasenlu/vilbert_beta)| NeurIPS 2019| Georgia Institute of Technology | 6 Aug 2019|
|14| ImageBERT | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | [paper](https://arxiv.org/abs/2001.07966) | arXiv | Bing, Microsoft|23 Jan 2020|
|15| Unicoder-VL | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | [paper](https://arxiv.org/pdf/1908.06066v3.pdf) | AAAI 2020| MSRA | 2 Dec 2019|
|16| VLP | Unified Vision-Language Pre-Training for Image Captioning and VQA | [paper](https://arxiv.org/pdf/1909.11059.pdf) [code](https://github.com/LuoweiZhou/VLP) | AAAI 2020| University of Michigan | 4 Dec 2019|
|17| XGPT |XGPT: Cross-modal Generative Pre-Training for Image Captioning |[paper](https://arxiv.org/pdf/2003.01473.pdf) | arXiv| Peking University | 4 Mar 2020|
|18| 12-IN-1 | 12-in-1: Multi-Task Vision and Language Representation Learning | [paper](https://arxiv.org/pdf/1912.02315.pdf) [code](https://github.com/facebookresearch/vilbert-multi-task) | CVPR 2020 | Facebook | 5 Dec 2019|
|19| FashionBERT | FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval | [paper](https://arxiv.org/abs/2005.09801)| SIGIR 2020 | Alibaba | 20 May 2020|
|20| UNITER | UNITER: UNiversal Image-TExt Representation Learning | [paper](https://arxiv.org/abs/1909.11740) [code](https://github.com/ChenRocks/UNITER) | ECCV 2020 | Microsoft Dynamics 365 AI Research| 25 Sep 2019|
|21| VisDial-BERT | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | [paper](https://arxiv.org/abs/1912.02379) [code](https://github.com/vmurahari3/visdial-bert) | ECCV 2020 | Georgia Institute of Technology | 31 Mar 2020 |
|22| OSCAR | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | [paper](https://arxiv.org/abs/2004.06165) [code](https://github.com/microsoft/Oscar) | ECCV 2020| Microsoft |13 Apr 2020|
|23| KD-VLP| KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation |[paper](https://arxiv.org/pdf/2109.10504v1.pdf) | arXiv | ShanghaiTech | 22 Sep 2021|
|24| Fast & Slow| Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers|[paper](https://arxiv.org/abs/2103.16553)| CVPR 2021 |DeepMind | 30 Mar 2021|
|25| - |Unifying Multimodal Transformer for Bi-directional Image and Text Generation | [paper](https://arxiv.org/abs/2110.09753) | arXiv | Sun Yat-sen University| 19 Oct 2021 |
|26|SOHO| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | [paper](https://arxiv.org/pdf/2104.03135.pdf) | CVPR 2021 | University of Science and Technology Beijing | 8 Apr 2021|
|27| E2E-VLP | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning| [paper](https://aclanthology.org/2021.acl-long.42.pdf) | ACL 2021| Alibaba Group | 3 June 2021|
|28| KD-VLP | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | [paper](https://arxiv.org/abs/2109.10504) |EMNLP 2021| ShanghaiTech| 22 Sep 2021|
|29| L-Verse| L-Verse: Bidirectional Generation Between Image and Text| [paper](https://arxiv.org/pdf/2111.11133.pdf) | arXiv | LG AI Research | 22 Nov 2021|
|30| NUWA| NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion| [paper](https://arxiv.org/pdf/2111.12417.pdf) | arXiv | MSRA| 24 Nov 2021|
|31| Florence| Florence: A New Foundation Model for Computer Vision | [paper](https://arxiv.org/abs/2111.11432) | arXiv | Microsoft | 22 Nov 2021|
|32| -|Distilled Dual-Encoder Model for Vision-Language Understanding| [paper](https://arxiv.org/pdf/2112.08723v1.pdf) | arXiv | Microsoft | 16 Dec 2021|
|33| FLAVA| FLAVA: A Foundational Language And Vision Alignment Model| [paper](https://arxiv.org/pdf/2112.04482.pdf) | arXiv | FAIR | 8 Dec 2021|

# Object Detection

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|MDETR| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | [paper](https://arxiv.org/pdf/2104.12763.pdf) [code](https://github.com/ashkamath/mdetr) | __ICCV 2021__|NYU |26 April 2021|
|2| pix2seq| pix2seq: A Language Modeling Framework for Object Detection| [paper](https://arxiv.org/abs/2109.10852) | arXiv | Google Research | 22 Sep 2021 |
================================================
FILE: image-transformer.md
================================================

## Image Classification

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|ViT |An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |[paper](https://arxiv.org/pdf/2010.11929.pdf) [code](https://github.com/rwightman/pytorch-image-models) |__ICLR 2021__|Google Brain|22 Oct 2020|
|2|LeViT |LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference |[paper](https://arxiv.org/abs/2104.01136) |__arXiv__|/|2 Apr 2021|
|3|Swin Transformer |Swin Transformer: Hierarchical Vision Transformer using Shifted Windows |[paper](https://arxiv.org/pdf/2103.14030.pdf) [code](https://github.com/microsoft/Swin-Transformer) |__arXiv__|MSRA|25 Mar 2021|
|4|DeiT |Training data-efficient image transformers & distillation through attention |[paper](https://arxiv.org/pdf/2012.12877.pdf) [code](https://github.com/facebookresearch/deit) |__arXiv__|Facebook AI|15 Jan 2021|
|5|Pyramid Vision Transformer |Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions|[paper](https://arxiv.org/abs/2102.12122) [code](https://github.com/whai362/PVT) |__arXiv__|Nanjing University of Science and Technology|24 Feb 2021|
|6|TNT |Transformer in Transformer|[paper](https://arxiv.org/pdf/2103.00112.pdf) [code](https://github.com/huawei-noah/noah-research/tree/master/TNT) |__arXiv__|Noah's Ark Lab|27 Feb 2021|
|7|PiT |Rethinking Spatial Dimensions of Vision Transformers|[paper](https://arxiv.org/pdf/2103.16302.pdf) [code](https://github.com/naver-ai/pit) |__arXiv__|NAVER AI Lab|30 Mar 2021|
|8|T2T-ViT |Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet|[paper](https://arxiv.org/pdf/2101.11986.pdf) [code](https://github.com/yitu-opensource/T2T-ViT) |__arXiv__| NUS|22 Mar 2021|
|9|CPVT |Conditional Positional Encodings for Vision Transformers|[paper](https://arxiv.org/pdf/2102.10882.pdf) [code](https://github.com/Meituan-AutoML/CPVT) |__arXiv__| Meituan Inc|18 Mar 2021|
|10|ViL |Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding|[paper](https://arxiv.org/pdf/2103.15358.pdf) |__arXiv__| Microsoft Corporation|29 Mar 2021|
|11|CoaT |Co-Scale Conv-Attentional Image Transformer|[paper](https://arxiv.org/abs/2104.06399) [code](https://github.com/mlpc-ucsd/CoaT) |__arXiv__| University of California San Diego|13 April 2021|
|14|pruning |Visual Transformer Pruning | [paper](https://arxiv.org/pdf/2104.08500.pdf) |__arXiv__|Zhejiang University| 17 April 2021 |
|16|M2TR| M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection | [paper](https://arxiv.org/pdf/2104.09770.pdf) | __arXiv__ | Fudan University | 21 Apr 2021|
|17|Visformer | Visformer: The Vision-friendly Transformer |[paper](https://arxiv.org/pdf/2104.12533.pdf) [code](https://github.com/danczs/Visformer) | __arXiv__ | Beihang University | 26 April 2021|
|18| ConTNet | ConTNet: Why not use convolution and transformer at the same time?| [paper](https://arxiv.org/pdf/2104.13497.pdf) [code](https://github.com/yan-hao-tian/ConTNet)|__arXiv__| ByteDance AI Lab | 27 Apr 2021 |
|19| Twins-SVT | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | [paper](https://arxiv.org/pdf/2104.13840.pdf) [code](https://github.com/Meituan-AutoML/Twins) |__arXiv__ | Meituan Inc | 28 Apr 2021|
|20|LeViT| LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference| [paper](https://arxiv.org/pdf/2104.01136.pdf) [code](https://github.com/facebookresearch/LeViT) | __arXiv__ | Facebook | 6 May 2021 |
|21|CoAtNet| CoAtNet: Marrying Convolution and Attention for All Data Sizes | [paper](https://arxiv.org/pdf/2106.04803.pdf) | __arXiv__ | Google Brain| 9 June 2021|
|22|Focal Transformer |Focal Self-attention for Local-Global Interactions in Vision Transformers |[paper](https://arxiv.org/pdf/2107.00641.pdf) | __arXiv__ | Microsoft Research at Redmond | 1 Jul 2021|
|23|BEiT| BEiT: BERT Pre-Training of Image Transformers| [paper](https://arxiv.org/pdf/2106.08254.pdf)|arXiv| Microsoft|15 Jun 2021|
|24| ViT-G| Scaling Vision Transformers| [paper](https://arxiv.org/pdf/2106.04560.pdf) | arXiv | Google Brain | 8 Jun 2021|
|25| -| Efficient Training of Visual Transformers with Small-Size Datasets | [paper](https://arxiv.org/pdf/2106.03746.pdf) | arXiv |FBK| 7 Jun 2021|
|26|PS-ViT | Vision Transformer with Progressive Sampling | [paper](https://arxiv.org/pdf/2108.01684.pdf) [code](https://github.com/yuexy/PS-ViT) | arXiv | Centre for Perceptual and Interactive Intelligence| 3 Aug 2021|
|27|-| Masked Autoencoders Are Scalable Vision Learners| [paper](https://arxiv.org/pdf/2111.06377.pdf) | arXiv | Facebook FAIR| 11 Nov 2021|
|28| Evo-ViT | Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer |[paper](https://arxiv.org/pdf/2108.01390.pdf) | AAAI 2022 | Chinese Academy of Sciences| 6 Dec 2021|
|29| ATS| ATS: Adaptive Token Sampling For Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2111.15667.pdf) | arXiv | Microsoft|30 Nov 2021|
|30| AdaViT | AdaViT: Adaptive Vision Transformers for Efficient Image Recognition| [paper](https://arxiv.org/pdf/2111.15668.pdf) | arXiv | Fudan University| 30 Nov 2021|
|31| PeCo| PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers| [paper](https://arxiv.org/pdf/2111.12710.pdf) [code](https://github.com/microsoft/PeCo) | arXiv | University of Science and Technology of China| 24 Nov 2021|
|32| DAT| Vision Transformer with Deformable Attention | [paper](https://arxiv.org/pdf/2201.00520.pdf) [code](https://github.com/LeapLabTHU/DAT) |arXiv | Tsinghua University | 3 Jan 2022|

# Visual Relationship Detection

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| RelTransformer | RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory| [paper](https://arxiv.org/pdf/2104.11934.pdf) [code](https://github.com/Vision-CAIR/RelTransformer) | __arXiv__| KAUST| 24 April 2021|

# Object Tracking

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| MOTR | MOTR: End-to-End Multiple-Object Tracking with TRansformer| [paper](https://arxiv.org/pdf/2105.03247.pdf) [code](https://github.com/megvii-model/MOTR) | __arXiv__| MEGVII Technology| 7 May 2021|
================================================
FILE: paper-review.md
================================================

# Multi-Modal Survey Paper

|No. |Topic |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|video-language |Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review |[paper](https://arxiv.org/pdf/2103.14785v1.pdf) |__arXiv__|University of Chile|27 Mar 2021|
|2| video-language pretraining| Survey: Transformer based Video-Language Pre-training | [paper](https://arxiv.org/pdf/2109.09920.pdf) | __arXiv__ | Renmin University of China | 21 Sep 2021|
================================================
FILE: video-language-transformer.md
================================================

# Video & Language Transformer

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|COOT |COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning |[paper](https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf) [code](https://github.com/gingsi/coot-videotext) |__NeurIPS 2020__|University of Freiburg|1 Nov 2020|
|2|MMT |Multi-modal Transformer for Video Retrieval |[paper](https://arxiv.org/abs/2007.10639) [code](https://github.com/gabeur/mmt) |__ECCV 2020__|Inria & Google|21 Jul 2020|
|3|HiT |HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval |[paper](https://arxiv.org/abs/2103.15049) |__arXiv__|Peking University|28 Mar 2021|
|4|ClipBERT |Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling |[paper](https://arxiv.org/pdf/2102.06183.pdf) [code](https://github.com/jayleicn/ClipBERT) |__CVPR 2021__|UNC Chapel Hill|11 Feb 2021|
|5|SVRTN |Self-supervised Video Retrieval Transformer Network |[paper](https://arxiv.org/pdf/2104.07993.pdf) |__arXiv__|Alibaba DAMO Academy|16 Apr 2021|
|6| VATT| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | [paper](https://arxiv.org/pdf/2104.11178.pdf) | __arXiv__| Google | 22 Apr 2021|
|7|Frozen in Time | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval| [paper](https://arxiv.org/pdf/2104.00650.pdf) [code](https://github.com/m-bain/frozen-in-time) | __arXiv__ | University of Oxford| 1 Apr 2021|
|8|CLIP4Clip| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | [paper](https://arxiv.org/pdf/2104.08860.pdf) [code](https://github.com/ArrowLuo/CLIP4Clip) | __arXiv__| Southwest Jiaotong University | 18 Apr 2021 |
|9|CLIP2Video| CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | [paper](https://arxiv.org/pdf/2106.11097.pdf) [code](https://github.com/CryhanFang/CLIP2Video) | __arXiv__| PCG, Tencent | 21 Jun 2021 |
|10| T2VLAD| T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval | [paper](https://arxiv.org/abs/2104.10054) | CVPR 2021 | Baidu | 20 Apr 2021 |
|11|-| On Semantic Similarity in Video Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wray_On_Semantic_Similarity_in_Video_Retrieval_CVPR_2021_paper.pdf) [code](https://github.com/mwray/Semantic-Video-Retrieval) | CVPR 2021 |University of Bristol | 21 Jun 2021|
|12| VLM|VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding| [paper](https://arxiv.org/pdf/2105.09996.pdf) | arXiv | Facebook AI | 20 May 2021|
|13| VideoBERT| VideoBERT: A Joint Model for Video and Language Representation Learning |[paper](https://arxiv.org/abs/1904.01766) | ICCV 2019 | Google Research | 11 Sep 2019 |
|14| CBT | Learning Video Representations using Contrastive Bidirectional Transformer |[paper](https://arxiv.org/pdf/1906.05743.pdf)| arXiv | Google Research | 27 Sep 2019|
|15| ActBERT | ActBERT: Learning Global-Local Video-Text Representations |[paper](https://arxiv.org/abs/2011.07231) | CVPR 2020 | Baidu Research | 14 Nov 2020|
|16| HERO | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training |[paper](https://aclanthology.org/2020.emnlp-main.161.pdf) [code](https://github.com/linjieli222/HERO) | EMNLP 2020 | Microsoft Dynamics 365 AI Research | 1 May 2020 |
|17| UniVL | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | [paper](https://arxiv.org/pdf/2002.06353.pdf) [code](https://github.com/microsoft/UniVL) |arXiv| MSRA| 15 Feb 2020|
|18| BSP| Boundary-sensitive Pre-training for Temporal Localization in Videos |[paper](https://arxiv.org/pdf/2011.10830.pdf) | ICCV 2021 | Samsung AI Centre Cambridge, UK | 26 Mar 2021 |
|19| MM-ViT| MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition| [paper](https://arxiv.org/pdf/2108.09322.pdf) | arXiv | OPPO| 20 Aug 2021|
|20| TPT |Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | [paper](https://arxiv.org/pdf/2109.04735v1.pdf) | arXiv | Chinese Academy of Sciences | 10 Sep 2021|
|21| ActionCLIP| ActionCLIP: A New Paradigm for Video Action Recognition | [paper](https://arxiv.org/pdf/2109.08472.pdf) [code](https://github.com/sallymmx/ActionCLIP.git) | arXiv | Zhejiang University |17 Sep 2021|
|22| Just Ask | Just Ask: Learning to Answer Questions from Millions of Narrated Videos|[paper](https://arxiv.org/pdf/2012.00451.pdf) [code](https://github.com/antoyang/just-ask) | ICCV 2021 | Inria Paris| 12 Aug 2021|
|23| - | A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer| [paper](https://arxiv.org/pdf/2112.04888.pdf) [code](https://github.com/weijiawu/BOVText)| arXiv |Zhejiang University| 9 Dec 2021|
|24| SwinBERT| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | [paper](https://arxiv.org/pdf/2111.13196.pdf) | arXiv | Microsoft | 25 Nov 2021|
|25| VIOLET | VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | [paper](https://arxiv.org/pdf/2111.12681.pdf) [code](https://github.com/tsujuifu/pytorch_violet) | arXiv | UC Santa Barbara| 24 Nov 2021|
|26| FashionViL | FashionViL: Fashion-Focused Vision-and-Language Representation Learning | [paper](https://arxiv.org/pdf/2207.08150.pdf) [code](https://github.com/BrandonHanx/mmf) | ECCV 2022 | University of Surrey | 17 Jul 2022 |
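Many retrieval entries above (CLIP4Clip, CLIP2Video, Frozen in Time) reduce video-text retrieval to comparing one video vector against caption vectors. Below is a minimal sketch of the parameter-free mean-pooling similarity that CLIP4Clip studies as its simplest variant, assuming per-frame and caption embeddings have already been computed by CLIP-style encoders (the encoders themselves are outside this sketch).

```python
import torch
import torch.nn.functional as F

def retrieval_scores(frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Mean-pooling video-text similarity, the simplest baseline studied in
    CLIP4Clip: average per-frame embeddings into one video vector, then take
    cosine similarity against every caption embedding.

    frame_feats: (V, T, D) per-frame embeddings for V videos, assumed
                 precomputed by a CLIP-style image tower.
    text_feats:  (Q, D) caption embeddings.
    returns:     (Q, V) similarity matrix for text-to-video retrieval.
    """
    video = F.normalize(frame_feats.mean(dim=1), dim=-1)    # (V, D) pooled video vectors
    text = F.normalize(text_feats, dim=-1)                  # (Q, D)
    return text @ video.T                                   # cosine similarities

scores = retrieval_scores(torch.randn(100, 12, 512), torch.randn(100, 512))
ranks = scores.argsort(dim=1, descending=True)   # best-matching video per caption first
```

The stronger variants in these papers replace the mean pooling with a learned temporal module (e.g. a small transformer over frames) while keeping the same similarity-matrix evaluation.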
# cross-domain video retrieval

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| -| Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Liu_Adaptive_Cross-Modal_Prototypes_for_Cross-Domain_Visual-Language_Retrieval_CVPR_2021_paper.pdf) | CVPR 2021 | Zhejiang University| 20 Apr 2021|

# vision & language navigation

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|Episodic Transformer | Episodic Transformer for Vision-and-Language Navigation| [paper](https://arxiv.org/pdf/2105.06453.pdf) | arXiv | Inria | 13 May 2021|

================================================
FILE: video-transformer.md
================================================

# Video Transformer

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1|TimeSformer |Is Space-Time Attention All You Need for Video Understanding? |[paper](https://arxiv.org/abs/2102.05095) [code](https://github.com/facebookresearch/TimeSformer) |__arXiv__|Facebook AI|24 Feb 2021|
|2|VTN |Video Transformer Network |[paper](https://arxiv.org/abs/2102.00719) |__arXiv__|Theator|1 Feb 2021|
|3|ViViT |ViViT: A Video Vision Transformer |[paper](https://arxiv.org/pdf/2103.15691.pdf) |__arXiv__|Google AI|29 Mar 2021|
|4|VideoGPT | VideoGPT: Video Generation using VQ-VAE and Transformers | [paper](https://arxiv.org/pdf/2104.10157.pdf) [code](https://wilson1yan.github.io/videogpt/index.html) | __arXiv__ | UC Berkeley | 20 Apr 2021|
|5|VIMPAC|VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning| [paper](https://arxiv.org/pdf/2106.11250.pdf) [code](https://github.com/airsplay/vimpac) | __arXiv__ | UNC| 21 Jun 2021|
|6|-| Self-supervised Video Representation Learning by Context and Motion Decoupling | [paper](https://arxiv.org/pdf/2104.00862.pdf)| CVPR 2021 | Alibaba | 2 Apr 2021|
|7|VideoLightFormer| VideoLightFormer: Lightweight Action Recognition using Transformers| [paper](https://arxiv.org/pdf/2107.00451v1.pdf) | arXiv| University of Sheffield| 1 Jul 2021|
|8|Video Swin Transformer| Video Swin Transformer| [paper](https://arxiv.org/pdf/2106.13230.pdf) [code](https://github.com/SwinTransformer/Video-Swin-Transformer) | arXiv | MSRA | 24 Jun 2021|
|9| ST Swin| Long-Short Temporal Contrastive Learning of Video Transformers| [paper](https://arxiv.org/pdf/2106.09212.pdf) |arXiv|Facebook AI| 17 Jun 2021|
|10|X-ViT|Space-time Mixing Attention for Video Transformer| [paper](https://arxiv.org/pdf/2106.05968.pdf) | arXiv| Samsung AI Cambridge |11 Jun 2021|
|11| OCVT | Generative Video Transformer: Can Objects be the Words? | [paper](https://arxiv.org/abs/2107.09240) | ICML 2021 |Rutgers University | 20 Jul 2021|
|12|STAM|An Image is Worth 16x16 Words, What is a Video Worth?| [paper](https://arxiv.org/pdf/2103.13915.pdf) [code](https://github.com/Alibaba-MIIL/STAM) | arXiv | Alibaba |27 May 2021|
|13| SCT| Shifted Chunk Transformer for Spatio-Temporal Representational Learning | [paper](https://arxiv.org/pdf/2108.11575.pdf) | arXiv | Kuaishou Technology | 26 Aug 2021|
|14| -| Evaluating Transformers for Lightweight Action Recognition | [paper](https://arxiv.org/pdf/2111.09641.pdf) | arXiv | University of Sheffield | 18 Nov 2021|
|15| DualFormer| DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition | [paper](https://arxiv.org/pdf/2112.04674v1.pdf) | arXiv |Sea AI Lab | 9 Dec 2021|
|16| BEVT| BEVT: BERT Pretraining of Video Transformers | [paper](https://arxiv.org/pdf/2112.01529.pdf) | arXiv | Shanghai Key Lab of Intelligent Information Processing | 2 Dec 2021|
|17|-| Efficient Video Transformers with Spatial-Temporal Token Selection|[paper](https://arxiv.org/pdf/2111.11591.pdf)| arXiv | Shanghai Key Lab of Intelligent Information Processing | 23 Nov 2021|
|18| LVT| Lite Vision Transformer with Enhanced Self-Attention| [paper](https://arxiv.org/pdf/2112.10809.pdf) [code](https://github.com/Chenglin-Yang/LVT) | arXiv | Johns Hopkins University | 20 Dec 2021|
|19|MViT| Multiscale Vision Transformers| [paper](https://arxiv.org/pdf/2104.11227.pdf) [code](https://github.com/facebookresearch/SlowFast)| ICCV 2021 | Facebook| 22 Apr 2021|
|20| UniFormer| UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning| [paper](https://openreview.net/pdf?id=nBU_u6DLvoK) [code](https://github.com/sense-x/uniformer) | arXiv | Chinese Academy of Sciences|12 Jan 2022|
|21|MaskFeat| Masked Feature Prediction for Self-Supervised Visual Pre-Training| [paper](https://arxiv.org/pdf/2112.09133v1.pdf)| arXiv | Facebook AI |16 Dec 2021|
|22|MTV| Multiview Transformers for Video Recognition| [paper](https://arxiv.org/pdf/2201.04288.pdf) |arXiv| Google | 20 Jan 2022|
|23| MeMViT | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition| [paper](https://arxiv.org/pdf/2201.08383.pdf) |arXiv | Facebook AI Research | 20 Jan 2022|
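A recurring idea in this table (TimeSformer, ViViT, X-ViT) is factorizing full space-time attention into cheaper temporal and spatial passes. Below is a minimal sketch of such "divided" attention; the tensor layout and module structure are a convention chosen for this sketch, not any paper's reference implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Factorized ("divided") space-time attention in the spirit of
    TimeSformer/ViViT: first attend over time within each spatial location,
    then over space within each frame. Illustrative only.
    """
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, T, N, D)
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)        # time attention per spatial location
        t, _ = self.time_attn(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)     # residual over the time pass
        s = x.reshape(B * T, N, D)                            # space attention per frame
        s, _ = self.space_attn(s, s, s)
        return x + s.reshape(B, T, N, D)                      # residual over the space pass

out = DividedSpaceTimeAttention(dim=96)(torch.randn(2, 8, 49, 96))  # (2, 8, 49, 96)
```

The payoff is cost: full space-time attention scales with (T·N)², while the divided form scales with N·T² + T·N², which is why most entries above adopt some factorized or windowed variant.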
================================================
FILE: vision_model_compression.md
================================================

# Compressed Transformer

|No. |Model Name |Title |Links |Pub. | Organization| Release Time |
|-----|:-----:|:-----:|:-----:|:--------:|:---:|:-------:|
|1| VTP |Vision Transformer Pruning |[paper](https://arxiv.org/pdf/2104.08500.pdf) |__KDD 2021 workshop__|Westlake University|14 Aug 2021|
|2| IA-RED2 | IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers | [paper](https://proceedings.neurips.cc/paper/2021/hash/d072677d210ac4c03ba046120f0802ec-Abstract.html) [code](http://people.csail.mit.edu/bpan/ia-red/) | __NeurIPS 2021__ | MIT| 23 Jun 2021|
|3| DynamicViT| DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | [paper](https://arxiv.org/pdf/2106.02034.pdf) [code](https://github.com/raoyongming/DynamicViT) | __NeurIPS 2021__| Tsinghua University| 26 Oct 2021|
|4| Evo-ViT| Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer| [paper](https://arxiv.org/pdf/2108.01390.pdf) [code](https://github.com/YifanXu74/Evo-ViT)|__arXiv__|Chinese Academy of Sciences |6 Dec 2021|
|5| - |Patch Slimming for Efficient Vision Transformers| [paper](https://arxiv.org/pdf/2106.02852.pdf) |__arXiv__| Peking University|5 Jun 2021|
|6|-| Chasing Sparsity in Vision Transformers: An End-to-End Exploration| [paper](https://arxiv.org/pdf/2106.04533.pdf) [code](https://github.com/VITA-Group/SViTE) | __NeurIPS 2021__| University of Texas at Austin| 22 Oct 2021|
|7|DeiT| Training data-efficient image transformers & distillation through attention | [paper](https://arxiv.org/pdf/2012.12877.pdf) | __ICML 2021__|Facebook | 15 Jan 2021|
|8| -|Post-Training Quantization for Vision Transformer| [paper](https://arxiv.org/abs/2106.14156) | __NeurIPS 2021__| Peking University| 27 Jun 2021|
|9| -| Multi-Dimensional Model Compression of Vision Transformer | [paper](https://arxiv.org/pdf/2201.00043.pdf) | __arXiv__| Princeton University |31 Dec 2021|
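Several entries above (DynamicViT, Evo-ViT, IA-RED2) accelerate ViTs by discarding uninformative tokens between layers. The sketch below shows the generic keep-top-k pattern; scoring patches by their [CLS] attention is a stand-in for the learned prediction modules these papers actually train.

```python
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.7):
    """Generic score-based token pruning between transformer layers, in the
    spirit of DynamicViT / Evo-ViT. Scoring by [CLS] attention is a stand-in
    for the learned prediction modules those papers actually train.

    tokens:   (B, 1+N, D) -- [CLS] token followed by N patch tokens.
    cls_attn: (B, N) attention weights from [CLS] to each patch token.
    returns:  (B, 1+K, D) with the K = int(keep_ratio * N) highest-scored patches.
    """
    B, _, D = tokens.shape
    N = tokens.shape[1] - 1
    k = max(1, int(keep_ratio * N))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values      # keep spatial order stable
    patches = tokens[:, 1:].gather(1, idx.unsqueeze(-1).expand(B, k, D))
    return torch.cat([tokens[:, :1], patches], dim=1)             # re-attach [CLS]

pruned = prune_tokens(torch.randn(2, 197, 192), torch.rand(2, 196))  # (2, 138, 192)
```

Because attention cost is quadratic in sequence length, pruning 30% of tokens at several layers compounds into a large FLOPs reduction, which is the common motivation across this table.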