Repository: lwpyh/Awesome-MLLM-Reasoning-Collection
Branch: main
Commit: b428eece84cc
Files: 2
Total size: 199.9 KB

Directory structure:
gitextract_eaow443p/

├── README.md
└── contribution.md

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Awesome-MLLM-Reasoning-Collection
[![License: MIT](https://img.shields.io/badge/License-MIT-purple.svg)](LICENSE)

👏 Welcome to the Awesome-MLLM-Reasoning-Collection repository! This repository is a carefully curated collection of papers, code, datasets, benchmarks, and resources focused on reasoning within Multimodal Large Language Models (MLLMs).

Feel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.
### Table of Contents
- [Awesome-MLLM-Reasoning-Collection](#awesome-mllm-reasoning-collection)
    - [Table of Contents](#table-of-contents)
  - [Papers and Projects 📄](#papers-and-projects-)
    - [Commonsense Reasoning](#commonsense-reasoning)
      - [Image MLLM](#image-mllm)
      - [Video MLLM](#video-mllm)
      - [Audio MLLM](#audio-mllm)
      - [Omni MLLM](#omni-mllm)
    - [Reasoning Segmentation and Detection](#reasoning-segmentation-and-detection)
      - [Image MLLM](#image-mllm-1)
      - [Video MLLM](#video-mllm-1)
      - [Audio MLLM](#audio-mllm-1)
      - [Omni MLLM](#omni-mllm-1)
    - [Spatial and Temporal Grounding and Understanding](#spatial-and-temporal-grounding-and-understanding)
      - [Image MLLM](#image-mllm-2)
      - [Video MLLM](#video-mllm-2)
      - [Audio MLLM](#audio-mllm-2)
      - [Omni MLLM](#omni-mllm-2)
    - [Math Reasoning](#math-reasoning)
      - [Image MLLM](#image-mllm-3)
    - [Chart Reasoning](#chart-rasoning)
    - [Visual-Audio Generation](#visual-generation)
      - [Image MLLM](#image-mllm-4)
      - [Video MLLM](#video-mllm-3)
      - [Audio MLLM](#audio-mllm-3)
    - [Reasoning with Agent/Tool](#reasoning-with-agent) 
    - [Medical Reasoning](#medical-reasoning)
      - [Audio MLLM](#audio-mllm-4)
      - [Omni MLLM](#omni-mllm-3)
    - [Embodied Reasoning](#embodied-reasoning)
    - [Others](#others)
      - [Image MLLM](#image-mllm-5)
      - [Video MLLM](#video-mllm-4)
      - [Audio MLLM](#audio-mllm-5)
      - [Omni MLLM](#omni-mllm-4)
  - [Benchmarks 📊](#benchmarks-)
  - [Open-source Projects](#open-source-projects)
  - [Contributing](#contributing)


<a name="PapersandProjects"></a>
## Papers and Projects 📄

<a name="VQA"></a>
### Commonsense Reasoning
#### Image MLLM
* 26.03 [Phi-4-reasoning-vision-15B Technical Report](https://arxiv.org/abs/2603.03975) | [Paper📑](https://arxiv.org/abs/2603.03975) [Model🤗](https://huggingface.co/microsoft/Phi-4-reasoning-vision-plus)
  - Compact 15B open-weight multimodal reasoning model excelling at scientific/math reasoning and UI understanding via careful architecture choices, data curation, and hybrid reasoning/non-reasoning training. | Task: Reasoning & Understanding
* 26.03 [Beyond Language Modeling: An Exploration of Multimodal Pretraining](https://arxiv.org/abs/2603.03276) | [Paper📑](https://arxiv.org/abs/2603.03276)
  - Systematic empirical study of unified multimodal pretraining from scratch using Transfusion, finding vision and language are complementary and world modeling emerges from general pretraining. | Task: Reasoning & Understanding
* 26.03 [Unified Vision-Language Modeling via Concept Space Alignment](https://arxiv.org/abs/2603.01096) | [Paper📑](https://arxiv.org/abs/2603.01096)
  - v-LCM extends the Large Concept Model with vision-language instruction tuning in a unified latent embedding space, outperforming SOTA VLMs across 61 languages. | Task: Reasoning & Understanding
* 26.03 [MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning](https://arxiv.org/abs/2603.02024) | [Paper📑](https://arxiv.org/abs/2603.02024) [Project🌐](https://mmr-life-bench.github.io/)
  - Benchmark of 2,646 QAs on 19,108 images across 7 reasoning types for multi-image reasoning; even GPT-5 achieves only 58% accuracy vs. human performance. | Task: Reasoning & Understanding
* 26.03 [Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks](https://arxiv.org/abs/2602.23898) | [Paper📑](https://arxiv.org/abs/2602.23898) [Project🌐](https://ref-adv.github.io/)
  - REC benchmark suppressing shortcuts with linguistically nontrivial expressions, revealing MLLM reliance on shortcuts and gaps in visual reasoning and grounding. | Task: Reasoning & Understanding
* 26.03 [Recursive Think-Answer Process for LLMs and VLMs](https://arxiv.org/abs/2603.02099) | [Paper📑](https://arxiv.org/abs/2603.02099)
  - Recursive think-answer process enabling iterative reasoning refinement in VLMs for improved multimodal reasoning performance. | Task: Reasoning & Understanding
* 26.02 [From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models](https://arxiv.org/abs/2602.22859) | [Paper📑](https://arxiv.org/abs/2602.22859) [Code🖥️](https://github.com/hongruijia/DPE) [Model🤗](https://huggingface.co/hongruijia/Qwen3_VL_8B_Instruct_DPE_v3)
  - Spiral-loop framework diagnosing capability gaps in MLLMs and generating targeted data and RL training to close them iteratively. | Task: Reasoning & Understanding
* 26.02 [Imagination Helps Visual Reasoning, But Not Yet in Latent Space](https://arxiv.org/abs/2602.22766) | [Paper📑](https://arxiv.org/abs/2602.22766) [Code🖥️](https://github.com/AI9Stars/CapImagine)
  - CapImagine proposes text-based explicit imagination outperforming latent-space baselines on vision-centric benchmarks via causal mediation analysis. | Task: Reasoning & Understanding
* 26.02 [NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors](https://arxiv.org/abs/2602.22144) | [Paper📑](https://arxiv.org/abs/2602.22144) [Code🖥️](https://github.com/lingfengren/NoLan)
  - Training-free decoding dynamically suppressing language priors by comparing multimodal vs. text-only output distributions, achieving +6.45/+7.21 accuracy on POPE. | Task: Reasoning & Understanding
* 26.02 [Selective Training for Large Vision Language Models via Visual Information Gain](https://arxiv.org/abs/2602.17186) | [Paper📑](https://arxiv.org/abs/2602.17186)
  - Visual Information Gain (VIG) metric quantifying how much visual input reduces prediction uncertainty for improved visual grounding and reduced language bias. | Task: Reasoning & Understanding
* 26.02 [MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning](https://arxiv.org/abs/2602.10575) | [Paper📑](https://arxiv.org/abs/2602.10575) [Code🖥️](https://github.com/MING-ZCH/MetaphorStar) [Model🤗](https://huggingface.co/MING-ZCH/MetaphorStar-7B)
  - End-to-end visual RL framework for image metaphor comprehension with TFQ-GRPO method, achieving 82.6% average improvement on image implication benchmarks. | Task: Reasoning & Understanding
* 26.02 [Learning Self-Correction in Vision-Language Models via Rollout Augmentation](https://arxiv.org/abs/2602.08503) | [Paper📑](https://arxiv.org/abs/2602.08503) [Model🤗](https://huggingface.co/Tuwhy/Octopus-8B)
  - Octopus synthesizes dense self-correction examples for VLMs via RL, achieving SOTA among open-source VLMs on 7 benchmarks. | Task: Reasoning & Understanding
* 26.02 [SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs](https://arxiv.org/abs/2602.06566) | [Paper📑](https://arxiv.org/abs/2602.06566)
  - Decouples visual perception from reasoning in VLMs via a two-stage pipeline, enabling efficient test-time scaling with 200× lower token budget. | Task: Reasoning & Understanding
* 26.02 [Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models](https://arxiv.org/abs/2602.07026) | [Paper📑](https://arxiv.org/abs/2602.07026) [Code🖥️](https://github.com/Yu-xm/ReVision)
  - Fixed-frame Modality Gap Theory with training-free ReAlign alignment and scalable ReVision pretraining using unpaired data to bridge the modality gap. | Task: Reasoning & Understanding
* 26.02 [Kimi K2.5: Visual Agentic Intelligence](https://arxiv.org/abs/2602.02276) | [Paper📑](https://arxiv.org/abs/2602.02276) [Code🖥️](https://github.com/MoonshotAI/Kimi-K2.5) [Model🤗](https://huggingface.co/moonshotai/Kimi-K2.5)
  - Open-source multimodal agentic model achieving SOTA across coding, vision, reasoning, and agentic tasks via joint text-vision RL and Agent Swarm parallel execution. | Task: Reasoning & Understanding
* 26.02 [Toward Cognitive Supersensing in Multimodal Large Language Model](https://arxiv.org/abs/2602.01541) | [Paper📑](https://arxiv.org/abs/2602.01541) [Code🖥️](https://github.com/PediaMedAI/Cognition-MLLM) [Model🤗](https://huggingface.co/PediaMedAI/CogSense-8B) [Dataset🤗](https://huggingface.co/datasets/PediaMedAI/CogSense-Bench)
  - Trains MLLMs to generate internal visual imagery sequences for abstract visual reasoning, evaluated on CogSense-Bench spanning five cognitive dimensions. | Task: Reasoning & Understanding
* 26.02 [Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling](https://arxiv.org/abs/2602.02453) | [Paper📑](https://arxiv.org/abs/2602.02453) [Code🖥️](https://github.com/andongBlue/Think-with-Comics)
  - Uses comics as a visual medium to improve multimodal reasoning efficiency while preserving temporal structure and narrative coherence. | Task: Reasoning & Understanding
* 26.02 [SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs](https://arxiv.org/abs/2602.06040) | [Paper📑](https://arxiv.org/abs/2602.06040) [Code🖥️](https://github.com/Accio-Lab/SwimBird) [Dataset🤗](https://huggingface.co/datasets/Accio-Lab/SwimBird-SFT-92K)
  - Hybrid autoregressive MLLM dynamically switching between text-only, vision-only, and interleaved vision-text reasoning modes based on input queries. | Task: Reasoning & Understanding
* 26.02 [What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis](https://arxiv.org/abs/2602.12395) | [Paper📑](https://arxiv.org/abs/2602.12395) [Code🖥️](https://github.com/tianyi-lab/Frankenstein) [Model🤗](https://huggingface.co/AIcell/Frankenstein-RL)
  - Analyzes what RL actually improves in VLMs for visual reasoning, finding RL primarily refines mid-to-late transformer layers that improve vision-to-reasoning alignment. | Task: Reasoning & Understanding
* 26.02 [On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs](https://arxiv.org/abs/2602.12506) | [Paper📑](https://arxiv.org/abs/2602.12506)
  - Shows RL fine-tuning of VLMs introduces vulnerability to textual perturbations and reveals an accuracy-faithfulness trade-off undermining chain-of-thought reliability. | Task: Reasoning & Understanding
* 26.02 [Thinking with Drafting: Optical Decompression via Logical Reconstruction](https://arxiv.org/abs/2602.11731) | [Paper📑](https://arxiv.org/abs/2602.11731)
  - TwD reconstructs logical structures from compressed visual tokens via Domain-Specific Language, forcing models to draft reasoning as executable code for self-verification. | Task: Reasoning & Understanding
* 26.02 [Visual Persuasion: What Influences Decisions of Vision-Language Models?](https://arxiv.org/abs/2602.15278) | [Paper📑](https://arxiv.org/abs/2602.15278)
  - Studies VLM visual decision-making through controlled image-based choice tasks with systematic perturbations to identify visual vulnerabilities and safety concerns. | Task: Reasoning & Understanding
* 26.02 [Adapting Vision-Language Models for E-commerce Understanding at Scale](https://arxiv.org/abs/2602.11733) | [Paper📑](https://arxiv.org/abs/2602.11733)
  - Adapts general-purpose VLMs for e-commerce product understanding via a 4M-item visual instruction tuning dataset covering deep product attributes and dynamic extraction. | Task: Reasoning & Understanding
* 26.02 [Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception](https://arxiv.org/abs/2602.11858) | [Paper📑](https://arxiv.org/abs/2602.11858) [Code🖥️](https://github.com/inclusionAI/Zooming-without-Zooming) [Model🤗](https://huggingface.co/inclusionAI/ZwZ-8B)
  - Trains MLLMs to internally perform iterative zooming during inference via distillation, eliminating repeated tool calls while improving fine-grained visual perception. | Task: Reasoning & Understanding
* 26.02 [DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories](https://arxiv.org/abs/2602.10809) | [Paper📑](https://arxiv.org/abs/2602.10809) [Code🖥️](https://github.com/RUC-NLPIR/DeepImageSearch) [Dataset🤗](https://huggingface.co/datasets/RUC-NLPIR/DISBench)
  - Reformulates image retrieval as multi-step reasoning over visual histories with DISBench benchmark and a modular agent with dual-memory system. | Task: Reasoning & Understanding
* 26.01 [MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods](https://arxiv.org/abs/2601.21821) | [Paper📑](https://arxiv.org/abs/2601.21821) [Model🤗](https://huggingface.co/OpenDataArena/MMFineReason-8B) [Dataset🤗](https://huggingface.co/datasets/OpenDataArena/MMFineReason-1.8M-Qwen3-VL-235B-Thinking)
  - A 1.8M-sample multimodal reasoning dataset with high-quality CoT annotations; the 8B model approaches Qwen3-VL-32B-Thinking performance. | Task: Reasoning & Understanding
* 26.01 [DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models](https://arxiv.org/abs/2512.24165) | [Paper📑](https://arxiv.org/abs/2512.24165) [Code🖥️](https://github.com/lcqysl/DiffThinker) [Model🤗](https://huggingface.co/yhx12/DiffThinker) [Dataset🤗](https://huggingface.co/datasets/yhx12/DiffThinker_Eval)
  - Reformulates multimodal reasoning as a native image-to-image generative task using diffusion models. | Task: Reasoning & Understanding
* 26.01 [Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models](https://arxiv.org/abs/2601.19834) | [Paper📑](https://arxiv.org/abs/2601.19834) [Code🖥️](https://github.com/thuml/Reasoning-Visual-World) [Dataset🤗](https://huggingface.co/datasets/thuml/VisWorld-Eval)
  - Proposes the visual superiority hypothesis: visual generation serves as a more natural world model for physical/spatial reasoning tasks. | Task: Reasoning & Understanding
* 26.01 [VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning](https://arxiv.org/abs/2601.22069) | [Paper📑](https://arxiv.org/abs/2601.22069) [Code🖥️](https://github.com/w-yibo/VTC-R1)
  - Compresses textual reasoning traces into compact images as "optical memory" for VLMs, achieving 3.4x token compression. | Task: Reasoning & Understanding
* 26.01 [UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision](https://arxiv.org/abs/2601.03193) | [Paper📑](https://arxiv.org/abs/2601.03193) [Code🖥️](https://github.com/Hungryyan1/UniCorn) [Model🤗](https://huggingface.co/CostaliyA/UniCorn)
  - Self-improvement framework partitioning a single model into Proposer/Solver/Judge roles via self-play to improve comprehension and generation. | Task: Reasoning & Understanding
* 26.01 [LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning](https://arxiv.org/abs/2601.10129) | [Paper📑](https://arxiv.org/abs/2601.10129) [Code🖥️](https://github.com/Svardfox/LaViT) [Model🤗](https://huggingface.co/Svard/LaViT-3B) [Dataset🤗](https://huggingface.co/datasets/Svard/LaViT-15k)
  - Addresses the "Perception Gap" by aligning latent visual thoughts (attention trajectories) between teacher and student models. | Task: Reasoning & Understanding
* 26.01 [STEP3-VL-10B Technical Report](https://arxiv.org/abs/2601.09668) | [Paper📑](https://arxiv.org/abs/2601.09668) [Code🖥️](https://github.com/stepfun-ai/Step3-VL-10B) [Model🤗](https://huggingface.co/stepfun-ai/Step3-VL-10B)
  - A 10B multimodal foundation model with Parallel Coordinated Reasoning (PaCoRe) for test-time compute scaling. | Task: Reasoning & Understanding
* 26.01 [Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation](https://arxiv.org/abs/2601.21406) | [Paper📑](https://arxiv.org/abs/2601.21406) [Code🖥️](https://github.com/Sugewud/UniMRG)
  - Trains unified multimodal models to generate pixel, depth, and segmentation representations alongside understanding. | Task: Reasoning & Understanding
* 26.01 [What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge](https://arxiv.org/abs/2601.10922) | [Paper📑](https://arxiv.org/abs/2601.10922)
  - First-place NeurIPS 2025 DCVLR challenge submission revealing difficulty-based example selection as dominant driver in data curation. | Task: Reasoning & Understanding
* 26.01 [MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models](https://arxiv.org/abs/2601.21181) | [Paper📑](https://arxiv.org/abs/2601.21181)
  - Modality-adaptive decoding to mitigate cross-modal hallucinations in MLLMs by dynamically adjusting decoding. | Task: Reasoning & Understanding
* 25.12 [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/abs/2512.03043) | [Paper📑](https://arxiv.org/abs/2512.03043)
  - Unifies image and video understanding across diverse visual tasks using RL with EMA-GRPO technique. | Task: Reasoning & Understanding
* 25.12 [Puzzle Curriculum GRPO for Vision-Centric Reasoning](https://arxiv.org/abs/2512.14944) | [Paper📑](https://arxiv.org/abs/2512.14944)
  - Supervision-free RL method enhancing visual reasoning in VLMs through self-supervised puzzle environments. | Task: Reasoning & Understanding
* 25.12 [Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding](https://arxiv.org/abs/2512.17532) | [Paper📑](https://arxiv.org/abs/2512.17532)
  - Enhances MLLM robustness to visual degradations by modeling degradation parameters through structured reasoning chains. | Task: Reasoning & Understanding
* 25.12 [See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning](https://arxiv.org/abs/2512.22120) | [Paper📑](https://arxiv.org/abs/2512.22120)
  - Improves VLM multimodal reasoning via paired masked views to enforce fine-grained visual reliance. | Task: Reasoning & Understanding
* 25.11 [OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe](https://arxiv.org/abs/2511.16334) | [Paper📑](https://arxiv.org/abs/2511.16334)
  - Open general-purpose framework for advancing multimodal reasoning. | Task: Reasoning & Understanding
* 25.11 [ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning](https://arxiv.org/abs/2510.27492) | [Paper📑](https://arxiv.org/abs/2510.27492)
  - Studies emergent properties in multimodal interleaved chain-of-thought reasoning. | Task: Reasoning & Understanding
* 25.11 [TiDAR: Think in Diffusion, Talk in Autoregression](https://arxiv.org/abs/2511.08923) | [Paper📑](https://arxiv.org/abs/2511.08923)
  - Combines diffusion-based thinking with autoregressive generation for multimodal reasoning. | Task: Reasoning & Understanding
* 25.10 [TTRV: Test-Time Reinforcement Learning for Vision Language Models](https://arxiv.org/abs/2510.06783) | [Paper📑](https://arxiv.org/abs/2510.06783)
  - Test-time reinforcement learning applied to vision-language models for improved reasoning. | Task: Reasoning & Understanding
* 25.10 [VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs](https://arxiv.org/abs/2509.25916) | [Paper📑](https://arxiv.org/abs/2509.25916)
  - Improves VLMs' ability to combine high-level reasoning with detailed visual perception. | Task: Reasoning & Understanding
* 25.10 [ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping](https://arxiv.org/abs/2510.08457) | [Paper📑](https://arxiv.org/abs/2510.08457)
  - Adaptive reasoning for multimodal models using entropy shaping. | Task: Reasoning & Understanding
* 25.09 [R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and RL](https://arxiv.org/abs/2508.21113) | [Paper📑](https://arxiv.org/abs/2508.21113)
  - Training method using RL and annealing to improve auto-thinking and reasoning in multimodal LLMs. | Task: Reasoning & Understanding
* 25.09 [LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training](https://arxiv.org/abs/2509.23661) | [Paper📑](https://arxiv.org/abs/2509.23661)
  - Open-source framework for training multimodal vision-language models. | Task: Reasoning & Understanding
* 25.08 [Thyme: Think Beyond Images](https://arxiv.org/abs/2508.11630) | [Paper📑](https://arxiv.org/abs/2508.11630)
  - Multimodal reasoning system that extends beyond surface-level image understanding to higher-level thinking. | Task: Reasoning & Understanding
* 25.08 [Controlling Multimodal LLMs via Reward-guided Decoding](https://arxiv.org/abs/2508.11616) | [Paper📑](https://arxiv.org/abs/2508.11616)
  - Controls MLLM reasoning outputs through reward-based generation guidance at decoding time. | Task: Reasoning & Understanding
* 25.08 [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://arxiv.org/abs/2508.15882) | [Paper📑](https://arxiv.org/abs/2508.15882)
  - VLM that uses reasoning decomposition and self-reward to improve visual reasoning quality. | Task: Reasoning & Understanding
* 25.08 [GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models](https://arxiv.org/abs/2508.06471) | [Paper📑](https://arxiv.org/abs/2508.06471)
  - Foundation model with strong agentic, reasoning, and coding capabilities across modalities. | Task: Reasoning & Understanding
* 25.07 [GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning](https://arxiv.org/abs/2507.01006) | [Paper📑](https://arxiv.org/abs/2507.01006) [Code🖥️](https://github.com/THUDM/GLM-4.1V-Thinking)
  - A reasoning-centric training framework for general-purpose multimodal reasoning. | Task: Reasoning & Understanding
* 25.07 [MiCo: Multi-image Contrast for Reinforcement Visual Reasoning](https://arxiv.org/abs/2506.22434) | [Paper📑](https://arxiv.org/abs/2506.22434)
  - Constructs image triplets comprising two augmented views of the same image and a third, similar but distinct image. | Task: Reasoning & Understanding
* 25.06 [Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning](https://arxiv.org/abs/2506.09736) | [Paper📑](https://arxiv.org/abs/2506.09736) [Code🖥️](https://github.com/YutingLi0606/Vision-Matters) [Model🤗](https://huggingface.co/collections/Yuting6/vision-matters-684801dd1879d3e639a930d1)
  - Simple visual perturbation framework that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. | Task: Reasoning & Understanding
* 25.05 [Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning](https://arxiv.org/pdf/2505.14677) | [Paper📑](https://arxiv.org/pdf/2505.14677) [Code🖥️](https://github.com/maifoundations/Visionary-R1) [Model🤗](https://huggingface.co/maifoundations/Visionary-R1)
* 25.05 [Sherlock: Self-Correcting Reasoning in Vision-Language Models](http://arxiv.org/pdf/2505.22651) | [Paper📑](http://arxiv.org/pdf/2505.22651) [Code🖥️](https://github.com/DripNowhy/Sherlock) [Model🤗](https://huggingface.co/collections/Tuwhy/sherlock-6835f46e450a48f228f7e80d)
  - Explores self-correction as a strategy to enhance reasoning VLMs. | Task: Reasoning & Understanding
* 25.05 [EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning](https://arxiv.org/pdf/2505.04623) | [Paper📑](https://arxiv.org/pdf/2505.04623) [Code🖥️](https://github.com/HarryHsing/EchoInk) [Model🤗](https://huggingface.co/harryhsing/EchoInk-R1-7B)
  - The first general framework for unified audio-visual reasoning via reinforcement learning. | Task: Reasoning & Understanding
* 25.03 [Skywork-R1V: Pioneering Multimodal Reasoning with CoT](https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf) | [Paper📑](https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf) [Code🖥️](https://github.com/SkyworkAI/Skywork-R1V) [Model🤗](https://huggingface.co/Skywork/Skywork-R1V-38B)
  - The first industry open-sourced multimodal reasoning model with advanced visual chain-of-thought capabilities. | Task: Reasoning & Understanding
* 25.03 [CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation](https://arxiv.org/pdf/2503.05255) | [Paper📑](https://arxiv.org/pdf/2503.05255)
  - Mimics human-like "slow thinking" for multi-image understanding. | Task: VQA
* 25.03 [DAPO: an Open-Source LLM Reinforcement Learning System at Scale](https://dapo-sia.github.io/static/pdf/dapo_paper.pdf) | [Paper📑](https://dapo-sia.github.io/static/pdf/dapo_paper.pdf) [Code🖥️](https://github.com/BytedTsinghua-SIA/DAPO) [Data🤗](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
  - Proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. | Task: Math
* 25.03 [VisRL: Intention-Driven Visual Perception via Reinforced Reasoning](https://arxiv.org/pdf/2503.07523) | [Paper📑](https://arxiv.org/pdf/2503.07523) [Code🖥️](https://github.com/zhangquanchen/VisRL) 
  - The first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception | Task: VQA
* 25.03 [Unified Reward Model for Multimodal Understanding and Generation](https://arxiv.org/abs/2503.05236) | [Paper📑](https://arxiv.org/abs/2503.05236) [Code🖥️](https://codegoat24.github.io/UnifiedReward/) [Dataset🤗](https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede)
  - Improves MLLMs' understanding and generation ability with DPO. | Task: VQA & Generation
* 25.02 [Qwen2.5-VL Technical Report](https://arxiv.org/pdf/2502.13923) | [Paper📑](https://arxiv.org/pdf/2502.13923) [Code🖥️](https://github.com/QwenLM/Qwen2.5-VL) [Huggingface🤗](https://huggingface.co/Qwen)
  - The latest flagship model of the Qwen vision-language series for various multimodal tasks. | Task: Reasoning & Understanding
* 25.02 [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment](https://arxiv.org/abs/2502.10391) | [Paper📑](https://arxiv.org/abs/2502.10391) [Project🌐](https://mm-rlhf.github.io/)
  - A comprehensive project for aligning MLLMs with human preferences. | Task: Reward & VQA
* 25.01 [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/pdf/2501.12599) (MoonshotAI) | [Project🌐](https://github.com/MoonshotAI/Kimi-k1.5)
  - The latest flagship model of the Kimi series for various multimodal tasks. | Task: Reasoning & Understanding
* 25.01 [InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model](https://arxiv.org/abs/2501.12368) | [Paper📑](https://arxiv.org/abs/2501.12368) [Code🖥️](https://github.com/InternLM/InternLM-XComposer)
  - A simple yet effective multi-modal reward model that aligns MLLMs with human preferences. | Task: Reward & VQA
* 25.01 [LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs](https://arxiv.org/abs/2501.06186) | [Paper📑](https://arxiv.org/abs/2501.06186) [Code🖥️](https://github.com/mbzuai-oryx/LlamaV-o1)
  - A multimodal reasoning model combining multi-step curriculum learning with beam search. | Task: VQA
* 25.01 [ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding](https://arxiv.org/pdf/2501.05452) | [Paper📑](https://arxiv.org/pdf/2501.05452) [Code🖥️](https://github.com/zeyofu/ReFocus_Code) [Model🤗](https://huggingface.co/Fiaa/ReFocus)
  - Performs visual chain of thought via input-image editing to help multimodal reasoning. | Task: VQA
* 24.12 [Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search](https://arxiv.org/abs/2412.18319) | [Paper📑](https://arxiv.org/abs/2412.18319) [Code🖥️](https://github.com/HJYao00/Mulberry) [Dataset🤗](https://huggingface.co/datasets/HuanjinYao/Mulberry-SFT)
  - Improves MLLM reasoning ability via collective Monte Carlo tree search. | Task: VQA
* 24.11  [LLaVA-CoT: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440) | [Paper📑](https://arxiv.org/abs/2411.10440) [Code🖥️](https://github.com/PKU-YuanGroup/LLaVA-CoT) [Model🤗](https://huggingface.co/Xkev/Llama-3.2V-11B-cot)
  - A novel MLLM designed to conduct autonomous multistage reasoning. | Task: VQA
* 24.11 [Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models](https://arxiv.org/abs/2411.14432) | [Paper📑](https://arxiv.org/abs/2411.14432) [Code🖥️](https://github.com/dongyh20/Insight-V) [Model🤗](https://huggingface.co/collections/THUdyh/insight-v-673f5e1dd8ab5f2d8d332035)
  - Explores long-chain visual reasoning with MLLMs. | Task: VQA
* 24.11 [Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization](https://arxiv.org/abs/2411.10442) | [Paper📑](https://arxiv.org/abs/2411.10442) [Code🖥️](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/internvl2.0_mpo) [Model🤗](https://huggingface.co/OpenGVLab/InternVL2-8B-MPO)
  - A preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. | Task: VQA
* 24.10 [Improve Vision Language Model Chain-of-thought Reasoning](https://arxiv.org/pdf/2410.16198) | [Paper📑](https://arxiv.org/pdf/2410.16198) [Code🖥️](https://github.com/RifleZhang/LLaVA-Reasoner-DPO)
  - Applies reinforcement learning on 193k CoT SFT data for reasoning. | Task: VQA
* 24.03 (NeurIPS 2024) [Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning](https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf) | [Paper📑](https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf) [Code🖥️](https://github.com/deepcs233/Visual-CoT) [Dataset🤗](https://huggingface.co/datasets/deepcs233/Visual-CoT)
  - Visual CoT dataset and benchmark to improve MLLMs' reasoning ability. | Task: VQA
* 23.02 [Multimodal Chain-of-Thought Reasoning in Language Models](https://arxiv.org/abs/2302.00923) | [Paper📑](https://arxiv.org/abs/2302.00923) [Code🖥️](https://github.com/amazon-science/mm-cot)
  - Multimodal CoT for MLLM reasoning. | Task: VQA
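Many entries above (DAPO, EMA-GRPO, TFQ-GRPO, Visionary-R1) build on GRPO-style reinforcement learning. As a minimal sketch of the two recurring ingredients, here is group-normalized advantage computation plus a clipped surrogate loss with decoupled clip ranges (the "clip-higher" idea DAPO introduces). Function names and clip values are illustrative, not taken from any listed paper:

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each sampled response's reward is
    normalized against the mean/std of its own rollout group, so no
    learned value function (critic) is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_policy_loss(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped surrogate. Decoupling clip_high > clip_low
    (as in DAPO's clip-higher) leaves more headroom to increase the
    probability of good rollouts while still bounding downward moves."""
    unclipped = ratio * advantage
    clipped_ratio = max(min(ratio, 1 + clip_high), 1 - clip_low)
    return -min(unclipped, clipped_ratio * advantage)

# Example: binary rewards within one group of 4 rollouts.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # ≈ [1, -1, 1, -1]
```

The group normalization is what lets these methods use cheap verifiable rewards (exact-match answers, IoU, etc.) directly, which is why it recurs across the RL-based papers in this list.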

#### Video MLLM
* 26.03 [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913) | [Paper📑](https://arxiv.org/abs/2602.20913) [Code🖥️](https://github.com/qiujihao19/LongVideo-R1)
  - Active reasoning-equipped MLLM agent for efficient long video navigation using hierarchical visual summaries and iterative clip refinement, trained via SFT+RL on 33K CoT trajectories. | Task: Video Understanding & Reasoning
* 26.03 [Proact-VL: A Proactive VideoLLM for Real-Time AI Companions](https://arxiv.org/abs/2603.03447) | [Paper📑](https://arxiv.org/abs/2603.03447)
  - Framework enabling real-time proactive AI companions via chunk-wise video processing with autonomous response timing decisions, evaluated on Live Gaming Benchmark. | Task: Video Understanding & Reasoning
* 26.02 [A Very Big Video Reasoning Suite](https://arxiv.org/abs/2602.20159) | [Paper📑](https://arxiv.org/abs/2602.20159) [Model🤗](https://huggingface.co/Video-Reason/VBVR-Wan2.2) [Dataset🤗](https://huggingface.co/datasets/Video-Reason/VBVR-Dataset)
  - 1M+ video clip dataset spanning 200 reasoning tasks (VBVR) with VBVR-Bench for verifiable evaluation, enabling emergent generalization via large-scale scaling. | Task: Video Understanding & Reasoning
* 26.02 [Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning](https://arxiv.org/abs/2601.21037) | [Paper📑](https://arxiv.org/abs/2601.21037) [Project🌐](https://thinking-in-frames.github.io/)
  - Video generation models achieve zero-shot generalization for visual reasoning by using generated frames as intermediate reasoning steps with a visual test-time scaling law. | Task: Video Understanding & Reasoning
* 26.02 [Multimodal Fact-Level Attribution for Verifiable Reasoning](https://arxiv.org/abs/2602.11509) | [Paper📑](https://arxiv.org/abs/2602.11509) [Code🖥️](https://github.com/meetdavidwan/murgat)
  - MuRGAt benchmark requiring MLLMs to provide precise fact-level citations across video, audio, and modalities, finding strong models frequently hallucinate citations despite correct reasoning. | Task: Video Understanding & Reasoning
* 26.02 [Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions](https://arxiv.org/abs/2602.13013) | [Paper📑](https://arxiv.org/abs/2602.13013) [Code🖥️](https://github.com/HVision-NKU/ASID-Caption) [Model🤗](https://huggingface.co/AudioVisual-Caption/ASID-Captioner-7B) [Dataset🤗](https://huggingface.co/datasets/AudioVisual-Caption/ASID-1M)
  - ASID-Caption suite with 1M structured audiovisual annotations and quality verification pipeline for fine-grained audiovisual video understanding across multiple attribute dimensions. | Task: Video Understanding & Reasoning
* 26.02 [VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction](https://arxiv.org/abs/2602.13294) | [Paper📑](https://arxiv.org/abs/2602.13294) [Code🖥️](https://github.com/TIGER-AI-Lab/VisPhyWorld)
  - Execution-based framework evaluating physical reasoning in MLLMs by requiring executable simulator code from visual observations; VisPhyBench (209 scenes) reveals MLLMs struggle to infer physical parameters. | Task: Video Understanding & Reasoning
* 26.01 [Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation](https://arxiv.org/abs/2512.24271) | [Paper📑](https://arxiv.org/abs/2512.24271)
  - Uses counterfactual video generation to reduce hallucinations and improve temporal reasoning in multimodal LLMs. | Task: Video Understanding & Reasoning
* 25.12 [Rethinking Chain-of-Thought Reasoning for Videos](https://arxiv.org/abs/2512.09616) | [Paper📑](https://arxiv.org/abs/2512.09616)
  - Proposes improved chain-of-thought reasoning strategies specifically designed for video understanding tasks. | Task: Video Understanding & Reasoning
* 25.12 [SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with RL](https://arxiv.org/abs/2512.13874) | [Paper📑](https://arxiv.org/abs/2512.13874)
  - RL-based framework training agents for long-horizon video reasoning across variable time spans. | Task: Video Understanding & Reasoning
* 25.11 [Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination](https://arxiv.org/abs/2511.17490) | [Paper📑](https://arxiv.org/abs/2511.17490)
  - Enhances reasoning over text-rich video content via visual rumination. | Task: Video Understanding & Reasoning
* 25.10 [Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning](https://arxiv.org/abs/2510.23473) | [Paper📑](https://arxiv.org/abs/2510.23473)
  - Reasoning framework enabling models to think with video inputs via RL. | Task: Video Understanding & Reasoning
* 25.10 [StreamingVLM: Real-Time Understanding for Infinite Video Streams](https://arxiv.org/abs/2510.09608) | [Paper📑](https://arxiv.org/abs/2510.09608)
  - Real-time video stream understanding with multimodal LLMs. | Task: Video Understanding & Reasoning
* 25.09 [Video models are zero-shot learners and reasoners](https://arxiv.org/abs/2509.20328) | [Paper📑](https://arxiv.org/abs/2509.20328)
  - Demonstrates zero-shot reasoning capabilities in video models. | Task: Video Understanding & Reasoning
* 25.07 [Scaling RL to Long Videos](https://arxiv.org/abs/2507.07966) | [Paper📑](https://arxiv.org/pdf/2507.07966) [Model🤗](https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B) [Code🖥️](https://github.com/NVlabs/Long-RL)
* 25.06 [DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO](https://arxiv.org/abs/2506.07464) | [Paper📑](https://arxiv.org/pdf/2506.07464)
* 25.06 [VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning](https://arxiv.org/abs/2505.12434) | [Paper📑](https://arxiv.org/abs/2505.12434) [Model🤗](https://huggingface.co/QiWang98/VideoRFT) [Code🖥️](https://github.com/QiWang98/VideoRFT)
  - Extends Reinforcement Fine-Tuning (RFT) to the video reasoning domain, a long-standing challenge. | Task: Video Understanding & Reasoning
* 25.06 [VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks](https://www.arxiv.org/abs/2506.09079) | [Paper📑](https://www.arxiv.org/pdf/2506.09079) [Model🤗](https://huggingface.co/VersaVid-R1/VersaVid-R1) [Code🖥️](https://github.com/VersaVid-R1/VersaVid-R1)
* 25.05 [SpaceR: Reinforcing MLLMs in Video Spatial Reasoning](https://arxiv.org/abs/2504.01805v2) | [Paper📑](https://arxiv.org/pdf/2504.01805v2) [Model🤗](https://huggingface.co/RUBBISHLIKE/SpaceR) [Code🖥️](https://github.com/OuyangKun10/SpaceR)
* 25.05 [Video-R1: Reinforcing Video Reasoning in MLLMs](https://arxiv.org/abs/2503.21776) | [Paper📑](https://arxiv.org/pdf/2503.21776) [Model🤗](https://huggingface.co/Video-R1/Video-R1-7B) [Code🖥️](https://github.com/tulerfeng/Video-R1)
* 25.04 [TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning](https://arxiv.org/abs/2504.09641) | [Paper📑](https://arxiv.org/pdf/2504.09641) [Model🤗](https://huggingface.co/Zhang199/TinyLLaVA-Video-R1) [Code🖥️](https://github.com/ZhangXJ199/TinyLLaVA-Video-R1)
  - Presents the small-scale video reasoning model TinyLLaVA-Video-R1 | Task: Video QA
* 25.04 [Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning](https://arxiv.org/pdf/2505.03318) | [Paper📑](https://arxiv.org/pdf/2505.03318) [Project🌐](https://codegoat24.github.io/UnifiedReward/think) [Code🖥️](https://github.com/CodeGoat24/UnifiedReward)
  - The first unified multimodal CoT reward model, capable of step-by-step long-chain reasoning for visual understanding and generation reward tasks. | Task: Video Understanding and Generation
* 25.04 [ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting](https://arxiv.org/abs/2504.15921) | [Paper📑](https://arxiv.org/abs/2504.15921)
  - A system for summarising hour-long videos without supervision. | Task: Video Summarisation
* 25.04 [VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning](https://arxiv.org/abs/2503.13444) | [Paper📑](https://arxiv.org/abs/2503.13444) [Code🖥️](https://github.com/yeliudev/VideoMind) | [Dataset🤗](https://huggingface.co/datasets/yeliudev/VideoMind-Dataset)
  - A novel video-language agent designed for temporal-grounded video understanding. | Task: Video QA
* 25.04 [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://arxiv.org/pdf/2503.24376) | [Paper📑](https://arxiv.org/pdf/2503.24376) [Code🖥️](https://github.com/TencentARC/SEED-Bench-R1) | [Dataset🤗](https://huggingface.co/datasets/TencentARC/SEED-Bench-R1)
  - Reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. | Task: Video QA
* 25.03 [VIDEOTREE: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos](https://arxiv.org/abs/2405.19209) | [Paper📑](https://arxiv.org/pdf/2405.19209) [Code🖥️](https://github.com/Ziyang412/VideoTree)
* 25.02 [CoS: Chain-of-Shot Prompting for Long Video Understanding](https://arxiv.org/pdf/2502.06428) | [Paper📑](https://arxiv.org/pdf/2502.06428) [Code🖥️](https://github.com/lwpyh/CoS_codes1)
  - Approach long video understanding by optimising input video information to fully utilise MLLM’s ability to comprehend long videos. | Task: Video VQA
* 25.02 [video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model](https://arxiv.org/abs/2502.11775) | [Paper📑](https://arxiv.org/abs/2502.11775) [Demo🖥️](https://github.com/BriansIDP/video-SALMONN-o1)
  - An open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. | Task: Video QA
* 25.02 [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video) | [Code🖥️](https://github.com/Wang-Xiaodong1899/Open-R1-Video) [Dataset🤗](https://huggingface.co/datasets/Xiaodong/open-r1-video-4k)
  - An open-source R1-style video understanding model | Task: Video QA
* 25.01 [Temporal Preference Optimization for Long-Form Video Understanding](https://arxiv.org/abs/2501.13919) | [Paper📑](https://arxiv.org/abs/2501.13919) [Code🖥️](https://ruili33.github.io/tpo_website/)
  - A novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning | Task: Video QA
* 25.01 [Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding](https://arxiv.org/abs/2501.07888) | [Paper📑](https://arxiv.org/abs/2501.07888) [Code🖥️](https://github.com/bytedance/tarsier) [Model🤗](https://huggingface.co/omni-research/Tarsier-34b)
  - A family of VLMs designed for high-quality video captioning and understanding | Task: Video captioning & QA
* 24.12 (ECCV24) [VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding](https://arxiv.org/abs/2403.11481) | [Paper📑](https://arxiv.org/abs/2403.11481) [Code🖥️](https://github.com/YueFan1014/VideoAgent) [Project🌐](https://videoagent.github.io/)
  - Explore how reconciling several foundation models with a novel unified memory mechanism could tackle the challenging video understanding problem  | Task: Video captioning & QA

#### Audio MLLM
* 25.10 [UALM: Unified Audio Language Model for Understanding, Generation and Reasoning](https://arxiv.org/abs/2510.12000)  [Project🌐](https://research.nvidia.com/labs/adlr/UALM/)
* 25.09 [MiMo Audio: Audio Language Models are Few-Shot Learners](https://github.com/XiaomiMiMo/MiMo-Audio) [Project🌐](https://xiaomimimo.github.io/MiMo-Audio-Demo/)  [Code🖥️](https://github.com/XiaomiMiMo/MiMo-Audio)
* 25.07 [Audio Entailment: Assessing Deductive Reasoning for Audio Understanding](https://arxiv.org/abs/2407.18062)
* 25.07 [Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models](https://arxiv.org/abs/2507.08128)
* 25.05 [AudSemThinker: Enhancing Audio-Language Models Through Reasoning over Semantics of Sound](https://arxiv.org/abs/2505.14142)
* 25.05 [Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?](https://arxiv.org/abs/2505.09439)
  - Utilizing GRPO to enhance audio reasoning performance.
* 25.04 [SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning](https://arxiv.org/abs/2504.15900)
* 25.04 [Kimi-Audio Technical Report](https://arxiv.org/abs/2504.18425)  [Code🖥️](https://github.com/MoonshotAI/Kimi-Audio)
* 25.03 [Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering](https://arxiv.org/html/2503.11197v1)
* 25.03 [Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models](https://arxiv.org/pdf/2503.02318)  [Project🌐](https://xzf-thu.github.io/Audio-Reasoner/)
  - Utilizing CoT data for audio understanding tasks.
* 25.03 [Mellow: a small audio language model for reasoning](https://arxiv.org/pdf/2503.08540)  [Code🖥️](https://github.com/soham97/mellow)
  - Small audio-language model (167M) designed for audio understanding, audio entailment, audio difference and captioning.
* 25.03 [Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities](https://arxiv.org/pdf/2503.03983) [Project🌐](https://research.nvidia.com/labs/adlr/AF2/)
  - NVIDIA audio-language model for various audio understanding and reasoning tasks.
* 25.02 [Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction](https://arxiv.org/abs/2502.11946) [Code🖥️](https://github.com/stepfun-ai/Step-Audio)
* 25.01 [Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model](https://arxiv.org/pdf/2501.07246)
  - Finetuning Qwen2-Audio with CoT data for audio understanding and retrieval tasks.
* 24.07 [Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759) [Paper📑](https://arxiv.org/abs/2407.10759)  [Code🖥️](https://github.com/QwenLM/Qwen2-Audio)
  - Qwen audio-language series for various audio understanding tasks, especially speech.
* 24.07 (EMNLP2024) [GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities](https://arxiv.org/pdf/2406.11768)  [Project🌐](https://sreyan88.github.io/gamaaudio/)
  - Audio-language model for advanced audio understanding and complex reasoning tasks.
* 24.02 (ICML2024) [Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities](https://arxiv.org/pdf/2402.01831) [Code🖥️](https://github.com/NVIDIA/audio-flamingo)
  - Audio-language model for various audio understanding and reasoning tasks, built with Q-formers.
* 23.11 [Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/pdf/2311.07919) [Code🖥️](https://github.com/QwenLM/Qwen-Audio)
  - Qwen audio-language series for various audio understanding tasks across speech, sound, and music.
* 23.10 (ICLR2024) [SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/pdf/2310.13289) [Code🖥️](https://github.com/bytedance/SALMONN)
  - ByteDance audio-language model for various audio understanding tasks, especially speech and sound, using a Q-former.
* 23.09 (NAACL2024) [MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response](https://arxiv.org/pdf/2309.08730)
  - Music-language model for music understanding and captioning tasks.

#### Omni MLLM
* 26.02 [OmniGAIA: Towards Native Omni-Modal AI Agents](https://arxiv.org/abs/2602.22897) | [Paper📑](https://arxiv.org/abs/2602.22897) [Code🖥️](https://github.com/RUC-NLPIR/OmniGAIA) [Model🤗](https://huggingface.co/RUC-NLPIR/OmniAtlas-Qwen3-30B-A3B) [Dataset🤗](https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA)
  - OmniGAIA benchmark for omni-modal agent evaluation on cross-modal reasoning and tool-use, with OmniAtlas agent trained via hindsight-guided tree exploration and OmniDPO. | Task: Reasoning & Understanding
* 26.02 [Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device](https://arxiv.org/abs/2602.20161) | [Paper📑](https://arxiv.org/abs/2602.20161) [Code🖥️](https://github.com/Amshaker/Mobile-O) [Model🤗](https://huggingface.co/Amshaker/Mobile-O-0.5B-iOS)
  - Compact on-device unified multimodal model (~3s/512×512 on iPhone) outperforming Show-O and JanusFlow on generation and visual understanding benchmarks. | Task: Reasoning & Understanding
* 26.02 [OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention](https://arxiv.org/abs/2602.05847) | [Paper📑](https://arxiv.org/abs/2602.05847)
  - Reinforced framework for omnivideo models improving mixed-modality reasoning by combining query-intensive grounding and modality-attentive fusion via contrastive learning. | Task: Reasoning & Understanding
* 26.02 [UniT: Unified Multimodal Chain-of-Thought Test-time Scaling](https://arxiv.org/abs/2602.12279) | [Paper📑](https://arxiv.org/abs/2602.12279)
  - Framework enabling unified multimodal models to perform iterative CoT test-time scaling, showing sequential reasoning is more efficient than parallel sampling for both generation and understanding. | Task: Reasoning & Understanding
* 26.02 [Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching](https://arxiv.org/abs/2602.12221) | [Paper📑](https://arxiv.org/abs/2602.12221)
  - UniDFlow, a unified discrete flow-matching framework decoupling understanding and generation via low-rank adapters and multimodal preference alignment, achieving SOTA across 8 benchmarks. | Task: Reasoning & Understanding
* 26.02 [Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models](https://arxiv.org/abs/2602.15772) | [Paper📑](https://arxiv.org/abs/2602.15772) [Code🖥️](https://github.com/sen-ye/R3)
  - R3 (Reason-Reflect-Refine) framework reformulating single-step generation into multi-step generate-understand-regenerate process to resolve the trade-off between multimodal understanding and generation. | Task: Reasoning & Understanding
* 26.02 [BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents](https://arxiv.org/abs/2602.12876) | [Paper📑](https://arxiv.org/abs/2602.12876)
  - 300-question benchmark for complex multi-hop reasoning across text and visual modalities with deep web search; even SOTA models achieve only 36% accuracy, with the OmniSeeker unified browsing agent. | Task: Reasoning & Understanding
* 25.12 [Qwen3-VL Technical Report](https://arxiv.org/abs/2511.21631) | [Paper📑](https://arxiv.org/abs/2511.21631)
  - Advanced VLM excelling in text and multimodal understanding supporting up to 256K tokens of interleaved text, images, and video. | Task: Reasoning & Understanding
* 25.10 [InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue](https://arxiv.org/abs/2510.13747)
* 25.10 [Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation](https://arxiv.org/abs/2510.24821) | [Paper📑](https://arxiv.org/abs/2510.24821)
  - Unified sparse architecture for multimodal perception and generation across modalities. | Task: Reasoning & Understanding
* 25.10 [OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM](https://arxiv.org/abs/2510.15870) | [Paper📑](https://arxiv.org/abs/2510.15870)
  - Multimodal LLM for comprehensive understanding across all modalities. | Task: Reasoning & Understanding
* 25.09 [Qwen3-Omni Technical Report](https://arxiv.org/abs/2509.17765)
* 25.09 [Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2509.19244) | [Paper📑](https://arxiv.org/abs/2509.19244)
  - Unified model for multimodal understanding and generation across modalities. | Task: Reasoning & Understanding
* 25.07 [Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities](https://arxiv.org/abs/2507.06261)
* 25.05 [EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning](https://arxiv.org/abs/2505.04623)
* 25.03 [Qwen2.5-Omni Technical Report](https://arxiv.org/abs/2503.20215)
* 25.01 [OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis](https://arxiv.org/abs/2501.04561)
* 24.10 [Baichuan-Omni Technical Report](https://arxiv.org/abs/2410.08565)
* 24.09 [MIO: A Foundation Model on Multimodal Tokens](https://arxiv.org/html/2409.17692v1)
* 24.08 [MiniCPM-V: A GPT-4V Level MLLM on Your Phone](https://arxiv.org/abs/2408.01800) [Code🖥️](https://github.com/OpenBMB/MiniCPM-o)
* 24.02 [AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling](https://arxiv.org/html/2402.12226v2)
* 23.12 [Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action](https://arxiv.org/abs/2312.17172)

<a name="ReasoningSegmentation"></a>

### Reasoning Segmentation and Detection
#### Image MLLM
* 26.02 [Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?](https://arxiv.org/abs/2602.23339) | [Paper📑](https://arxiv.org/abs/2602.23339) [Code🖥️](https://github.com/TilemahosAravanis/Retrieve-and-Segment)
  - Retrieval-augmented test-time adapter for open-vocabulary segmentation fusing textual prompts with pixel-annotated visual support features to narrow zero-shot vs. supervised gap. | Task: Reasoning Segmentation
* 26.02 [Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search](https://arxiv.org/abs/2602.04454) | [Paper📑](https://arxiv.org/abs/2602.04454) [Code🖥️](https://github.com/iSEE-Laboratory/Seg-ReSearch) [Dataset🤗](https://huggingface.co/datasets/iSEE-Laboratory/OK_VOS)
  - Novel segmentation paradigm enabling interleaved reasoning and external search to overcome knowledge bottlenecks, with OK-VOS benchmark for open-knowledge video object segmentation. | Task: Reasoning Segmentation
* 26.02 [Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision](https://arxiv.org/abs/2602.13195) | [Paper📑](https://arxiv.org/abs/2602.13195) [Code🖥️](https://github.com/AadSah/ConverSeg)
  - CIS task grounding abstract intent-driven concepts into pixel-accurate masks beyond categorical queries, with ConverSeg benchmark, ConverSeg-Net model, and AI-powered scalable data engine. | Task: Reasoning Segmentation
* 26.03 [HDINO: A Concise and Efficient Open-Vocabulary Detector](https://arxiv.org/abs/2603.02924) | [Paper📑](https://arxiv.org/abs/2603.02924)
  - Concise open-vocabulary detector with one-to-many semantic alignment achieving strong performance with reduced complexity compared to Grounding DINO. | Task: Detection & Grounding
* 26.01 [Urban Socio-Semantic Segmentation with Vision-Language Reasoning](https://arxiv.org/abs/2601.10477) | [Paper📑](https://arxiv.org/abs/2601.10477) [Code🖥️](https://github.com/AMAP-ML/SocioReasoner) [Model🤗](https://huggingface.co/vvangfaye/SocioReasoner-3B) [Dataset🤗](https://huggingface.co/datasets/vvangfaye/SocioSeg)
  - Vision-language reasoning framework for urban satellite segmentation identifying both physical and social categories via multi-stage reasoning. | Task: Reasoning Segmentation
* 26.01 [SAMTok: Representing Any Mask with Two Words](https://arxiv.org/abs/2601.16093) | [Paper📑](https://arxiv.org/abs/2601.16093)
  - Efficient mask tokenization representing arbitrary segmentation masks with just two tokens, enabling reasoning-driven segmentation. | Task: Reasoning Segmentation
* 26.01 [Towards Pixel-Level VLM Perception via Simple Points Prediction](https://arxiv.org/abs/2601.19228) | [Paper📑](https://arxiv.org/abs/2601.19228)
  - Enables pixel-level perception in VLMs through simple points prediction, bridging VLM reasoning and fine-grained spatial detection. | Task: Detection & Grounding
* 25.12 [ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning](https://arxiv.org/abs/2512.02835) | [Paper📑](https://arxiv.org/abs/2512.02835)
  - Uses RL to incentivize reasoning chains for improved video segmentation. | Task: Reasoning Segmentation
* 25.12 [InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search](https://arxiv.org/abs/2512.18745) | [Paper📑](https://arxiv.org/abs/2512.18745)
  - Enhances multimodal models with generalized visual search for improved grounding. | Task: Detection & Grounding
* 25.11 [SAM 3: Segment Anything with Concepts](https://arxiv.org/abs/2511.16719) | [Paper📑](https://arxiv.org/abs/2511.16719)
  - Advances segmentation with concept-based reasoning. | Task: Reasoning Segmentation
* 25.10 [Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation](https://arxiv.org/abs/2510.19592) | [Paper📑](https://arxiv.org/abs/2510.19592)
  - Video reasoning and segmentation with multimodal models without training. | Task: Reasoning Segmentation
* 25.09 [RefAM: Attention Magnets for Zero-Shot Referral Segmentation](https://arxiv.org/abs/2509.22650) | [Paper📑](https://arxiv.org/abs/2509.22650)
  - Zero-shot referral segmentation using attention-based visual reasoning. | Task: Reasoning Segmentation
* 25.07 [UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding](https://arxiv.org/abs/2506.23219) | [Paper📑](https://arxiv.org/abs/2506.23219) [Code🖥️](https://github.com/tsinghua-fib-lab/UrbanLLaVA) 
  - A multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving performance across diverse urban tasks.   | Task: Urban tasks
* 25.07 [Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs](https://arxiv.org/abs/2506.21656) | [Paper📑](https://arxiv.org/abs/2506.21656)
  - A novel fine-grained preference optimization approach that significantly improves spatial reasoning capabilities in VLMs | Task: Spatial Tasks
* 25.06 [Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning](https://arxiv.org/abs/2506.04034) | [Paper📑](https://arxiv.org/abs/2506.04034) [Code🖥️](https://rexthinker.github.io/) [Model🤗](https://huggingface.co/IDEA-Research/Rex-Thinker-GRPO-7B)
  - A grounded referring model that reasons step by step, as a human would | Task: Detection & Grounding
* 25.03 [Visual-RFT: Visual Reinforcement Fine-Tuning](https://arxiv.org/abs/2503.01785) | [Paper📑](https://arxiv.org/abs/2503.01785) [Code🖥️](https://github.com/Liuziyu77/Visual-RFT) [Dataset🤗](https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df)
  - Extends Reinforcement Fine-Tuning to visual tasks with GRPO | Task: Detection & Grounding & Classification
* 25.03 [Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning](https://arxiv.org/pdf/2503.07065) | [Paper📑](https://arxiv.org/pdf/2503.07065)
  - Improve generalization and reasoning of VLMs with GRPO | Task: Detection & Classification & Math
* 25.03 [Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement](https://arxiv.org/abs/2503.06520) | [Paper📑](https://arxiv.org/abs/2503.06520) [Code🖥️](https://github.com/dvlab-research/Seg-Zero) [Model🤗](https://huggingface.co/Ricky06662/Seg-Zero-7B)
  - Address object detection and segmentation with GRPO | Task: Object Detection & Object Segmentation
* 24.08 (NeurIPS) [Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation](https://arxiv.org/abs/2408.15205) | [Paper📑](https://arxiv.org/abs/2408.15205) [Code🖥️](https://github.com/lwpyh/ProMaC_code)
  - Utilize hallucinations to mine task-related information from images and verify its accuracy for enhancing precision of the generated prompts. | Task: Reasoning Segmentation
* 24.07 (CVPR24) [Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs](https://openaccess.thecvf.com/content/CVPR2024/papers/Ranasinghe_Learning_to_Localize_Objects_Improves_Spatial_Reasoning_in_Visual-LLMs_CVPR_2024_paper.pdf) | [Paper📑](https://openaccess.thecvf.com/content/CVPR2024/papers/Ranasinghe_Learning_to_Localize_Objects_Improves_Spatial_Reasoning_in_Visual-LLMs_CVPR_2024_paper.pdf)
  - Explore how instruction fine-tuning objectives could inject spatial awareness into V-LLMs | Task: Reasoning Localization
* 23.12 (AAAI24) [Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects](https://arxiv.org/abs/2312.07374) | [Paper📑](https://arxiv.org/abs/2312.07374) [Code🖥️](https://github.com/jyLin8100/GenSAM)
  - Employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. | Task: Reasoning Segmentation
* 23.12 (CVPR24) [PixelLM: Pixel Reasoning with Large Multimodal Model](https://arxiv.org/abs/2312.02228) | [Paper📑](https://arxiv.org/pdf/2312.02228.pdf) [Code🖥️](https://github.com/MaverickRen/PixelLM)
  - An effective and efficient LMM for pixel-level reasoning and understanding | Task: Reasoning Segmentation
* 23.08 (CVPR24) [LISA: Reasoning Segmentation via Large Language Model](https://arxiv.org/abs/2308.00692) | [Paper📑](https://arxiv.org/abs/2308.00692) [Code🖥️](https://github.com/dvlab-research/LISA) [Dataset🤗](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing)
  - Inherit the language generation capabilities of the MLLM while also possessing the ability to produce segmentation masks. | Task: Reasoning Segmentation
#### Video MLLM
* 26.02 [VidEoMT: Your ViT is Secretly Also a Video Segmentation Model](https://arxiv.org/abs/2602.17807) | [Paper📑](https://arxiv.org/abs/2602.17807) [Code🖥️](https://github.com/tue-mps/videomt) [Model🤗](https://huggingface.co/tue-mps/videomt-dinov2-small-ytvis2019)
  - Lightweight encoder-only video segmentation on plain ViT with query propagation and fusion, achieving 160 FPS with ViT-L without dedicated tracking modules. | Task: Reasoning Segmentation
* 26.02 [Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction](https://arxiv.org/abs/2602.18996) | [Paper📑](https://arxiv.org/abs/2602.18996) [Code🖥️](https://github.com/shannany0606/CCMP)
  - Conditional binary segmentation with cycle-consistency training for object-level correspondence across egocentric/exocentric viewpoints without ground-truth annotations (CVPR 2026). | Task: Reasoning Segmentation
* 24.08 (ECCV24) [VISA: Reasoning Video Object Segmentation via Large Language Model](http://arxiv.org/abs/2407.11325) | [Paper📑](http://arxiv.org/abs/2407.11325) [Code🖥️](https://github.com/cilinyan/VISA) [Dataset🤗](https://github.com/cilinyan/ReVOS-api)
  - Leverage the world knowledge reasoning capabilities of MLLMs while possessing the ability to segment and track objects in videos with a mask decoder | Task: Reasoning Segmentation
* 24.09 (NeurIPS24) [One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos](https://arxiv.org/abs/2409.19603) | [Paper📑](https://arxiv.org/abs/2409.19603) [Code🖥️](https://github.com/showlab/VideoLISA) [Model🤗](https://huggingface.co/ZechenBai/VideoLISA-3.8B)
  - Integrating a Sparse Dense Sampling strategy into the video-LLM to balance temporal context and spatial detail within computational constraints |  Task: Reasoning Segmentation
* 24.06 (NeurIPS24) [OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding](https://arxiv.org/abs/2406.19389) | [Paper📑](https://arxiv.org/abs/2406.19389) [Code🖥️](https://github.com/lxtGH/OMG-Seg)
  - A transformer-based encoder-decoder architecture with task-specific queries and outputs for multiple tasks | Task: Reasoning Segmentation/Detection
#### Audio MLLM
* 24.10 [Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning](https://arxiv.org/abs/2410.16130)

#### Omni MLLM
* 25.07 [Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation](https://arxiv.org/abs/2507.22886)
* 24.08 [Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation](https://arxiv.org/abs/2408.15876)

<a name="Spatio-TemporalReasoning"></a>
### Spatial and Temporal Grounding and Understanding
#### Image MLLM
* 26.02 [GeoWorld: Geometric World Models](https://arxiv.org/abs/2602.23058) | [Paper📑](https://arxiv.org/abs/2602.23058)
  - Hyperbolic JEPA preserving latent state structures for improved long-horizon world model prediction and Geometric RL planning (CVPR 2026). | Task: Spatial Reasoning
* 26.02 [When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning](https://arxiv.org/abs/2602.08236) | [Paper📑](https://arxiv.org/abs/2602.08236) [Code🖥️](https://github.com/Yui010206/Adaptive-Visual-Imagination-Control)
  - AVIC adaptively invokes visual imagination via world models to match or outperform fixed imagination strategies for spatial reasoning with far fewer calls. | Task: Spatial Reasoning
* 26.02 [Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?](https://arxiv.org/abs/2602.07055) | [Paper📑](https://arxiv.org/abs/2602.07055) [Code🖥️](https://github.com/mll-lab-nu/Theory-of-Space) [Dataset🤗](https://huggingface.co/datasets/MLL-Lab/tos-data)
  - Evaluates VLMs' ability to construct spatial beliefs through active exploration, revealing Active-Passive Gap and Belief Inertia—VLMs fail to update stale spatial priors. | Task: Spatial Understanding
* 26.02 [SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?](https://arxiv.org/abs/2602.03916) | [Paper📑](https://arxiv.org/abs/2602.03916) [Code🖥️](https://github.com/SpatiaLab-Reasoning/SpatiaLab) [Dataset🤗](https://huggingface.co/datasets/ciol-research/SpatiaLab)
  - Benchmark of 1,400 VQA pairs across six spatial reasoning categories revealing VLMs achieve only ~55% vs. 87.6% human accuracy (ICLR 2026). | Task: Spatial Reasoning
* 26.02 [LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation](https://arxiv.org/abs/2602.02220) | [Paper📑](https://arxiv.org/abs/2602.02220) [Code🖥️](https://github.com/bo-miao/LangMap)
  - Multi-granularity open-vocabulary navigation task with 414 object categories and 18K+ navigation tasks across scene, room, region, and instance levels. | Task: Spatial Grounding & Navigation
* 26.02 [GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics](https://arxiv.org/abs/2602.12617) | [Paper📑](https://arxiv.org/abs/2602.12617) [Code🖥️](https://github.com/HVision-NKU/GeoAgent) [Model🤗](https://huggingface.co/ghost233lism/GeoAgent) [Dataset🤗](https://huggingface.co/datasets/ghost233lism/GeoSeek)
  - Geolocation reasoning model using RL with geo-similarity and consistency rewards over GeoSeek dataset, enabling fine-grained address-level localization with human-like reasoning. | Task: Spatial Reasoning
* 26.03 [Enhancing Spatial Understanding in Image Generation via Reward Modeling](https://arxiv.org/abs/2602.24233) | [Paper📑](https://arxiv.org/abs/2602.24233) [Project🌐](https://dagroup-pku.github.io/SpatialT2I/)
  - SpatialReward-Dataset (80K preference pairs) and SpatialScore reward model for evaluating spatial relationship accuracy; enables online RL to significantly enhance spatial reasoning in image generation. | Task: Spatial Reasoning
* 26.01 [CoV: Chain-of-View Prompting for Spatial Reasoning](https://arxiv.org/abs/2601.05172) | [Paper📑](https://arxiv.org/abs/2601.05172) [Code🖥️](https://github.com/ziplab/CoV)
  - Training-free test-time reasoning framework transforming VLMs into active viewpoint reasoners through coarse-to-fine 3D exploration, +11.56% on OpenEQA. | Task: Spatial Reasoning
* 26.01 [Think3D: Thinking with Space for Spatial Reasoning](https://arxiv.org/abs/2601.13029) | [Paper📑](https://arxiv.org/abs/2601.13029)
  - Framework for spatial reasoning enabling models to reason in 3D space for improved visual understanding tasks. | Task: Spatial Reasoning & 3D Understanding
* 25.12 [SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL](https://arxiv.org/abs/2512.04069) | [Paper📑](https://arxiv.org/abs/2512.04069)
  - Tool-augmented spatial reasoning using double interactive reinforcement learning. | Task: Spatial Reasoning
* 25.12 [COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence](https://arxiv.org/abs/2512.04563) | [Paper📑](https://arxiv.org/abs/2512.04563)
  - Unified model combining cooperative perception with spatial intelligence reasoning. | Task: Spatial Reasoning
* 25.11 [SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards](https://arxiv.org/abs/2511.07403) | [Paper📑](https://arxiv.org/abs/2511.07403)
  - Uses reinforcement learning with spatial rewards to improve 3D reasoning in MLLMs. | Task: Spatial Reasoning & 3D Understanding
* 25.11 [G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning](https://arxiv.org/abs/2511.21688) | [Paper📑](https://arxiv.org/abs/2511.21688)
  - Unifies 3D reconstruction and spatial reasoning in a geometry-grounded VLM. | Task: Spatial Reasoning & 3D Understanding
* 25.10 [SpaceVista: All-Scale Visual Spatial Reasoning from mm to km](https://arxiv.org/abs/2510.09606) | [Paper📑](https://arxiv.org/abs/2510.09606)
  - Spatial reasoning across multiple scales in visual understanding. | Task: Spatial Reasoning
* 25.08 [3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding](https://arxiv.org/abs/2507.23478) | [Paper📑](https://arxiv.org/abs/2507.23478)
  - Enhances reasoning capabilities of 3D vision-language models for unified 3D scene understanding. | Task: Spatial Reasoning & 3D Understanding
* 25.04 [Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation](https://arxiv.org/pdf/2504.17207) | [Paper📑](https://arxiv.org/pdf/2504.17207) [Project🌐](https://apc-vlm.github.io/) [Code🖥️](https://github.com/KAIST-Visual-AI-Group/APC-VLM) 
  - A framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. | Task: Spatial Reasoning & Understanding
* 25.04 [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning](https://arxiv.org/html/2503.20752v2) | [Paper📑](https://arxiv.org/html/2503.20752v2) [Project🌐](https://tanhuajie.github.io/ReasonRFT/) [Code🖥️](https://github.com/tanhuajie/Reason-RFT) [Dataset🤗](https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset)
  - Introduce a combined RL and SFT training paradigm to enhance visual reasoning capabilities in multimodal models. | Task: Spatial Reasoning & Understanding
* 25.04 [InteractVLM: 3D Interaction Reasoning from 2D Foundational Models](https://arxiv.org/abs/2504.05303) | [Paper📑](https://arxiv.org/abs/2504.05303) [Code💻](https://github.com/saidwivedi/InteractVLM)
  - Harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. | Task: 3D Reconstruction
* 25.03 [Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks](https://arxiv.org/abs/2503.21696) | [Paper📑](https://arxiv.org/abs/2503.21696) [Code💻](https://github.com/zwq2018/embodied_reasoner)  [Project🌐](https://embodied-reasoner.github.io/ ) [Dataset🤗](https://huggingface.co/datasets/zwq2018/embodied_reasoner)
  - A model that extends O1-style reasoning to interactive embodied tasks. | Task: Interactive Embodied Tasks
* 25.03 [VisualThinker-R1-Zero](https://arxiv.org/abs/2503.05132) | [Paper📑](https://arxiv.org/abs/2503.05132) [Code💻](https://github.com/turningpoint-ai/VisualThinker-R1-Zero)
  - R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Task: Counting & Reasoning & 3D Understanding (CV-Bench)
* 25.03 (CVPR2025) [GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks](https://arxiv.org/pdf/2503.06514) | [Paper📑](https://arxiv.org/pdf/2503.06514)
  - Fine-tune VLMs using GFlowNet to promote generation of diverse solutions. | Task: NumberLine (NL) & BlackJack (BJ)
* 25.02 [R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3](https://github.com/Deep-Agent/R1-V) |  [Code🖥️](https://github.com/Deep-Agent/R1-V)
  - An open-source project for VLM reasoning with GRPO | Task: Counting, Number Related Reasoning and Geometry Reasoning
* 25.01 [Imagine while Reasoning in Space: Multimodal Visualization-of-Thought](https://arxiv.org/pdf/2501.07542) | [Paper📑](https://arxiv.org/pdf/2501.07542)
  - Enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.  | Task: Spatial Reasoning
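
Several entries above (R1-V, MetaSpatial, Improved Visual-Spatial Reasoning) train with GRPO. As a rough sketch of the core idea, and making no assumption about any project's actual code, GRPO scores a group of rollouts for the same prompt with a verifiable reward and normalizes within the group instead of learning a value baseline:

```python
# Minimal GRPO-style group-relative advantage sketch. The binary
# counting reward and group size are illustrative assumptions,
# not any listed project's implementation.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize per-rollout rewards against the group mean/std.

    GRPO replaces a learned value baseline with the group mean:
    each rollout's advantage is its reward relative to the other
    rollouts sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one counting question; reward is 1.0 iff
# the predicted count matches ground truth (a rule-based reward).
preds, answer = [3, 5, 5, 4], 5
rewards = [1.0 if p == answer else 0.0 for p in preds]
advs = group_relative_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative, which is what drives the policy update without a critic.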
#### Video MLLM
* 26.02 [TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions](https://arxiv.org/abs/2602.08711) | [Paper📑](https://arxiv.org/abs/2602.08711) [Code🖥️](https://github.com/yaolinli/TimeChat-Captioner) [Model🤗](https://huggingface.co/yaolily/TimeChat-Captioner-GRPO-7B)
  - Omni Dense Captioning with six-dimensional structural schema generating time-aware audio-visual narratives with explicit timestamps, surpassing Gemini-2.5-Pro on the task. | Task: Temporal Grounding/Understanding
* 26.02 [4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere](https://arxiv.org/abs/2602.10094) | [Paper📑](https://arxiv.org/abs/2602.10094)
  - 4D dynamic scene reconstruction framework with conditional querying at arbitrary space-time locations for flexible spatiotemporal understanding of dynamic scenes. | Task: Spatial-Temporal Understanding
* 26.02 [Learning Situated Awareness in the Real World](https://arxiv.org/abs/2602.16682) | [Paper📑](https://arxiv.org/abs/2602.16682)
  - SAW-Bench evaluates egocentric situated awareness using 786 real-world videos from smart glasses with 2,071+ QA pairs, revealing a 37.66% human-model performance gap in observer-centric spatial reasoning. | Task: Temporal Grounding/Understanding
* 26.02 [MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation](https://arxiv.org/abs/2602.14534) | [Paper📑](https://arxiv.org/abs/2602.14534) [Code🖥️](https://github.com/AIGeeksGroup/MoRL)
  - Unified multimodal motion model combining SFT and RL with Chain-of-Motion (CoM) reasoning and large-scale CoT datasets for human motion understanding and generation. | Task: Spatial-Temporal Understanding
* 26.03 [RIVER: A Real-Time Interaction Benchmark for Video LLMs](https://arxiv.org/abs/2603.03985) | [Paper📑](https://arxiv.org/abs/2603.03985)
  - Benchmark evaluating real-time interaction in Video LLMs with Retrospective Memory, Live-Perception, and Proactive Response tasks, revealing gaps in long-term memory and future perception. | Task: Temporal Understanding
* 26.01 [VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding](https://arxiv.org/abs/2601.07290) | [Paper📑](https://arxiv.org/abs/2601.07290) [Code🖥️](https://github.com/JPShi/VideoLoom) [Model🤗](https://huggingface.co/JPShi/VideoLoom-8B)
  - Unified Video LLM for joint spatial-temporal understanding with LoomData-8.7k dataset and LoomBench benchmark. | Task: Spatial-Temporal Understanding
* 26.01 [VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice](https://arxiv.org/abs/2601.05175) | [Paper📑](https://arxiv.org/abs/2601.05175) [Code🖥️](https://github.com/IVUL-KAUST/VideoAuto-R1)
  - Video understanding framework with "reason-when-necessary" strategy using confidence-based reasoning activation, reducing response length 3.3x. | Task: Video Understanding & Reasoning
* 26.01 [Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding](https://arxiv.org/abs/2601.10611) | [Paper📑](https://arxiv.org/abs/2601.10611) [Code🖥️](https://github.com/allenai/molmo2)
  - Open-source video-language model family with point-driven grounding and video tracking capabilities surpassing Gemini 3 Pro on grounding. | Task: Spatial Understanding & Grounding
* 26.01 [PROGRESSLM: Towards Progress Reasoning in Vision-Language Models](https://arxiv.org/abs/2601.15224) | [Paper📑](https://arxiv.org/abs/2601.15224) [Code🖥️](https://github.com/ProgressLM/ProgressLM) [Model🤗](https://huggingface.co/Raymond-Qiancx/ProgressLM-3B-RL) [Dataset🤗](https://huggingface.co/datasets/Raymond-Qiancx/ProgressLM-Dataset)
  - Addresses task progress estimation in VLMs with Progress-Bench benchmark and ProgressLM-3B model. | Task: Temporal Reasoning & Understanding
* 26.01 [HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding](https://arxiv.org/abs/2601.14724) | [Paper📑](https://arxiv.org/abs/2601.14724)
  - Efficient streaming video understanding via hierarchical KV cache memory enabling temporal reasoning over long videos. | Task: Temporal Reasoning
* 25.12 [4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation](https://arxiv.org/abs/2512.17012) | [Paper📑](https://arxiv.org/abs/2512.17012)
  - Region-level 4D (3D + temporal) understanding through perceptual distillation. | Task: Spatial-Temporal Understanding
* 25.12 [MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence](https://arxiv.org/abs/2512.10863) | [Paper📑](https://arxiv.org/abs/2512.10863)
  - Comprehensive benchmark for evaluating spatial intelligence in video understanding. | Task: Spatial-Temporal Understanding
* 25.11 [VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation](https://arxiv.org/abs/2511.17199) | [Paper📑](https://arxiv.org/abs/2511.17199)
  - Incorporates 4D spatiotemporal awareness into VLA models for coherent robotic manipulation. | Task: Spatial-Temporal Understanding
* 25.10 [Trace Anything: Representing Any Video in 4D via Trajectory Fields](https://arxiv.org/abs/2510.13802) | [Paper📑](https://arxiv.org/abs/2510.13802)
  - 4D spatial-temporal representation learning from video. | Task: Spatial-Temporal Understanding
* 25.08 [VLM4D: Towards Spatiotemporal Awareness in Vision Language Models](https://arxiv.org/abs/2508.02095) | [Paper📑](https://arxiv.org/abs/2508.02095)
  - Extends VLMs with spatiotemporal reasoning for understanding spatial and temporal dynamics. | Task: Spatial-Temporal Understanding
* 25.05 [MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding](https://arxiv.org/abs/2505.20715) | [Paper📑](https://arxiv.org/abs/2505.20715) [Code💻](https://github.com/THUNLP-MT/MUSEG)
* 25.04 [VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning](https://arxiv.org/pdf/2504.06958) | [Paper📑](https://arxiv.org/pdf/2504.06958) [Code💻](https://github.com/OpenGVLab/VideoChat-R1)
  - A novel spatio-temporal perception framework with GRPO | Task: Spatial Understanding and Grounding
* 25.04 [VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search](https://arxiv.org/html/2504.09130v1) | [Paper📑](https://arxiv.org/html/2504.09130v1) [Code💻](https://github.com/ekonwang/VisuoThink)
  - A novel framework that seamlessly integrates visuospatial and linguistic domains | Task: Geometry and Spatial Reasoning
* 25.04 [Improved Visual-Spatial Reasoning via R1-Zero-Like Training](https://arxiv.org/abs/2504.00883) | [Paper📑](https://arxiv.org/abs/2504.00883) [Code💻](https://github.com/zhijie-group/R1-Zero-VSI)
  - Incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset. | Task: Video Understanding
* 25.03 [Evolving Temporal Reasoning Capability into LMMs via Temporal Consistent Reward](https://github.com/appletea233/Temporal-R1) | [Code💻](https://github.com/appletea233/Temporal-R1) [Model🤗](https://huggingface.co/appletea2333)
  - Investigate the potential of GRPO in the video temporal grounding task, which demands precise temporal alignment between visual and linguistic modalities as well as advanced reasoning capabilities | Task: Temporal Grounding
* 25.03 [TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM](https://arxiv.org/abs/2503.13377) | [Paper📑](https://arxiv.org/abs/2503.13377) [Code💻](https://github.com/www-Ye/TimeZero) [Model🤗](https://huggingface.co/wwwyyy/TimeZero-Charades-7B)
  - A reasoning-guided MLLM for temporal video grounding, trained with GRPO. | Task: Temporal Grounding
* 25.03 [LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding](https://arxiv.org/abs/2501.08282) | [Paper📑](https://arxiv.org/abs/2501.08282) [Code💻](https://github.com/appletea233/LLaVA-ST)
  - An MLLM for fine-grained spatial-temporal multimodal understanding. | Task: Spatial-Temporal Understanding
* 25.03 [MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse](https://github.com/PzySeere/MetaSpatial) | [Code🖥️](https://github.com/PzySeere/MetaSpatial)
  - Enhance spatial reasoning in VLMs using GRPO  | Task: 3D Spatial Reasoning
* 25.02 [Video-R1: Towards Super Reasoning Ability in Video Understanding](https://github.com/tulerfeng/Video-R1) | [Code🖥️](https://github.com/tulerfeng/Video-R1)
  - Integrate deep thinking capabilities into video understanding tasks through the R1 paradigm | Task:  Video Counting
* 24.12 [TIMEREFINE: Temporal Grounding with Time Refining Video LLM](https://arxiv.org/pdf/2412.09601) | [Paper📑](https://arxiv.org/pdf/2412.09601) [Code🖥️](https://github.com/SJTUwxz/TimeRefine)
  - Enhances Video LLMs on the temporal grounding task by modifying the learning objective | Task: Temporal Grounding
* 24.11 (CVPR2025) [Number it: Temporal Grounding Videos like Flipping Manga](https://arxiv.org/pdf/2411.10332) | [Paper📑](https://arxiv.org/pdf/2411.10332) [Code💻](https://github.com/yongliang-wu/NumPro)
  - Enhances Video-LLMs by overlaying frame numbers onto video frames | Task: Temporal Grounding
* 24.11 [TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability](https://arxiv.org/abs/2411.18211) | [Paper📑](https://arxiv.org/pdf/2411.18211) [Code💻](https://github.com/TimeMarker-LLM/TimeMarker/)
  - A versatile Video-LLM featuring robust temporal localization abilities | Task: Temporal Grounding and Video QA
* 24.08 (AAAI2025) [Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos](https://arxiv.org/abs/2408.14469) | [Paper📑](https://arxiv.org/pdf/2408.14469) [Code💻](https://github.com/qirui-chen/MultiHop-EgoQA)
  - Leverages the world knowledge reasoning capabilities of MLLMs to retrieve temporal evidence in the video with flexible grounding tokens. | Task: Multi-Hop VideoQA
* 24.08 (ICLR2025) [TRACE: Temporal Grounding Video LLM via Causal Event Modeling](https://arxiv.org/abs/2410.05643) | [Paper📑](https://arxiv.org/pdf/2410.05643) [Code💻](https://github.com/gyxxyg/TRACE)
  - Tailored to implement the causal event modeling framework through timestamps, salient scores, and textual captions. | Task: Temporal Grounding
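
The RL-based temporal grounding entries above (e.g., TimeZero, Temporal-R1, MUSEG) rely on a verifiable reward comparing a predicted time window against the annotated one. A minimal sketch of such a temporal-IoU reward, assuming a simple (start, end) interval format; each project's exact reward shaping may differ:

```python
# Hedged sketch of a rule-based temporal-IoU reward for GRPO-style
# video grounding; not taken from any listed project's code.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between a predicted (start, end) window and the ground truth,
    in seconds; usable directly as a verifiable rollout reward."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A rollout predicting [2s, 8s] against ground truth [4s, 10s]
# overlaps for 4s out of an 8s union, so the reward is 0.5.
reward = temporal_iou((2.0, 8.0), (4.0, 10.0))
```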
#### Audio MLLM
* 25.07 [Towards Spatial Audio Understanding via Question Answering](https://arxiv.org/abs/2507.09195)
* 24.06 (InterSpeech 2024) [Can Large Language Models Understand Spatial Audio?](https://arxiv.org/abs/2406.07914)
* 24.02 (ICML 2024) [BAT: Learning to Reason about Spatial Sounds with Large Language Models](https://arxiv.org/abs/2402.01591)

#### Omni MLLM
* 24.06 [VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476)

<a name="MathReasoning"></a>

### Math Reasoning
#### Image MLLM
* 26.02 [P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads](https://arxiv.org/abs/2602.09443) | [Paper📑](https://arxiv.org/abs/2602.09443) [Code🖥️](https://github.com/PRIME-RL/P1-VL) [Model🤗](https://huggingface.co/PRIME-RL/P1-VL-30B-A3B)
  - Open-source VLM family for advanced scientific reasoning using curriculum RL and agentic augmentation; the first open-source model to win 12 gold medals at physics olympiad level. | Task: Math
* 26.02 [DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning](https://arxiv.org/abs/2602.16742) | [Paper📑](https://arxiv.org/abs/2602.16742) [Code🖥️](https://github.com/SKYLENAGE-AI/DeepVision-103K) [Dataset🤗](https://huggingface.co/datasets/skylenage/DeepVision-103K)
  - 103K-sample RLVR training dataset for multimodal K12 mathematical reasoning with diverse topics and rich visual elements, generalizing to general multimodal reasoning tasks. | Task: Math
* 26.02 [Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models](https://arxiv.org/abs/2601.22060) | [Paper📑](https://arxiv.org/abs/2601.22060) [Code🖥️](https://github.com/Osilly/Vision-DeepResearch)
  - Multimodal deep-research paradigm enabling multi-turn, multi-entity, multi-scale visual and textual search via cold-start supervision and RL. | Task: Math
* 26.02 [LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models](https://arxiv.org/abs/2602.14147) | [Paper📑](https://arxiv.org/abs/2602.14147)
  - Multimodal reasoning diffusion LM using SFT + multi-task RL with answer-forcing, tree search, and complementary likelihood estimation for visual math reasoning and image editing. | Task: Math
* 26.02 [Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings](https://arxiv.org/abs/2602.13823) | [Paper📑](https://arxiv.org/abs/2602.13823) [Code🖥️](https://github.com/ZoengHN/Embed-RL)
  - Reasoning-driven multimodal embedding framework using Embedder-Guided RL (EG-RL) to optimize Traceability Chain-of-Thought generation for improved cross-modal semantic consistency. | Task: Math
* 26.01 [CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving](https://arxiv.org/abs/2601.01874) | [Paper📑](https://arxiv.org/abs/2601.01874) [Project🌐](https://shchen233.github.io/cogflow/)
  - Cognitive-inspired three-stage framework (Perception-Internalization-Reasoning) for visual math with MathCog dataset of 120K+ annotations. | Task: Math
* 26.01 [MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning](https://arxiv.org/abs/2512.23412) | [Paper📑](https://arxiv.org/abs/2512.23412)
  - Multimodal tool-integrated reasoning framework enhancing chain-of-thought with tool use for complex math/science problems. | Task: Math
* 26.01 [MMFormalizer: Multimodal Autoformalization in the Wild](https://arxiv.org/abs/2601.03017) | [Paper📑](https://arxiv.org/abs/2601.03017)
  - Framework for automatically formalizing multimodal mathematical content from images and text into formal representations. | Task: Math
* 25.11 [MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning](https://arxiv.org/abs/2511.06805) | [Paper📑](https://arxiv.org/abs/2511.06805)
  - Improves multimodal math reasoning via iterative self-evolution and reward-guided training. | Task: Math
* 25.10 [Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning](https://arxiv.org/abs/2509.23250) | [Paper📑](https://arxiv.org/abs/2509.23250)
  - Process reward models for scaling multimodal reasoning at test time. | Task: Math
* 25.09 [BaseReward: A Strong Baseline for Multimodal Reward Model](https://arxiv.org/abs/2509.16127) | [Paper📑](https://arxiv.org/abs/2509.16127)
  - Strong baseline reward model for multimodal RL-based alignment. | Task: Math
* 25.08 [MathReal: A Real Scene Benchmark for Evaluating Math Reasoning in MLLMs](https://arxiv.org/abs/2508.06009) | [Paper📑](https://arxiv.org/abs/2508.06009)
  - Benchmark for evaluating multimodal math reasoning using real-world scene photographs. | Task: Math
* 25.11 [Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning](https://arxiv.org/abs/2511.18437) | [Paper📑](https://arxiv.org/abs/2511.18437) [Code🖥️](https://github.com/MiliLab/PEARL) [Model🤗](https://huggingface.co/Rex1090/PEARL-8B)
   - Introduce a perception checklist to anchor RL policy updates in verified visual evidence and prevent hallucinations. | Task: Math
* 25.11 [Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning](https://arxiv.org/abs/2510.20519) | [Paper📑](https://arxiv.org/abs/2510.20519) [Code🖥️](https://github.com/MM-Thinking/Metis-HOME) [Model🤗](https://huggingface.co/mmthinking/Metis-HOME)
  - Use a mixture-of-experts framework with dynamic routing for balancing complex reasoning and general tasks. | Task: Math
* 25.10 [Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start](https://arxiv.org/abs/2510.25801) | [Paper📑](https://arxiv.org/abs/2510.25801) [Code🖥️](https://github.com/Kwen-Chen/SPECS-VL)
  - Replace supervised fine-tuning with self-distilled, preference-based cold starts to improve RL generalization. | Task: Math
* 25.09 [DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning](https://arxiv.org/abs/2509.25866) | [Paper📑](https://arxiv.org/abs/2509.25866) [Code🖥️](https://github.com/MiliLab/DeepSketcher)
  - Internalize visual reasoning by directly manipulating visual embeddings using code-rendered trajectories, bypassing external tools and reducing grounding noise. | Task: Math
* 25.07 [The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs](https://www.arxiv.org/abs/2507.07562) [Paper📑](https://www.arxiv.org/abs/2507.07562) [Code🖥️](https://github.com/JierunChen/SFT-RL-SynergyDilemma) 
  - a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. | Task: Math
* 25.06 [Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning](https://arxiv.org/abs/2506.13056) | [Paper📑](https://arxiv.org/abs/2506.13056) [Code🖥️](https://github.com/MM-Thinking/Metis-RISE) [Model🤗](https://github.com/MM-Thinking/Metis-RISE)
  - Reverse the training pipeline by first using RL for reasoning exploration, then applying SFT with self-distilled and expert-augmented trajectories for stability and capability enhancement. | Task: Math
* 25.06 [SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis](https://arxiv.org/abs/2506.02096) [Paper📑](https://arxiv.org/abs/2506.02096) [Code🖥️](https://github.com/NUS-TRAIL/SynthRL) [Model🤗](https://huggingface.co/collections/Jakumetsu/synthrl-6839d265136fa9ca717105c5)
  - A novel framework that enhances the reasoning capabilities of multimodal large language models. | Task: Math
* 25.06 [SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning](https://arxiv.org/abs/2506.01713) [Paper📑](https://arxiv.org/abs/2506.01713) [Code🖥️](https://github.com/SUSTechBruce/SRPO_MLLMs) [Model🤗](https://huggingface.co/datasets/SRPOMLLMs/srpo-sft-data)
  - scale the training data with correctness and distribution guarantees to achieve better performance. | Task: Math
* 25.05 [Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO](https://arxiv.org/pdf/2505.22453) [Paper📑](https://arxiv.org/pdf/2505.22453) [Code🖥️](https://github.com/waltonfuture/MM-UPT) 
  - A Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO. | Task: Math
* 25.05 [X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains](https://arxiv.org/abs/2505.03981) | [Paper📑](https://arxiv.org/abs/2505.03981) [Code🖥️](https://github.com/microsoft/x-reasoner)
  - A training recipe that optimizes the reasoning capability of VLMs with SFT and RL on general-domain text-only data. | Task: Math
* 25.04 [NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation](https://arxiv.org/pdf/2504.13055) | [Paper📑](https://arxiv.org/pdf/2504.13055) [Code🖥️](https://github.com/John-AI-Lab/NoisyRollout) [Model🤗](https://huggingface.co/collections/xyliu6/noisyrollout-67ff992d1cf251087fe021a2)
  - Introduces targeted rollout diversity by mixing rollouts from both clean and moderately distorted images, encouraging the model to learn more robust behaviors. | Task: Math
* 25.04 [VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning](https://arxiv.org/abs/2504.08837) | [Paper📑](https://arxiv.org/abs/2504.08837) [Code🖥️](https://github.com/TIGER-AI-Lab/VL-Rethinker/) [Model🤗](https://huggingface.co/TIGER-Lab/VL-Rethinker-7B)
  - Aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the SOTA. | Task: Math
* 25.04 [SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement](https://arxiv.org/abs/2504.07934) | [Paper📑](https://arxiv.org/abs/2504.07934) [Code🖥️](https://github.com/si0wang/ThinkLite-VL)
  - Propose a novel way of repurposing Monte Carlo Tree Search (MCTS) to enable effective data filtering. | Task: Math reasoning
* 25.04 [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://github.com/RyanLiu112/GenPRM/blob/main/static/paper.pdf) | [Paper📑](https://github.com/RyanLiu112/GenPRM/blob/main/static/paper.pdf) [Project🌐](https://ryanliu112.github.io/GenPRM/) [Code🖥️](https://github.com/RyanLiu112/GenPRM)
  - A generative process reward model that performs explicit CoT reasoning with code verification before providing judgment for each reasoning step. | Task: Math
* 25.03 [OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement](https://arxiv.org/abs/2503.17352) | [Paper📑](https://arxiv.org/abs/2503.17352) [Code🖥️](https://github.com/yihedeng9/OpenVLThinker) [Dataset🤗](https://huggingface.co/ydeng9/OpenVLThinker-7B)
  - Investigate whether R1-like reasoning capabilities can be successfully integrated into LVLMs and assesses their impact on challenging multimodal reasoning tasks. | Task: Math
* 25.03 [R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization](https://arxiv.org/html/2503.12937v1) | [Paper📑](https://arxiv.org/html/2503.12937v1) [Code🖥️](https://github.com/jingyi0000/R1-VL) [Dataset🤗](https://github.com/jingyi0000/R1-VL)
  - Design Step-wise Group Relative Policy Optimization (StepGRPO) that enables MLLMs to self-improve reasoning ability. | Task: Math 
* 25.03 [LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL](https://arxiv.org/pdf/2503.07536) | [Paper📑](https://arxiv.org/pdf/2503.07536) [Code🖥️](https://github.com/TideDra/lmm-r1) [Dataset🤗](https://huggingface.co/datasets/VLM-Reasoner/VerMulti)
  - A two-stage rule-based RL framework that efficiently enhances reasoning capabilities | Task: Math & Sokoban
* 25.03 [VisualPRM: An Effective Process Reward Model for Multimodal Reasoning](https://arxiv.org/abs/2503.10291) | [Paper📑](https://arxiv.org/abs/2503.10291) [Code🖥️](https://github.com/OpenGVLab/InternVL) [Dataset🤗](https://huggingface.co/datasets/OpenGVLab/VisualProcessBench)
  - Improve the reasoning abilities of existing MLLMs with Best-of-N evaluation strategies | Task: Math & MMMU  
* 25.03 [R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization](https://arxiv.org/pdf/2503.10615) | [Paper📑](https://arxiv.org/pdf/2503.10615) [Code🖥️](https://github.com/Fancy-MLLM/R1-Onevision) [Dataset🤗](https://huggingface.co/datasets/Fancy-MLLM/R1-Onevision)
  - A multimodal reasoning model bridged the gap between multimodal capabilities and reasoning abilities with GRPO | Task: Math
* 25.03 [MMR1: Advancing the Frontiers of Multimodal Reasoning](https://github.com/LengSicong/MMR1) | [Code🖥️](https://github.com/LengSicong/MMR1)
  - a Large Multimodal Model specialized in mathematical tasks using GRPO | Task: Math
* 25.03 [Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning](https://arxiv.org/pdf/2503.07065) | [Paper📑](https://arxiv.org/pdf/2503.07065)
  - Improve generalization and reasoning of VLMs with GRPO | Task: Detection & Classification & Math
* 25.03 [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://arxiv.org/abs/2503.06749) | [Paper📑](https://arxiv.org/abs/2503.06749) [Code🖥️](https://github.com/Osilly/Vision-R1)
  - Improve reasoning ability of MLLMs with GRPO | Task: Math
* 25.03 [MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning](https://arxiv.org/abs/2503.07365) | [Paper📑](https://arxiv.org/abs/2503.07365) [Code🖥️](https://github.com/ModalMinds/MM-EUREKA) [Dataset🤗](https://huggingface.co/datasets/FanqingM/MM-Eureka-Dataset)
  - Extend large-scale rule-based reinforcement learning to multimodal reasoning | Task: Math
* 25.03 [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework](https://github.com/hiyouga/EasyR1) | [Code🖥️](https://github.com/hiyouga/EasyR1)
  - A multimodal GRPO training framework | Task: Math
* 25.02 [Qwen2.5-VL Technical Report](https://arxiv.org/pdf/2502.13923) | [Paper📑](https://arxiv.org/pdf/2502.13923) [Code🖥️](https://github.com/QwenLM/Qwen2.5-VL) [Huggingface🤗](https://huggingface.co/Qwen)
  - The latest flagship model of the Qwen vision-language series for various multimodal tasks | Task: Reasoning & Understanding
* 25.02 [Multimodal Open R1](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | [Code🖥️](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal)
  - An open-source codebase for reproducing R1 in the multimodal setting | Task: Math
* 25.02 [Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking](https://arxiv.org/pdf/2502.02339) | [Paper📑](https://arxiv.org/pdf/2502.02339)
  - An automated structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search | Task: Math
* 25.02 [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification](https://www.arxiv.org/pdf/2502.13383) | [Paper📑](https://www.arxiv.org/pdf/2502.13383) [Code🖥️](https://github.com/Aurora-slz/MM-Verify)
  - Enhance multimodal reasoning through longer inference and more robust verification. | Task: Math
* 25.01 [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/pdf/2501.12599) (MoonshotAI) | [Project🌐](https://github.com/MoonshotAI/Kimi-k1.5)
  - The latest flagship model of the Kimi series for various multimodal tasks | Task: Reasoning & Understanding
* 25.01 [Virgo: A Preliminary Exploration on Reproducing o1-like MLLM](https://arxiv.org/abs/2501.01904) | [Paper📑](https://arxiv.org/abs/2501.01904) [Code🖥️](https://github.com/RUCAIBox/Virgo) [Model🤗](https://huggingface.co/RUC-AIBOX/Virgo-72B)
  - An o1-like MLLM for multimodal reasoning | Task: Math & MMMU

<a name="ChartRasoning"></a>
### Chart Reasoning

* 26.03 [FireRed-OCR Technical Report](https://arxiv.org/abs/2603.01840) | [Paper📑](https://arxiv.org/abs/2603.01840)
  - Framework transforming general VLMs into pixel-precise document parsing experts via three-stage progressive training (pre-alignment, SFT, format-constrained GRPO), achieving 92.94% SOTA on OmniDocBench v1.5. | Task: Document Reasoning
* 26.02 [OCR-Agent: Agentic OCR with Capability and Memory Reflection](https://arxiv.org/abs/2602.21053) | [Paper📑](https://arxiv.org/abs/2602.21053) [Code🖥️](https://github.com/AIGeeksGroup/OCR-Agent)
  - Iterative self-correction framework using Capability Reflection (error diagnosis) and Memory Reflection (avoiding repeated attempts), achieving SOTA on OCRBench v2 without training. | Task: Document Reasoning
* 26.02 [OmniOCR: Generalist OCR for Ethnic Minority Languages](https://arxiv.org/abs/2602.21042) | [Paper📑](https://arxiv.org/abs/2602.21042) [Code🖥️](https://github.com/AIGeeksGroup/OmniOCR)
  - Universal OCR framework using Dynamic LoRA for low-resource ethnic minority scripts, achieving 39-66% accuracy improvements on Tibetan, Shui, and other scripts. | Task: Document Reasoning
* 26.02 [DODO: Discrete OCR Diffusion Models](https://arxiv.org/abs/2602.16872) | [Paper📑](https://arxiv.org/abs/2602.16872)
  - Adapts block discrete diffusion for OCR enabling parallel token processing, achieving up to 3× faster inference while maintaining near-SOTA accuracy. | Task: Document Reasoning
* 26.02 [PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing](https://arxiv.org/abs/2601.21957) | [Paper📑](https://arxiv.org/abs/2601.21957)
  - Compact 0.9B VLM for multi-task document parsing in diverse real-world conditions covering OCR, layout understanding, and chart comprehension. | Task: Document Reasoning
* 26.02 [ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images](https://arxiv.org/abs/2602.12203) | [Paper📑](https://arxiv.org/abs/2602.12203)
  - Benchmark unifying key entity extraction, relation extraction, and VQA for structured information extraction from document images, evaluating VLMs on schema adaptation and answer localization. | Task: Document Reasoning
* 26.02 [MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning](https://arxiv.org/abs/2601.21468) | [Paper📑](https://arxiv.org/abs/2601.21468)
  - Layout-aware visual memory mechanisms for MLLMs to improve long-horizon document and OCR reasoning efficiency. | Task: Document Reasoning
* 26.01 [ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch](https://arxiv.org/abs/2601.13606) | [Paper📑](https://arxiv.org/abs/2601.13606) [Code🖥️](https://github.com/starriver030515/ChartVerse) [Model🤗](https://huggingface.co/opendatalab/ChartVerse-8B) [Dataset🤗](https://huggingface.co/datasets/opendatalab/ChartVerse-SFT-1.8M)
  - Scalable chart reasoning framework using Rollout Posterior Entropy; ChartVerse-8B surpasses its teacher model Qwen3-VL-30B. | Task: Chart Reasoning
* 25.10 [From Charts to Code: A Hierarchical Benchmark for Multimodal Models](https://arxiv.org/abs/2510.17932) | [Paper📑](https://arxiv.org/abs/2510.17932)
  - Benchmark for chart understanding and code generation from charts. | Task: Chart Reasoning
* 25.09 [Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images](https://arxiv.org/abs/2509.07966) | [Paper📑](https://arxiv.org/abs/2509.07966)
  - Benchmark for visual question answering and reasoning over table images. | Task: Chart Reasoning
* 25.09 [MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing](https://arxiv.org/abs/2509.22186) | [Paper📑](https://arxiv.org/abs/2509.22186)
  - Efficient VLM for parsing and understanding high-resolution documents. | Task: Document Reasoning
* 25.09 [Visual Programmability: A Guide for Code-as-Thought in Chart Understanding](https://arxiv.org/abs/2509.09286) | [Paper📑](https://arxiv.org/abs/2509.09286) [Code🖥️](https://github.com/Aphelios-Tang/Code-as-Thought)
  - Introduce an adaptive framework that enables VLMs to dynamically choose between code-based and visual reasoning pathways for chart understanding. | Task: Chart Reasoning
* 25.07 [Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner](https://arxiv.org/abs/2507.15509) | [Paper📑](https://arxiv.org/abs/2507.15509) [Code🖥️](https://github.com/DocTron-hub/Chart-R1) [Model🤗](https://huggingface.co/collections/DocTron/chart-r1)
  - Combine chain-of-thought supervision with reinforcement learning, supported by programmatically synthesized step-by-step reasoning data. | Task: Chart Reasoning
* 25.06 [ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering](https://arxiv.org/abs/2506.10116) | [Paper📑](https://arxiv.org/abs/2506.10116)
  - Combine chart code generation with long-chain reasoning LLMs to produce detailed reasoning processes. | Task: Chart Reasoning
* 25.05 [Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning](https://arxiv.org/abs/2505.19702) | [Paper📑](https://arxiv.org/abs/2505.19702)
  - Introduce a visually grounded chain-of-thought (CoT) paradigm, enabling the model to generate CoT reasoning aligned with visual elements. | Task: Chart Reasoning
* 25.04 [Bespoke-MiniChart-7B: Pushing The Frontiers Of Open VLMs For Chart Understanding](https://www.bespokelabs.ai/blog/bespoke-minichart-7b) | [Project🌐](https://www.bespokelabs.ai/blog/bespoke-minichart-7b) [Model🤗](https://huggingface.co/bespokelabs/Bespoke-MiniChart-7B)
  - Employ a three-stage training process, combining rejection sampling and DPO optimization to enhance out-of-distribution generalization. | Task: Chart Reasoning
* 25.03 [MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding](https://arxiv.org/pdf/2503.13964v1) | [Paper📑](https://arxiv.org/pdf/2503.13964v1) [Code🖥️](https://github.com/aiming-lab/MDocAgent)
  - Integrate text and image retrieval through various agents, enabling collaborative reasoning across modalities. | Task: Document Reasoning
* 24.11 [ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild](https://arxiv.org/abs/2407.04172) | [Paper📑](https://arxiv.org/abs/2407.04172) [Code🖥️](https://github.com/vis-nlp/ChartGemma) [Model🤗](https://huggingface.co/ahmed-masry/chartgemma) [Dataset🤗](https://huggingface.co/datasets/ahmed-masry/ChartGemma)
  - Generate multi-task instruction-tuning data from real chart images and integrate both CoT and PoT reasoning pathways. | Task: Chart Reasoning
* 24.09 (ICLR25 Oral) [ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding](https://arxiv.org/abs/2409.03277) | [Paper📑](https://arxiv.org/abs/2409.03277) [Code🖥️](https://github.com/IDEA-FinAI/ChartMoE) [Model🤗](https://huggingface.co/IDEA-FinAI/chartmoe) [Dataset🤗](https://huggingface.co/datasets/Coobiw/ChartMoE-Data)
  - Utilize diverse chart-text aligned tasks (chart -> table/json/python-code) to augment chart understanding and reasoning. | Task: Chart Reasoning
* 24.09 [ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning](https://arxiv.org/abs/2402.12185) | [Project🌐](https://unimodal4reasoning.github.io/DocGenome_page/)  [Paper📑](https://arxiv.org/abs/2402.12185) [Code🖥️](https://github.com/Alpha-Innovator/ChartVLM)
  - Offer a new perspective on handling chart reasoning tasks that strongly depend on interpretable patterns. | Task: Chart Reasoning
* 24.07 (EMNLP24) [Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model](https://arxiv.org/abs/2407.07053) | [Paper📑](https://arxiv.org/abs/2407.07053)  [Project🌐](https://multi-modal-self-instruct.github.io/) [Code🖥️](https://github.com/zwq2018/Multi-modal-Self-instruct) [Dataset🤗](https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct)
  - A multi-modal self-instruct, utilizing large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. | Task: Chart Reasoning
* 24.04 (EMNLP24) [TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning](https://arxiv.org/abs/2404.16635) | [Paper📑](https://arxiv.org/abs/2404.16635) [Code🖥️](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart) [Model🤗](https://huggingface.co/mPLUG/TinyChart-3B-768-siglip) [Dataset🤗](https://huggingface.co/datasets/mPLUG/TinyChartData)
  - Employ PoT learning for numerical reasoning and Vision Token Merging to compress visual features from high-resolution images. | Task: Chart Reasoning
* 24.04 (MM24) [OneChart: Purify the Chart Structural Extraction via One Auxiliary Token](https://arxiv.org/abs/2404.09987) | [Paper📑](https://arxiv.org/abs/2404.09987) [Project🌐](https://onechartt.github.io/) [Code🖥️](https://github.com/LingyvKong/OneChart) [Model🤗](https://huggingface.co/kppkkp/OneChart) 
  - Introduce an auxiliary token and decoder combined with a customized L1 loss to enhance the reliability of structured and numerical information extraction. | Task: Chart Reasoning
* 24.04 (MM24) [NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models](https://dl.acm.org/doi/10.1145/3664647.3680790) | [Paper📑](https://dl.acm.org/doi/10.1145/3664647.3680790) [Code🖥️](https://github.com/Elucidator-V/NovaChart) [Dataset🤗](https://huggingface.co/datasets/ympan/novachart)
  - Construct a large-scale dataset for chart understanding and generation, covering 18 different chart types and 15 unique tasks. | Task: Chart Reasoning
* 24.02 (ACL24) [ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning](https://arxiv.org/abs/2401.02384) | [Paper📑](https://arxiv.org/abs/2401.02384) [Code🖥️](https://github.com/OpenGVLab/ChartAst) [Dataset🤗](https://huggingface.co/datasets/FanqingM/ChartAssistant)
  - Use large-scale chart data for chart-to-table alignment and multitask instruction tuning | Task: Chart Reasoning
* 23.11 [ChartLlama: A Multimodal LLM for Chart Understanding and Generation](https://arxiv.org/abs/2311.16483) | [Paper📑](https://arxiv.org/abs/2311.16483) [Project🌐](https://tingxueronghua.github.io/ChartLlama/) [Code🖥️](https://github.com/tingxueronghua/ChartLlama-code) [Model🤗](https://huggingface.co/listen2you002/ChartLlama-13b) [Dataset🤗](https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset)
  - Generate a diverse and high-quality instruction-tuning dataset using GPT-4, and use LLaVA for unified multi-task training. | Task: Chart Reasoning
* 23.10 (EMNLP23) [UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning](https://arxiv.org/abs/2305.14761) | [Paper📑](https://arxiv.org/abs/2305.14761) [Code🖥️](https://github.com/vis-nlp/UniChart) [Model🤗](https://huggingface.co/ahmed-masry/unichart-base-960) [Dataset🤗](https://huggingface.co/datasets/ahmed-masry/unichart-pretrain-data)
  - Pretrains on a large and diverse chart dataset, explicitly modeling visual elements and structures. | Task: Chart Reasoning

#### Benchmark
* 25.11 (EMNLP25) [ChartM3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension](https://arxiv.org/abs/2511.02415) | [Paper📑](https://arxiv.org/abs/2511.02415)
  - Provide an evaluation set of 2,871 high-quality samples covering 62 chart types and 60 real-world scenarios, focusing on multi-dimensional and multi-step visual reasoning and complex business analysis. | Task: Chart Reasoning
* 25.05 [ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models](https://arxiv.org/abs/2505.13444) | [Paper📑](https://arxiv.org/abs/2505.13444) [Project🌐](https://chartmuseum-leaderboard.github.io/) [Code🖥️](https://github.com/Liyan06/ChartMuseum) [Dataset🤗](https://huggingface.co/datasets/lytang/ChartMuseum)
  - Feature real-world chart images and four distinct question types that assess textual, visual, combined, and synthesis reasoning abilities. | Task: Chart Reasoning
* 25.04 [ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering](https://arxiv.org/abs/2504.05506v2) | [Paper📑](https://arxiv.org/abs/2504.05506v2) [Code🖥️](https://github.com/vis-nlp/ChartQAPro) [Dataset🤗](https://huggingface.co/datasets/ahmed-masry/ChartQAPro)
  - Introduce a diverse benchmark with 1,341 charts and 1,948 questions covering various chart types and question formats, designed to rigorously evaluate the chart reasoning capabilities of large vision-language models in real-world scenarios. | Task: Chart Reasoning
* 25.01 (AAAI25) [EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding](https://arxiv.org/abs/2409.01577) | [Paper📑](https://arxiv.org/abs/2409.01577) [Code🖥️](https://github.com/MuyeHuang/EvoChart) [Dataset🤗](https://huggingface.co/datasets/MuyeHuang/EvoChart-QA-Benchmark)
  - Feature 650 real-world charts, 1,250 expert-curated questions, and strict and flexible automatic evaluation metrics to assess chart comprehension abilities of VLMs in practical scenarios. | Task: Chart Reasoning
* 24.06 (NIPS24) [CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs](https://arxiv.org/abs/2406.18521) | [Paper📑](https://arxiv.org/abs/2406.18521) [Project🌐](https://charxiv.github.io/) [Code🖥️](https://github.com/princeton-nlp/CharXiv) [Dataset🤗](https://huggingface.co/datasets/princeton-nlp/CharXiv)
  - Focus on real and complex charts from arXiv papers, covering eight major domains. All content is expert-curated and verified, with evaluation using GPT-4o scoring and binary correctness metrics. | Task: Chart Reasoning
* 24.06 (VRISP25) [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/abs/2312.15915) | [Paper📑](https://arxiv.org/abs/2312.15915) [Project🌐](https://chartbench.github.io/) [Code🖥️](https://github.com/IDEA-FinAI/ChartBench) [Dataset🤗](https://huggingface.co/datasets/SincereX/ChartBench)
  - Cover 9 major categories and 42 subcategories of charts without data point annotations, emphasizing numerical extraction ability. | Task: Chart Reasoning
* 24.04 (NAACL24) [MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning](https://arxiv.org/abs/2311.10774) | [Paper📑](https://arxiv.org/abs/2311.10774) [Code🖥️](https://github.com/FuxiaoLiu/MMC) [Dataset🤗](https://huggingface.co/datasets/xywang1/MMC)
  - Propose a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over various charts, and support both GPT-4 scoring and multiple-choice exact matching. | Task: Chart Reasoning
* 22.05 (ACL22) [ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning](https://arxiv.org/abs/2203.10244) | [Paper📑](https://arxiv.org/abs/2203.10244) [Code🖥️](https://github.com/vis-nlp/ChartQA) [Dataset🤗](https://huggingface.co/datasets/ahmed-masry/ChartQA)
  - Use real-world charts and open-ended questions to evaluate chart understanding, reasoning, and data extraction, with relaxed accuracy as the metric. | Task: Chart Reasoning
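The relaxed-accuracy metric used by ChartQA (and several follow-up benchmarks) is commonly summarized as allowing a 5% relative error on numeric answers. A minimal sketch — official evaluators may normalize strings differently:

```python
def relaxed_accuracy(pred, target, tol=0.05):
    """Numeric answers count as correct within a relative tolerance
    (5% by default); non-numeric answers must match exactly,
    ignoring case and surrounding whitespace."""
    try:
        p, t = float(pred), float(target)
    except ValueError:
        return str(pred).strip().lower() == str(target).strip().lower()
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= tol

print(relaxed_accuracy("41", "42"))  # |41-42|/42 ≈ 2.4% → True
print(relaxed_accuracy("39", "42"))  # ≈ 7.1% error → False
```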


<a name="VisualGeneration"></a>
### Visual-Audio Generation
#### Image MLLM
* 26.02 [DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing](https://arxiv.org/abs/2602.12205) | [Paper📑](https://arxiv.org/abs/2602.12205) [Code🖥️](https://github.com/DeepGenTeam/DeepGen) [Model🤗](https://huggingface.co/deepgenteam/DeepGen-1.0)
  - Lightweight 5B unified model for image generation and editing using hierarchical feature extraction, learnable think tokens, and MR-GRPO reinforcement learning, outperforming much larger models. | Task: Image Generation
* 26.02 [UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model](https://arxiv.org/abs/2602.14178) | [Paper📑](https://arxiv.org/abs/2602.14178) [Code🖥️](https://github.com/shallowdream204/BitDance)
  - Unified discrete tokenizer with massive binary codebook (2^128) for high-fidelity image reconstruction and generation in multimodal LLMs, achieving FID 1.38 with lower training compute. | Task: Image Generation
* 26.02 [UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing](https://arxiv.org/abs/2602.02437) | [Paper📑](https://arxiv.org/abs/2602.02437) [Code🖥️](https://github.com/AlenjandroWang/UniReason) [Model🤗](https://huggingface.co/Alex11556666/UniReason)
  - Integrates text-to-image generation and editing through dual reasoning with world knowledge planning and visual refinement on reasoning-intensive benchmarks. | Task: Image Generation
* 26.02 [Generated Reality: Human-centric World Simulation using Interactive Video Generation](https://arxiv.org/abs/2602.18422) | [Paper📑](https://arxiv.org/abs/2602.18422) [Project🌐](https://codeysun.github.io/generated-reality/)
  - Human-centric video world model conditioned on tracked head and hand poses via bidirectional video diffusion for dexterous XR interactions. | Task: Image/Video Generation
* 26.01 [Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders](https://arxiv.org/abs/2601.10332) | [Paper📑](https://arxiv.org/abs/2601.10332) [Code🖥️](https://github.com/zhijie-group/Think-Then-Generate)
  - "Think-then-generate" paradigm where LLM encoders reason about prompts before image generation using Dual-GRPO reinforcement optimization. | Task: Image Generation
* 26.01 [Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing](https://arxiv.org/abs/2601.05124) | [Paper📑](https://arxiv.org/abs/2601.05124) [Code🖥️](https://github.com/hrz2000/realign)
  - Bridges multimodal understanding and image generation via In-Context Chain-of-Thought (IC-CoT) with RL-based training. | Task: Image Generation & Editing
* 26.01 [Unified Thinker: A General Reasoning Modular Core for Image Generation](https://arxiv.org/abs/2601.03127) | [Paper📑](https://arxiv.org/abs/2601.03127)
  - General reasoning modular core enhancing image generation models with chain-of-thought reasoning capabilities. | Task: Image Generation
* 25.12 [REASONEDIT: Towards Reasoning-Enhanced Image Editing Models](https://arxiv.org/abs/2511.22625) | [Paper📑](https://arxiv.org/abs/2511.22625)
  - Enhances image editing models with explicit reasoning capabilities. | Task: Image Editing
* 25.12 [EditThinker: Unlocking Iterative Reasoning for Any Image Editor](https://arxiv.org/abs/2512.05965) | [Paper📑](https://arxiv.org/abs/2512.05965)
  - Enables iterative reasoning in image editing through a reasoning-aware framework. | Task: Image Editing
* 25.11 [IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment](https://arxiv.org/abs/2511.18055) | [Paper📑](https://arxiv.org/abs/2511.18055) [Code🖥️](https://github.com/Coobiw/IE-Critic-R1) [Model🤗](https://huggingface.co/Coobiw/IE-Critic-R1-7B) [Dataset🤗](https://huggingface.co/datasets/Coobiw/IE-Bench-4k) [ColdStart SFT🤗](https://huggingface.co/datasets/Coobiw/IE-Bench-CoT-mixed)
  - IE-Critic-R1 treats image editing quality assessment as a reasoning task and exhibits an "R1 moment" (longer reasoning thoughts, better performance). It is a pointwise, generative reward model that leverages Chain-of-Thought (CoT) reasoning SFT and RLVR to provide accurate, human-aligned evaluations of image editing. | Task: Image Editing Quality Assessment
* 25.05 [T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT](https://arxiv.org/pdf/2505.00703) | [Paper📑](https://arxiv.org/pdf/2505.00703) [Code🖥️](https://github.com/CaraJ7/T2I-R1)
  - A reasoning-enhanced text-to-image generation model powered by RL with a bi-level CoT reasoning process | Task: Image Generation
* 25.03 [GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing](https://arxiv.org/pdf/2503.10639) | [Paper📑](https://arxiv.org/pdf/2503.10639) 
  - A paradigm that enables generation and editing through an explicit language reasoning process before outputting images   | Task: Image Generation
* 25.03 [Unified Reward Model for Multimodal Understanding and Generation](https://arxiv.org/abs/2503.05236) | [Paper📑](https://arxiv.org/abs/2503.05236) [Code🖥️](https://codegoat24.github.io/UnifiedReward/) [Dataset🤗](https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede)
  - Improve MLLM's understanding and generation ability with DPO | Task: VQA & Generation
* 25.01 (CVPR25) [Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step](https://arxiv.org/pdf/2501.13926) | [Paper📑](https://arxiv.org/pdf/2501.13926) [Code🖥️](https://github.com/ZiyuGuo99/Image-Generation-CoT) [Model🤗](https://huggingface.co/ZiyuG/Image-Generation-CoT)
  - The first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. | Task: Image Generation
* 24.12 [EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing](https://arxiv.org/pdf/2412.10566) | [Paper📑](https://arxiv.org/pdf/2412.10566)
  - A system that interprets editing instructions in conjunction with reference visuals, producing precise and context-aware editing prompts. | Task: Image Editing
#### Video MLLM
* 26.02 [SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model](https://arxiv.org/abs/2602.21818) | [Paper📑](https://arxiv.org/abs/2602.21818)
  - Unified multimodal video foundation model enabling simultaneous video+audio generation, editing, and inpainting via dual-stream architecture, supporting 1080p/32FPS/15s with synchronized audio. | Task: Video Generation
* 26.03 [Helios: Real Real-Time Long Video Generation Model](https://arxiv.org/abs/2603.04379) | [Paper📑](https://arxiv.org/abs/2603.04379) [Code🖥️](https://github.com/PKU-YuanGroup/Helios) [Model🤗](https://huggingface.co/BestWishYsh/Helios-Distilled)
  - 14B autoregressive diffusion model achieving real-time video generation at 19.5 FPS on a single H100 GPU while supporting minute-scale synthesis without conventional anti-drifting heuristics. | Task: Video Generation
* 26.03 [DreamWorld: Unified World Modeling in Video Generation](https://arxiv.org/abs/2603.00466) | [Paper📑](https://arxiv.org/abs/2603.00466)
  - Unified world modeling framework in video generation with representation alignment for improved physical consistency and scene understanding. | Task: Video Generation
* 26.03 [RealWonder: Real-Time Physical Action-Conditioned Video Generation](https://arxiv.org/abs/2603.05449) | [Paper📑](https://arxiv.org/abs/2603.05449) [Project🌐](https://liuwei283.github.io/RealWonder)
  - First real-time system for action-conditioned video generation using physics simulation as intermediate bridge, achieving 13.2 FPS at 480×832 for interactive exploration of forces and robot actions. | Task: Video Generation
* 26.02 [AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories](https://arxiv.org/abs/2602.14941) | [Paper📑](https://arxiv.org/abs/2602.14941) [Code🖥️](https://github.com/wz0919/AnchorWeave)
  - Addresses long-term video generation consistency using multiple local geometric memories and multi-anchor weaving controller for camera-controllable long-horizon scene generation. | Task: Video Generation
* 26.02 [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://arxiv.org/abs/2602.08683) | [Paper📑](https://arxiv.org/abs/2602.08683) [Code🖥️](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder) [Model🤗](https://huggingface.co/collections/lmms-lab-encoder/onevision-encoder)
  - Vision encoder applying codec-aligned sparsity to focus on 3.1%-25% of signal-rich patches, outperforming Qwen3-ViT and SigLIP2 across 16 benchmarks with 4.1% average improvement on video understanding. | Task: Video Understanding
* 26.02 [CoPE-VideoLM: Codec Primitives For Efficient Video Language Models](https://arxiv.org/abs/2602.13191) | [Paper📑](https://arxiv.org/abs/2602.13191)
  - Uses video codec primitives (motion vectors and residuals) instead of dense per-frame embeddings, reducing time-to-first-token by up to 86% and token usage by up to 93% across 14 video benchmarks. | Task: Video Understanding
* 26.02 [Solaris: Building a Multiplayer Video World Model in Minecraft](https://arxiv.org/abs/2602.22208) | [Paper📑](https://arxiv.org/abs/2602.22208) [Code🖥️](https://github.com/solaris-wm/solaris) [Model🤗](https://huggingface.co/nyu-visionx/solaris)
  - Multiplayer video world model for consistent multi-view observations in coordinated multi-agent Minecraft environments using Checkpointed Self Forcing technique. | Task: Video Generation
* 26.02 [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://arxiv.org/abs/2602.08794) | [Paper📑](https://arxiv.org/abs/2602.08794) [Code🖥️](https://github.com/OpenMOSS/MOVA) [Model🤗](https://huggingface.co/collections/OpenMOSS-Team/mova)
  - Open-source 32B MoE model generating high-quality synchronized audio-visual content including lip-synced speech, environment sounds, and music from image-text inputs. | Task: Video-Audio Generation
* 25.11 [Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation](https://arxiv.org/abs/2511.14993) | [Paper📑](https://arxiv.org/abs/2511.14993)
  - Foundation model family for image and video generation. | Task: Video Generation
* 25.11 [Planning with Sketch-Guided Verification for Physics-Aware Video Generation](https://arxiv.org/abs/2511.17450) | [Paper📑](https://arxiv.org/abs/2511.17450)
  - Physics-aware video generation with sketch-based planning and verification. | Task: Video Generation
* 25.10 [PhysMaster: Mastering Physical Representation for Video Generation via RL](https://arxiv.org/abs/2510.13809) | [Paper📑](https://arxiv.org/abs/2510.13809)
  - Physical reasoning for video generation with reinforcement learning. | Task: Video Generation
* 25.02 [C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation](https://arxiv.org/pdf/2502.19868) | [Paper📑](https://arxiv.org/pdf/2502.19868) [Code🖥️](https://github.com/WesLee88524/C-Drag-Official-Repo) [Dataset🤗](https://drive.google.com/file/d/1L2SYadeqZPObvSj9Mb6fK-KHtR0n-DKk/view)
  - A Chain-of-Thought-based motion controller for controllable video generation | Task: Video Generation
#### Audio MLLM
* 26.02 [AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization](https://arxiv.org/abs/2602.07054) | [Paper📑](https://arxiv.org/abs/2602.07054) [Dataset🤗](https://huggingface.co/datasets/chaubeyG/EmoReAlM)
  - AVEm-DPO preference optimization improves audiovisual emotion reasoning in MLLMs by aligning responses with audiovisual cues and reducing text-prior hallucinations. | Task: Audio-Visual Reasoning
* 26.02 [EgoAVU: Egocentric Audio-Visual Understanding](https://arxiv.org/abs/2602.06139) | [Paper📑](https://arxiv.org/abs/2602.06139) [Dataset🤗](https://huggingface.co/datasets/facebook/EgoAVU_data)
  - Scalable data engine and 3M-sample dataset for egocentric audio-visual understanding, enabling up to 113% performance improvement on joint audio-visual reasoning tasks. | Task: Audio-Visual Reasoning
* 26.01 [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://arxiv.org/abs/2601.03233) | [Paper📑](https://arxiv.org/abs/2601.03233) [Code🖥️](https://github.com/Lightricks/LTX-2) [Model🤗](https://huggingface.co/Lightricks/LTX-2)
  - Open-source 14B+5B asymmetric dual-stream audiovisual diffusion model generating synchronized video and audio with bidirectional cross-attention. | Task: Audio-Visual Generation
* 25.11 [UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions](https://arxiv.org/abs/2511.03334) | [Paper📑](https://arxiv.org/abs/2511.03334)
  - Unified audio-video generation using cross-modal interactions. | Task: Audio-Visual Generation
* 25.11 [Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy](https://arxiv.org/abs/2511.21579) | [Paper📑](https://arxiv.org/abs/2511.21579)
  - Harmonizes audio and video generation via cross-task synergy. | Task: Audio-Visual Generation
* 25.06 [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://arxiv.org/abs/2506.21448) | [Paper📑](https://arxiv.org/abs/2506.21448)
  - Apply Chain-of-Thought reasoning in MLLMs to audio generation and editing. | Task: Audio Generation & Editing


<a name="reasoning-with-agent"></a>
### Reasoning with Agent/Tool
* 26.02 [Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents](https://arxiv.org/abs/2602.16855) | [Paper📑](https://arxiv.org/abs/2602.16855) [Code🖥️](https://github.com/X-PLUG/MobileAgent) [Model🤗](https://huggingface.co/mPLUG/GUI-Owl-1.5-8B-Think)
  - GUI-Owl-1.5 multi-platform GUI agent family achieving SOTA on GUI automation (56.5 OSWorld, 71.6 AndroidWorld) and grounding (80.3 ScreenSpotPro) via MRPO multi-platform RL. | Task: GUI Agent
* 26.02 [GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL](https://arxiv.org/abs/2602.22190) | [Paper📑](https://arxiv.org/abs/2602.22190) [Code🖥️](https://github.com/GUI-Libra/GUI-Libra) [Model🤗](https://huggingface.co/collections/Ray2333/gui-libra)
  - Trains open-source GUI agents using action-aware SFT (81K curated dataset) and conservative RL with KL regularization for web and mobile tasks. | Task: GUI Agent
* 26.02 [PyVision-RL: Forging Open Agentic Vision Models via RL](https://arxiv.org/abs/2602.20739) | [Paper📑](https://arxiv.org/abs/2602.20739) [Code🖥️](https://github.com/agents-x-project/PyVision-RL) [Model🤗](https://huggingface.co/Agents-X/PyVision-Image-7B-RL)
  - RL framework for open-weight multimodal agents using oversampling-filtering-ranking rollout; releases PyVision-Image-7B and PyVision-Video-7B for tool-augmented reasoning. | Task: Agent/Tool Use
* 26.02 [Computer-Using World Model](https://arxiv.org/abs/2602.17365) | [Paper📑](https://arxiv.org/abs/2602.17365)
  - World model for desktop software predicting UI state changes via two-stage factorization to help agents simulate candidate actions before execution. | Task: GUI Agent
* 26.02 [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://arxiv.org/abs/2602.06034) | [Paper📑](https://arxiv.org/abs/2602.06034) [Code🖥️](https://github.com/chendy25/V-Retrver) [Model🤗](https://huggingface.co/V-Retrver/V-Retrver-SFT-7B)
  - Reformulates multimodal retrieval as an agentic reasoning process where an MLLM selectively acquires visual evidence via external tools, achieving 23% average improvement. | Task: Agent/Tool Use
* 26.02 [Reasoning-Augmented Representations for Multimodal Retrieval](https://arxiv.org/abs/2602.07125) | [Paper📑](https://arxiv.org/abs/2602.07125) [Code🖥️](https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval)
  - Data-centric framework externalizing reasoning before retrieval by using VLMs to densely caption visual evidence and resolve ambiguous multimodal queries. | Task: Agent/Tool Use
* 26.02 [WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents](https://arxiv.org/abs/2601.21872) | [Paper📑](https://arxiv.org/abs/2601.21872) [Code🖥️](https://github.com/yaoz720/GroundedPRM)
  - Reasoning-first WebPRM formulating reward modeling as text generation to improve web navigation through structured justifications and preference verdicts (ICLR 2026). | Task: GUI Agent
* 26.02 [Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation](https://arxiv.org/abs/2602.05827) | [Paper📑](https://arxiv.org/abs/2602.05827) [Code🖥️](https://github.com/opendrivelab/sparsevideonav)
  - SparseVideoNav uses video generation for sparse future planning in beyond-the-view VLN tasks, achieving 27× speed-up and 2.5× higher success rate over LLM baselines. | Task: Visual Reasoning Agent
* 26.02 [WebWorld: A Large-Scale World Model for Web Agent Training](https://arxiv.org/abs/2602.14721) | [Paper📑](https://arxiv.org/abs/2602.14721)
  - Open-web simulator trained on 1M+ interactions enabling long-horizon reasoning for web agents; models trained on WebWorld-synthesized trajectories show +9.2% improvement on WebArena. | Task: GUI Agent
* 26.02 [AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines](https://arxiv.org/abs/2602.14296) | [Paper📑](https://arxiv.org/abs/2602.14296)
  - Synthesizes controllable web environments as FSMs translated to interactive websites by coding agents for automated trajectory generation at $0.04/trajectory, with 7B agent outperforming baselines on WebVoyager. | Task: GUI Agent
* 26.02 [MMA: Multimodal Memory Agent](https://arxiv.org/abs/2602.16493) | [Paper📑](https://arxiv.org/abs/2602.16493) [Code🖥️](https://github.com/AIGeeksGroup/MMA)
  - Improves long-horizon multimodal agent performance via dynamic memory reliability scoring and introduces the "Visual Placebo Effect" with MMA-Bench for evaluating belief dynamics. | Task: Multimodal Agent
* 26.03 [CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification](https://arxiv.org/abs/2603.01940) | [Paper📑](https://arxiv.org/abs/2603.01940)
  - Post-training data synthesis framework for interactive tool-use agents using constraint-guided verification; compact 4B model achieves success rates competitive with 17× larger models on τ²-bench. | Task: Tool Use Agent
* 26.03 [AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios](https://arxiv.org/abs/2602.23166) | [Paper📑](https://arxiv.org/abs/2602.23166)
  - Benchmark evaluating multimodal agents in ultra-challenging realistic visual scenarios requiring complex visual reasoning and multi-step planning. | Task: Multimodal Agent
* 26.01 [AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning](https://arxiv.org/abs/2601.18631) | [Paper📑](https://arxiv.org/abs/2601.18631) [Code🖥️](https://github.com/ssmisya/AdaReasoner) [Model🤗](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Randomized)
  - Multimodal model family learning tool usage as a reasoning skill via Tool-GRPO, +24.9% improvement surpassing GPT-4 on visual reasoning benchmarks. | Task: Visual Reasoning with Tools
* 26.01 [SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning](https://arxiv.org/abs/2512.24330) | [Paper📑](https://arxiv.org/abs/2512.24330)
  - Multimodal agentic reasoning and search framework using RL to empower visual reasoning with agent capabilities. | Task: Multimodal Agentic Reasoning
* 26.01 [EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience](https://arxiv.org/abs/2601.15876) | [Paper📑](https://arxiv.org/abs/2601.15876) [Code🖥️](https://github.com/meituan/EvoCUA) [Model🤗](https://huggingface.co/meituan/EvoCUA-32B-20260105)
  - SOTA computer-use agent (56.7% OSWorld) using autonomous task generation and iterative evolving learning with self-correction. | Task: GUI Agent
* 26.01 [DocDancer: Towards Agentic Document-Grounded Information Seeking](https://arxiv.org/abs/2601.05163) | [Paper📑](https://arxiv.org/abs/2601.05163)
  - Agentic framework for document-grounded multimodal information seeking and reasoning. | Task: Document Reasoning Agent
* 26.01 [ShowUI-pi: Flow-based Generative Models as GUI Dexterous Hands](https://arxiv.org/abs/2512.24965) | [Paper📑](https://arxiv.org/abs/2512.24965)
  - Flow-based generative models applied as GUI interaction agents with visual reasoning capabilities. | Task: GUI Agent
* 26.01 [PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent](https://arxiv.org/abs/2601.09636) | [Paper📑](https://arxiv.org/abs/2601.09636)
  - Personalized GUI agent aligning hierarchical implicit user intent with long-term user-centric records. | Task: GUI Agent
* 25.12 [Step-GUI Technical Report](https://arxiv.org/abs/2512.15431) | [Paper📑](https://arxiv.org/abs/2512.15431)
  - Step-by-step GUI agent with visual understanding. | Task: GUI Agent
* 25.12 [MAI-UI Technical Report: Real-World Centric Foundation GUI Agents](https://arxiv.org/abs/2512.22047) | [Paper📑](https://arxiv.org/abs/2512.22047)
  - Foundation model for real-world GUI agent interaction with visual grounding. | Task: GUI Agent
* 25.11 [Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything](https://arxiv.org/abs/2511.02834) | [Paper📑](https://arxiv.org/abs/2511.02834)
  - Coordinates specialized models at test time for multimodal reasoning. | Task: Multimodal Agent
* 25.11 [DeepEyesV2: Toward Agentic Multimodal Model](https://arxiv.org/abs/2511.05271) | [Paper📑](https://arxiv.org/abs/2511.05271)
  - Agentic multimodal model with tool-use and reasoning capabilities. | Task: Multimodal Agent
* 25.11 [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://arxiv.org/abs/2511.15705) | [Paper📑](https://arxiv.org/abs/2511.15705)
  - Combines visual reasoning with web augmentation for agentic geolocalization. | Task: Visual Reasoning Agent
* 25.10 [AudioToolAgent: An Agentic Framework for Audio-Language Models](https://arxiv.org/abs/2510.02995v1) | [Paper📑](https://arxiv.org/abs/2510.02995v1)
  - Agentic framework enabling tool use for audio-language models. | Task: Audio Agent
* 25.10 [GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness](https://arxiv.org/abs/2510.00536) | [Paper📑](https://arxiv.org/abs/2510.00536)
  - Efficient GUI agents via a spatio-temporal-aware KV cache. | Task: GUI Agent
* 25.09 [UItron: Foundational GUI Agent with Advanced Perception and Planning](https://arxiv.org/abs/2508.21767) | [Paper📑](https://arxiv.org/abs/2508.21767)
  - Multimodal agent for GUI understanding and interaction. | Task: GUI Agent
* 25.09 [BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent](https://arxiv.org/abs/2509.15566) | [Paper📑](https://arxiv.org/abs/2509.15566)
  - Reasoning model for GUI agent visual understanding and interaction. | Task: GUI Agent
* 25.08 [Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation](https://arxiv.org/abs/2508.04418) | [Paper📑](https://arxiv.org/abs/2508.04418)
  - Object-aware reasoning agent for referring audio-visual segmentation. | Task: Audio-Visual Segmentation
* 25.08 [OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use](https://arxiv.org/abs/2508.04482) | [Paper📑](https://arxiv.org/abs/2508.04482)
  - Survey of MLLM-based agents that operate computing devices via visual understanding. | Task: GUI Agent
* 25.08 [InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731) | [Paper📑](https://arxiv.org/abs/2508.05731)
  - Multimodal agent for GUI understanding with visual grounding and adaptive exploration. | Task: GUI Agent
* 25.08 [CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent](https://arxiv.org/abs/2508.20096) | [Paper📑](https://arxiv.org/abs/2508.20096)
  - Dual-brain architecture for multimodal computer-use agent with decoupled RL. | Task: GUI Agent
* 25.06 [Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning](https://arxiv.org/abs/2506.13654) | [Paper📑](https://arxiv.org/pdf/2506.13654) [Code🖥️](https://github.com/egolife-ai/Ego-R1) [Project🌐](https://egolife-ai.github.io/Ego-R1/)
* 25.05 [ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning](https://arxiv.org/abs/2503.19470) | [Paper📑](https://arxiv.org/pdf/2503.19470) [Code🖥️](https://github.com/Agent-RL/ReCall)
* 25.05 [Reinforcement Learning for Long-Horizon Interactive LLM Agents](https://arxiv.org/abs/2502.01600) | [Paper📑](https://arxiv.org/pdf/2502.01600)
* 25.05 [RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning](https://arxiv.org/abs/2504.20073) | [Paper📑](https://arxiv.org/pdf/2504.20073) [Code🖥️](https://github.com/RAGEN-AI/RAGEN) [Project🌐](https://ragen-ai.github.io/)
* 25.05 [Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning](https://arxiv.org/abs/2505.00024) | [Paper📑](https://arxiv.org/pdf/2505.00024) [Code🖥️](https://github.com/NVlabs/Tool-N1)
* 25.05 [Agent RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving](https://arxiv.org/abs/2505.07773) | [Paper📑](https://arxiv.org/pdf/2505.07773) [Code🖥️](https://github.com/yyht/openrlhf_async_pipline)
* 25.04 [ToolRL: Reward is All Tool Learning Needs](https://arxiv.org/abs/2504.13958) | [Paper📑](https://arxiv.org/pdf/2504.13958) [Code🖥️](https://github.com/qiancheng0/ToolRL)
* 25.04 [Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning](https://arxiv.org/abs/2503.09516v4) | [Paper📑](https://arxiv.org/pdf/2503.09516v4) [Code🖥️](https://github.com/PeterGriffinJin/Search-R1)
* 25.04 [Acting Less is Reasoning More! Teaching Model to Act Efficiently](https://arxiv.org/abs/2504.14870) | [Paper📑](https://arxiv.org/abs/2504.14870)
* 25.04 [Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning](https://arxiv.org/abs/2505.01441) | [Paper📑](https://arxiv.org/abs/2505.01441)
* 25.04 [DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments](https://arxiv.org/abs/2504.03160) | [Paper📑](https://arxiv.org/pdf/2504.03160) [Code🖥️](https://github.com/GAIR-NLP/DeepResearcher)
* 25.03 [TORL: Scaling Tool-Integrated RL](https://arxiv.org/abs/2503.23383) | [Paper📑](https://arxiv.org/pdf/2503.23383) [Code🖥️](https://github.com/GAIR-NLP/ToRL)
* 25.03 [R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2503.05592) | [Paper📑](https://arxiv.org/pdf/2503.05592)
* 25.02 (CVPR25) [Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation](https://arxiv.org/abs/2412.01694) | [Paper📑](https://arxiv.org/pdf/2412.01694)
* 24.12 (ECCV24) [VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding](https://arxiv.org/abs/2403.11481) | [Paper📑](https://arxiv.org/abs/2403.11481) [Code🖥️](https://github.com/YueFan1014/VideoAgent) [Project🌐](https://videoagent.github.io/)
  - Explores how reconciling several foundation models through a novel unified memory mechanism can tackle the challenging video understanding problem. | Task: Video captioning & QA

<a name="medical-reasoning"></a>
### Medical Reasoning
#### Image MLLM
* 26.02 [MediX-R1: Open Ended Medical Reinforcement Learning](https://arxiv.org/abs/2602.23363) | [Paper📑](https://arxiv.org/abs/2602.23363) [Code🖥️](https://github.com/mbzuai-oryx/MediX-R1) [Model🤗](https://huggingface.co/MBZUAI/MediX-R1-8B) [Dataset🤗](https://huggingface.co/datasets/MBZUAI/medix-rl-data)
  - Open-ended RL framework for medical MLLMs enabling free-form clinical answers via Group-Based RL with composite rewards; 8B model outperforms 27B MedGemma with ~51K training samples. | Task: Medical Reasoning
* 26.02 [Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making](https://arxiv.org/abs/2602.06570) | [Paper📑](https://arxiv.org/abs/2602.06570) [Code🖥️](https://github.com/baichuan-inc/Baichuan-M3-235B) [Model🤗](https://huggingface.co/baichuan-inc/Baichuan-M3-235B)
  - Medical LLM shifting from passive Q&A to active clinical-grade decision support via proactive information acquisition, long-horizon reasoning, and hallucination suppression, achieving SOTA on HealthBench. | Task: Medical Reasoning
* 26.02 [MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning](https://arxiv.org/abs/2602.03320) | [Paper📑](https://arxiv.org/abs/2602.03320) [Code🖥️](https://github.com/CUHK-AIM-Group/MedSAM-Agent) [Model🤗](https://huggingface.co/Saint-lsy/MedSAM-Agent-Qwen3-VL-8B-MedSAM2)
  - Reformulates medical image segmentation as multi-step decision-making using hybrid prompting and two-stage training with process rewards for autonomous reasoning. | Task: Medical Reasoning
* 26.02 [Hepato-LLaVA: An Expert MLLM for Hepatocellular Pathology Analysis on Whole Slide Images](https://arxiv.org/abs/2602.19424) | [Paper📑](https://arxiv.org/abs/2602.19424) [Project🌐](https://pris-cv.github.io/Hepto-LLaVA/)
  - Specialized MLLM for hepatocellular carcinoma diagnosis with Sparse Topo-Pack Attention modeling tissue topology; includes HepatoPathoVQA (33K expert-validated Q&A pairs). | Task: Medical Reasoning
* 26.02 [MedCLIPSeg: Probabilistic Vision-Language Adaptation for Medical Image Segmentation](https://arxiv.org/abs/2602.20423) | [Paper📑](https://arxiv.org/abs/2602.20423) [Code🖥️](https://github.com/HealthX-Lab/MedCLIPSeg) [Model🤗](https://huggingface.co/TahaKoleilat/MedCLIPSeg)
  - Adapts CLIP for medical image segmentation via Probabilistic Vision-Language Adapter with uncertainty-aware attention, tested across 16 datasets spanning 5 modalities and 6 organs. | Task: Medical Reasoning
* 26.02 [MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs](https://arxiv.org/abs/2602.12705) | [Paper📑](https://arxiv.org/abs/2602.12705)
  - Medical VLM combining entity-aware continual pretraining for rare diseases, RL for expert-level reasoning, and tool-augmented agentic training for multi-step diagnostic reasoning with reduced hallucination. | Task: Medical Reasoning
* 26.02 [ClinAlign: Scaling Healthcare Alignment from Clinician Preference](https://arxiv.org/abs/2602.09653) | [Paper📑](https://arxiv.org/abs/2602.09653) [Code🖥️](https://github.com/AQ-MedAI/ClinAlign) [Model🤗](https://huggingface.co/AQ-MedAI/ClinAlign-30B-A3B)
  - Two-stage LLM alignment using physician-verified examples and distilled clinical principles, with a 30B model activating 3B parameters at inference achieving SOTA on medical benchmarks. | Task: Medical Reasoning
* 26.02 [Uncertainty-Aware Vision-Language Segmentation for Medical Imaging](https://arxiv.org/abs/2602.14498) | [Paper📑](https://arxiv.org/abs/2602.14498) [Code🖥️](https://github.com/arya-domain/UA-VLS)
  - Multimodal segmentation with Modality Decoding Attention Blocks (MoDAB) and Spectral-Entropic Uncertainty Loss for medical image segmentation from radiological images and clinical text. | Task: Medical Reasoning
* 26.03 [When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains](https://arxiv.org/abs/2603.01301) | [Paper📑](https://arxiv.org/abs/2603.01301) [Project🌐](https://medbridgerl.github.io)
  - Controlled study disentangling vision, SFT, and RL effects in medical VLMs, proposing a boundary-aware recipe: bridge support first, then sharpen with RL for optimal medical reasoning. | Task: Medical Reasoning
* 26.01 [UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation](https://arxiv.org/abs/2601.11522) | [Paper📑](https://arxiv.org/abs/2601.11522) [Code🖥️](https://github.com/ZrH42/UniX) [Model🤗](https://huggingface.co/ZrH42/UniX)
  - Unified medical foundation model combining autoregressive understanding and diffusion generation for chest X-rays, achieving a +46.1% gain on understanding tasks. | Task: Medical Image Understanding & Generation
* 25.12 [OralGPT-Omni: A Versatile Dental Multimodal Large Language Model](https://arxiv.org/abs/2511.22055) | [Paper📑](https://arxiv.org/abs/2511.22055)
  - Versatile dental MLLM for oral health diagnosis and reasoning across modalities. | Task: Medical Reasoning
* 25.12 [DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry](https://arxiv.org/abs/2512.11558) | [Paper📑](https://arxiv.org/abs/2512.11558)
  - Incentivizes complex multimodal reasoning for dental diagnosis and treatment. | Task: Medical Reasoning
* 25.12 [Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning](https://arxiv.org/abs/2512.03667) | [Paper📑](https://arxiv.org/abs/2512.03667)
  - Advances colonoscopy with multimodal understanding and clinical reasoning capabilities. | Task: Medical Reasoning
* 25.10 [M3Retrieve: Benchmarking Multimodal Retrieval for Medicine](https://arxiv.org/abs/2510.06888) | [Paper📑](https://arxiv.org/abs/2510.06888)
  - Multimodal retrieval benchmark for medical domain. | Task: Medical Reasoning
* 25.09 [MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection](https://arxiv.org/abs/2509.03800) | [Paper📑](https://arxiv.org/abs/2509.03800)
  - VLM for medical 3D CT analysis to reduce diagnostic errors. | Task: Medical Reasoning
* 25.08 [MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine](https://arxiv.org/abs/2508.02951) | [Paper📑](https://arxiv.org/abs/2508.02951)
  - Tests multimodal LLMs on basic medical visual perception tasks. | Task: Medical Reasoning
#### Audio MLLM
* 25.04 (ICASSP 2025) [AuscMLLM: Bridging Classification and Reasoning in Heart Sound Analysis with a Multimodal Large Language Model](https://ieeexplore.ieee.org/document/10889373)
* 24.09 (JBHI 2024) [Multi-Task Learning for Audio-Based Infant Cry Detection and Reasoning](https://ieeexplore.ieee.org/document/10663705)
#### Omni MLLM
* 25.06 (ACL 2025) [MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration](https://aclanthology.org/2025.findings-acl.1298/) | [Paper📑](https://aclanthology.org/2025.findings-acl.1298/) [Code🖥️](https://github.com/yczhou001/MAM) 

<a name="embodied-reasoning"></a>
### Embodied Reasoning

* 26.02 [VLANeXt: Recipes for Building Strong VLA Models](https://arxiv.org/abs/2602.18532) | [Paper📑](https://arxiv.org/abs/2602.18532) [Code🖥️](https://github.com/DravenALG/VLANeXt) [Model🤗](https://huggingface.co/DravenALG/VLANeXt)
  - Systematically identifies 12 key design findings across foundational components for VLA models, yielding SOTA simulation and real-world benchmark performance (CVPR 2026). | Task: Robot Control

* 26.02 [SimVLA: A Simple VLA Baseline for Robotic Manipulation](https://arxiv.org/abs/2602.18224) | [Paper📑](https://arxiv.org/abs/2602.18224) [Code🖥️](https://github.com/LUOyk1999/SimVLA) [Model🤗](https://huggingface.co/YuankaiLuo/SimVLA-LIBERO)
  - Minimal VLA baseline strictly decoupling perception from control with standard VL backbone, achieving SOTA on simulation benchmarks with only 0.5B parameters. | Task: Robotic Manipulation

* 26.02 [GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning](https://arxiv.org/abs/2602.12099) | [Paper📑](https://arxiv.org/abs/2602.12099) [Code🖥️](https://github.com/open-gigaai/giga-brain-0) [Project🌐](https://gigabrain05m.github.io/)
  - VLA trained via world model-based RL (RAMP) on 10,000+ hours of robot data, achieving ~30% improvement on challenging tasks like laundry folding and espresso preparation. | Task: Robotic Manipulation

* 26.02 [Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning](https://arxiv.org/abs/2602.07845) | [Paper📑](https://arxiv.org/abs/2602.07845) [Code🖥️](https://github.com/rd-vla/rd-vla) [Project🌐](https://rd-vla.github.io/)
  - Recurrent VLA using latent iterative refinement instead of chain-of-thought tokens to adaptively scale compute at inference, achieving 0%→90%+ task success with 4 iterations. | Task: Robotic Manipulation

* 26.02 [VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model](https://arxiv.org/abs/2602.10098) | [Paper📑](https://arxiv.org/abs/2602.10098) [Code🖥️](https://github.com/ginwind/VLA-JEPA) [Model🤗](https://huggingface.co/ginwind/VLA-JEPA)
  - JEPA-style pretraining for VLA policies predicting future latent states from current observations, improving robustness to camera motion and irrelevant backgrounds. | Task: Robotic Manipulation

* 26.02 [DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos](https://arxiv.org/abs/2602.06949) | [Paper📑](https://arxiv.org/abs/2602.06949) [Model🤗](https://huggingface.co/nvidia/DreamDojo) [Project🌐](https://dreamdojo-world.github.io/)
  - Foundation world model trained on 44k hours of egocentric human video enabling teleoperation, policy evaluation, and model-based planning for dexterous robotics at 10.81 FPS. | Task: Robotic Manipulation

* 26.02 [ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation](https://arxiv.org/abs/2602.11598) | [Paper📑](https://arxiv.org/abs/2602.11598) [Code🖥️](https://github.com/amap-cvlab/ABot-Navigation) [Project🌐](https://amap-cvlab.github.io/ABot-Navigation/ABot-N0/)
  - Unified VLA navigation model with hierarchical Brain-Action architecture achieving SOTA on 7 benchmarks across 5 navigation task types, trained on 16.9M expert trajectories. | Task: Embodied Navigation

* 26.02 [TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments](https://arxiv.org/abs/2602.02459) | [Paper📑](https://arxiv.org/abs/2602.02459) [Code🖥️](https://github.com/ucla-mobility/TIC-VLA) [Project🌐](https://ucla-mobility.github.io/TIC-VLA/)
  - Latency-aware VLA framework modeling delayed semantic reasoning during action generation via delayed semantic-control interface for real-time navigation. | Task: Embodied Navigation

* 26.02 [QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models](https://arxiv.org/abs/2602.20309) | [Paper📑](https://arxiv.org/abs/2602.20309) [Code🖥️](https://github.com/AIoT-MLSys-Lab/QuantVLA)
  - Training-free PTQ framework for VLA models combining selective quantization, attention temperature matching, and output head balancing, achieving ~70% memory savings (CVPR 2026). | Task: Robot Control

* 26.02 [FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment](https://arxiv.org/abs/2602.17259) | [Paper📑](https://arxiv.org/abs/2602.17259) [Code🖥️](https://github.com/OpenHelix-Team/frappe) [Model🤗](https://huggingface.co/collections/hhhJB/frappe)
  - Improves world-awareness in robotic policies via parallel progressive latent alignment with visual foundation models, reducing error accumulation in multi-step prediction. | Task: Robotic Manipulation

* 26.02 [TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment](https://arxiv.org/abs/2602.13579) | [Paper📑](https://arxiv.org/abs/2602.13579) [Project🌐](https://yswi.github.io/tactalign/)
  - Cross-embodiment tactile alignment using rectified flow for zero-shot transfer on contact-rich manipulation tasks including pivoting, insertion, and lid closing. | Task: Robotic Manipulation

* 26.02 [World Guidance: World Modeling in Condition Space for Action Generation](https://arxiv.org/abs/2602.22010) | [Paper📑](https://arxiv.org/abs/2602.22010) [Project🌐](https://selen-suyue.github.io/WoGNet/)
  - WoG maps predicted future observations into compact condition representations for fine-grained action generation, validated across simulation and real-world robot environments. | Task: Robot Control

* 26.02 [Green-VLA: Staged Vision-Language-Action Model for Generalist Robots](https://arxiv.org/abs/2602.00919) | [Paper📑](https://arxiv.org/abs/2602.00919) [Code🖥️](https://github.com/greenvla/GreenVLA)
  - Five-stage VLA framework for real-world robot deployment achieving generalization across embodiments via multimodal training and RL, reaching 69.5% success on ALOHA Table-Cleaning. | Task: Robotic Manipulation

* 26.02 [Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs](https://arxiv.org/abs/2602.21198) | [Paper📑](https://arxiv.org/abs/2602.21198) [Code🖥️](https://github.com/Reflective-Test-Time-Planning/Reflective-Test-Time-Planning)
  - Reflective Test-Time Planning with reflection-in-action and reflection-on-action enabling long-horizon credit assignment in robot decision-making. | Task: Embodied Reasoning

* 26.02 [RISE: Self-Improving Robot Policy with Compositional World Model](https://arxiv.org/abs/2602.11075) | [Paper📑](https://arxiv.org/abs/2602.11075)
  - Addresses VLA brittleness in contact-rich manipulation using a compositional world model to predict multi-view futures and evaluate imagined outcomes, achieving +35-45% gains across real-world tasks. | Task: Robotic Manipulation

* 26.02 [chi_0: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies](https://arxiv.org/abs/2602.09021) | [Paper📑](https://arxiv.org/abs/2602.09021) [Code🖥️](https://github.com/OpenDriveLab/KAI0) [Model🤗](https://huggingface.co/OpenDriveLab-org/Kai0)
  - Resource-efficient robotic manipulation using model arithmetic weight-space merging and stage-aware advantage estimation for dual-arm garment tasks, achieving 250% higher success than pi_0.5. | Task: Robotic Manipulation

* 26.02 [EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration](https://arxiv.org/abs/2602.10106) | [Paper📑](https://arxiv.org/abs/2602.10106)
  - Enables humanoid loco-manipulation via co-training VLA policies using abundant egocentric human demonstrations with limited robot data, achieving 51% improvement over robot-only baselines. | Task: Robotic Manipulation

* 26.02 [MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation](https://arxiv.org/abs/2602.11337) | [Paper📑](https://arxiv.org/abs/2602.11337) [Code🖥️](https://github.com/allenai/molmospaces)
  - Open ecosystem with 230k+ diverse indoor environments and 130k annotated assets supporting MuJoCo/Isaac/ManiSkill, with 8 benchmark tasks and strong sim-to-real correlation (R=0.96). | Task: Robot Simulation

* 26.02 [ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning](https://arxiv.org/abs/2602.11236) | [Paper📑](https://arxiv.org/abs/2602.11236) [Code🖥️](https://github.com/amap-cvlab/ABot-Manipulation) [Model🤗](https://huggingface.co/acvlab/ABot-M0-LIBERO)
  - Unified robotic manipulation framework standardizing 6M+ trajectories and introducing Action Manifold Learning (AML) for improved action prediction on low-dimensional manifolds. | Task: Robotic Manipulation

* 26.02 [RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models](https://arxiv.org/abs/2602.12628) | [Paper📑](https://arxiv.org/abs/2602.12628)
  - RL-based sim-real co-training for VLA policies using two-stage warm-start SFT + RL fine-tuning with auxiliary supervised loss, achieving +24% on OpenVLA and +20% on pi_0.5 in real-world success. | Task: Robotic Manipulation

* 26.02 [Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution](https://arxiv.org/abs/2602.12684) | [Paper📑](https://arxiv.org/abs/2602.12684) [Code🖥️](https://github.com/XiaomiRobotics/Xiaomi-Robotics-0)
  - Open-source VLA combining large-scale cross-embodiment pretraining with asynchronous execution for real-time deployment, achieving SOTA on simulation benchmarks with consumer-grade GPU compatibility. | Task: Robotic Manipulation

* 26.02 [GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning](https://arxiv.org/abs/2602.04315) | [Paper📑](https://arxiv.org/abs/2602.04315) [Code🖥️](https://github.com/AIGeeksGroup/GeneralVLA)
  - Hierarchical VLA for zero-shot robotic manipulation via affordance segmentation, 3D trajectory planning, and 3D-aware control policy, outperforming VoxPoser without real-world demonstrations. | Task: Robotic Manipulation

* 26.02 [RynnBrain: Open Embodied Foundation Models](https://arxiv.org/abs/2602.14979) | [Paper📑](https://arxiv.org/abs/2602.14979) [Code🖥️](https://github.com/alibaba-damo-academy/RynnBrain) [Model🤗](https://huggingface.co/Alibaba-DAMO-Academy/RynnBrain-8B) [Dataset🤗](https://huggingface.co/datasets/Alibaba-DAMO-Academy/RynnBrain-Bench)
  - Open-source spatiotemporal foundation model unifying perception, reasoning, and planning for embodied intelligence, outperforming existing models on 20 embodied benchmarks. | Task: Embodied Reasoning

* 26.02 [Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation](https://arxiv.org/abs/2602.16705) | [Paper📑](https://arxiv.org/abs/2602.16705)
  - HERO enables humanoid robots to manipulate diverse real-world objects using inverse kinematics and neural forward model, reducing end-effector tracking error 3.2x for reliable surface manipulation. | Task: Robotic Manipulation

* 26.02 [World Action Models are Zero-shot Policies](https://arxiv.org/abs/2602.15922) | [Paper📑](https://arxiv.org/abs/2602.15922) [Code🖥️](https://github.com/dreamzero0/dreamzero)
  - DreamZero uses video diffusion as a World Action Model (WAM) for robot skill generalization with 2x improvement in novel environments at real-time 7Hz closed-loop control. | Task: Robotic Manipulation

* 26.02 [Learning Native Continuation for Action Chunking Flow Policies](https://arxiv.org/abs/2602.12978) | [Paper📑](https://arxiv.org/abs/2602.12978)
  - Legato, a training-time continuation method for action-chunked VLA policies producing smoother trajectories, achieving ~10% improvements in smoothness and task completion across 5 manipulation tasks. | Task: Robotic Manipulation

* 26.02 [BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models](https://arxiv.org/abs/2602.08392) | [Paper📑](https://arxiv.org/abs/2602.08392) [Code🖥️](https://github.com/bimanibench/BiManiBench)
  - Evaluates 30+ MLLMs on bimanual robotic tasks across spatial reasoning, action planning, and end-effector control tiers, finding persistent failures in dual-arm spatial grounding. | Task: Robotic Manipulation

* 26.03 [RoboPocket: Improve Robot Policies Instantly with Your Phone](https://arxiv.org/abs/2603.05504) | [Paper📑](https://arxiv.org/abs/2603.05504) [Project🌐](https://robo-pocket.github.io/)
  - Portable system enabling robot-free instant policy iteration via smartphone AR visualization of predicted trajectories, doubling data efficiency with asynchronous online finetuning. | Task: Robotic Manipulation

* 26.03 [UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data](https://arxiv.org/abs/2603.05312) | [Paper📑](https://arxiv.org/abs/2603.05312) [Project🌐](https://yangsizhe.github.io/ultradexgrasp/)
  - Framework for universal dexterous grasping with UltraDexGrasp-20M dataset (20M frames across 1,000 objects), achieving robust zero-shot sim-to-real transfer for multi-strategy bimanual grasping. | Task: Robotic Manipulation

* 26.03 [EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding](https://arxiv.org/abs/2603.04254) | [Paper📑](https://arxiv.org/abs/2603.04254) [Project🌐](https://0nandon.github.io/EmbodiedSplat/)
  - Online feed-forward 3D Gaussian Splatting framework enabling real-time semantic scene understanding from streaming images at 5-6 FPS without per-scene optimization. | Task: Embodied Scene Understanding

* 26.03 [Lightweight Visual Reasoning for Socially-Aware Robots](https://arxiv.org/abs/2603.03942) | [Paper📑](https://arxiv.org/abs/2603.03942)
  - Lightweight language-to-vision feedback module improving VLM reasoning for robotic tasks (navigation, scene description, human-intention recognition) with less than 3% extra parameters. | Task: Embodied Reasoning

* 26.01 [ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models](https://arxiv.org/abs/2601.11404) | [Paper📑](https://arxiv.org/abs/2601.11404)
  - Action Chain-of-Thought paradigm for VLA models with Explicit and Implicit Action Reasoner components, achieving 98.5% on LIBERO. | Task: Robotic Manipulation

* 26.01 [Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning](https://arxiv.org/abs/2601.16163) | [Paper📑](https://arxiv.org/abs/2601.16163) [Model🤗](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B)
  - Adapts pretrained video models into robot policies through single-stage post-training, achieving 98.5% on LIBERO and SOTA on real-world bimanual manipulation. | Task: Robot Control

* 26.01 [DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation](https://arxiv.org/abs/2601.22153) | [Paper📑](https://arxiv.org/abs/2601.22153) [Code🖥️](https://github.com/hzxie/DynamicVLA) [Dataset🤗](https://huggingface.co/datasets/hzxie/DOM)
  - Compact 0.4B VLA model for dynamic object manipulation with continuous inference and latent-aware action streaming. | Task: Robotic Manipulation

* 26.01 [SOP: A Scalable Online Post-Training System for Vision-Language-Action Models](https://arxiv.org/abs/2601.03044) | [Paper📑](https://arxiv.org/abs/2601.03044) [Project🌐](https://agibot.com/research/sop_en)
  - Scalable online distributed post-training system for VLA models enabling real-world robot policy adaptation through fleet learning. | Task: Robot Control

* 26.01 [FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation](https://arxiv.org/abs/2601.13976) | [Paper📑](https://arxiv.org/abs/2601.13976) [Code🖥️](https://github.com/Fantasy-AMAP/fantasy-vln) [Model🤗](https://huggingface.co/acvlab/FantasyVLN)
  - Implicit reasoning framework for vision-language navigation encoding imagined visual tokens in latent space, reducing inference latency by an order of magnitude. | Task: Vision-Language Navigation

* 26.01 [RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation](https://arxiv.org/abs/2601.05241) | [Paper📑](https://arxiv.org/abs/2601.05241) [Code🖥️](https://github.com/RoboVIP/RoboVIP_VDM)
  - Visual identity prompting for multi-view video generation to augment robot manipulation data. | Task: Robotic Manipulation

* 26.01 [VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory](https://arxiv.org/abs/2601.08665) | [Paper📑](https://arxiv.org/abs/2601.08665)
  - Embodied navigation agent with adaptive reasoning combining visual perception and linguistic memory. | Task: Embodied Navigation

* 25.12 [DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action](https://arxiv.org/abs/2511.22134) | [Paper📑](https://arxiv.org/abs/2511.22134)
  - Decouples reasoning and action for more generalizable embodied agents. | Task: Robotic Manipulation
* 25.12 [HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for VLA Models](https://arxiv.org/abs/2512.09928) | [Paper📑](https://arxiv.org/abs/2512.09928)
  - Enriches VLA models with hindsight, insight, and foresight via motion representations. | Task: Robotic Manipulation
* 25.12 [LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator](https://arxiv.org/abs/2512.10605) | [Paper📑](https://arxiv.org/abs/2512.10605)
  - General-purpose language-driven robotic agent for embodied task execution. | Task: Robotic Manipulation
* 25.12 [Steering VLA Models as Anti-Exploration: A Test-Time Scaling Approach](https://arxiv.org/abs/2512.02834) | [Paper📑](https://arxiv.org/abs/2512.02834)
  - Test-time scaling approach for steering VLA models for safe embodied behavior. | Task: Robot Control
* 25.11 [WMPO: World Model-based Policy Optimization for Vision-Language-Action Models](https://arxiv.org/abs/2511.09515) | [Paper📑](https://arxiv.org/abs/2511.09515)
  - World model-based policy optimization for VLA models in robotics. | Task: Robot Control
* 25.11 [RynnVLA-002: A Unified Vision-Language-Action and World Model](https://arxiv.org/abs/2511.17502) | [Paper📑](https://arxiv.org/abs/2511.17502)
  - Unified VLA and world model for robotic manipulation. | Task: Robot Control
* 25.11 [Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight](https://arxiv.org/abs/2511.16175) | [Paper📑](https://arxiv.org/abs/2511.16175)
  - VLA model with disentangled visual foresight for robotic control. | Task: Robot Control
* 25.11 [MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots](https://arxiv.org/abs/2511.17889) | [Paper📑](https://arxiv.org/abs/2511.17889)
  - Reinforcement-based VLA model for mobile robot tasks. | Task: Robot Control
* 25.10 [VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards](https://arxiv.org/abs/2510.00406) | [Paper📑](https://arxiv.org/abs/2510.00406)
  - Fine-tuning VLA models using RL with verified rewards in world simulators. | Task: Robot Control
* 25.10 [InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy](https://arxiv.org/abs/2510.13778) | [Paper📑](https://arxiv.org/abs/2510.13778)
  - VLA framework for robotic control with spatial grounding. | Task: Robot Control
* 25.10 [X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model](https://arxiv.org/abs/2510.10274) | [Paper📑](https://arxiv.org/abs/2510.10274)
  - Cross-embodiment VLA model for scalable robot learning. | Task: Robot Control
* 25.10 [GigaBrain-0: A World Model-Powered Vision-Language-Action Model](https://arxiv.org/abs/2510.19430) | [Paper📑](https://arxiv.org/abs/2510.19430)
  - VLA model integrating world models for robot reasoning. | Task: Robot Control
* 25.09 [Robix: A Unified Model for Robot Interaction, Reasoning and Planning](https://arxiv.org/abs/2509.01106) | [Paper📑](https://arxiv.org/abs/2509.01106)
  - Unified robotics model combining visual reasoning with interaction and planning. | Task: Robot Control
* 25.09 [FLOWER: Democratizing Generalist Robot Policies with Efficient VLA Flow Policies](https://arxiv.org/abs/2509.04996) | [Paper📑](https://arxiv.org/abs/2509.04996)
  - Vision-language-action model for generalist robot policies. | Task: Robot Control
* 25.08 [RynnEC: Bringing MLLMs into Embodied World](https://arxiv.org/abs/2508.14160) | [Paper📑](https://arxiv.org/abs/2508.14160)
  - Integrates multimodal LLMs into embodied AI settings for physical-world reasoning. | Task: Embodied Reasoning
* 25.08 [Do What? Teaching Vision-Language-Action Models to Reject the Impossible](https://arxiv.org/abs/2508.16292) | [Paper📑](https://arxiv.org/abs/2508.16292)
  - Trains VLA models to reason about task feasibility and reject impossible instructions. | Task: Robot Control
* 25.08 [Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in VLA Policies](https://arxiv.org/abs/2508.20072) | [Paper📑](https://arxiv.org/abs/2508.20072)
  - Uses discrete diffusion for action decoding in vision-language-action robotic policies. | Task: Robot Control

* 23.07 [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/) | [Paper📑](https://arxiv.org/pdf/2307.15818) [Project🌐](https://robotics-transformer2.github.io)
  - Co-finetunes a VLM on web and robot data, establishing the VLA paradigm by transferring internet-scale knowledge to robot control. | Task: General Robotic Manipulation

* 24.05 [Octo: An Open-Source Generalist Robot Policy](https://arxiv.org/abs/2405.12213) | [Paper📑](https://arxiv.org/pdf/2405.12213) [Code🖥️](https://github.com/octo-models/octo) [Project🌐](https://octo-models.github.io/) [Model🤗](https://huggingface.co/rail-berkeley/octo-base-1.5)
  - An open-source, generalist transformer policy pretrained on the large-scale Open X-Embodiment dataset, designed for efficient fine-tuning to new robots and tasks. | Task: Robotics

* 24.06 [OpenVLA: An Open-Source Vision-Language-Action Model](https://arxiv.org/abs/2406.09246) | [Paper📑](https://arxiv.org/pdf/2406.09246) [Code🖥️](https://github.com/openvla/openvla) [Project🌐](https://openvla.github.io/) [Model🤗](https://huggingface.co/openvla)
  - A 7B-parameter open-source VLA model trained on the Open X-Embodiment dataset, achieving state-of-the-art performance for generalist manipulation. | Task: VLA

* 24.10 [π₀: A Vision-Language-Action Flow Model for General Robot Control](https://www.physicalintelligence.company/blog/pi0) | [Paper📑](https://arxiv.org/abs/2410.24164) [Code🖥️](https://github.com/Physical-Intelligence/openpi)
  - A generalist policy using a novel flow matching architecture atop a pretrained VLM, enabling zero-shot generalization for dexterous manipulation. | Task: Robot Control

* 25.01 [FAST: Efficient Action Tokenization for Vision-Language-Action Models](https://www.physicalintelligence.company/research/fast) | [Paper📑](https://arxiv.org/pdf/2501.09747) [Code🖥️](https://github.com/Physical-Intelligence/openpi)
  - A compression-based action tokenization scheme that accelerates autoregressive VLA training by 5x with performance comparable to diffusion models. | Task: Robot Control

* 25.02 [Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models](https://www.pi.website/research/hirobot) | [Paper📑](https://arxiv.org/pdf/2502.19417)
  - A hierarchical VLA model with a high-level VLM for reasoning and a low-level VLA for execution, enabling complex, open-ended instruction following. | Task: Robot Control

* 25.03 [Gemini Robotics: Bringing AI into the Physical World](https://arxiv.org/abs/2503.20020) | [Paper📑](https://arxiv.org/pdf/2503.20020) [Code🖥️](https://github.com/embodiedreasoning/ERQA) [Project🌐](https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/) [Dataset🤗](https://github.com/embodiedreasoning/ERQA)
  - A VLA model built on the Gemini foundation model, demonstrating significant improvements in generality, interactivity, and dexterity for complex tasks. | Task: Advanced & Dexterous Manipulation

* 25.03 [COT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models](https://arxiv.org/abs/2503.22020) | [Paper📑](https://arxiv.org/pdf/2503.22020) [Project🌐](https://cot-vla.github.io/)
  - A method that incorporates explicit visual CoT reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. | Task: Robotics

* 25.03 [GR00T: A Foundation Model for General-Purpose Robotics](https://arxiv.org/abs/2503.14734) | [Paper📑](https://arxiv.org/pdf/2503.14734) [Code🖥️](https://github.com/NVIDIA/Isaac-GR00T) [Model🤗](https://huggingface.co/nvidia/GR00T-N1.5-3B) [Dataset🤗](https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim)
  - A general-purpose foundation model for robot learning that takes multimodal instructions and past observations to generate actions for the robot to execute. | Task: Robotics

* 25.04 [π0.5: a Vision-Language-Action Model with Open-World Generalization](https://www.pi.website/blog/pi05) | [Paper📑](https://www.physicalintelligence.company/download/pi05.pdf)
  - An evolution of π₀ that uses co-training on diverse tasks to achieve long-horizon, dexterous manipulation in novel, unseen environments. | Task: Robot Control


* 25.06 [Chain-of-Action: Faithful and Deterministic Robot Policy via Language-guided State-Action Augmentation](https://chain-of-action.github.io/) | [Paper📑](https://arxiv.org/pdf/2506.09990) [Code🖥️](https://github.com/ByteDance-Seed/Chain-of-Action) [Project🌐](https://chain-of-action.github.io/) [Model🤗](https://huggingface.co/Solomonz/Chain-of-Action)
  - A novel robot policy, Chain-of-Action (CoA), that uses language as an intermediate representation to explicitly reason about the chain of actions for a given task, while being fully deterministic during inference. | Task: Robotics

* 25.07 [Vision-Language-Action Instruction Tuning: From Understanding to Manipulation](https://yangs03.github.io/InstructVLA_Home/) | [Paper📑](https://arxiv.org/pdf/2507.17520) [Code🖥️](https://github.com/InternRobotics/InstructVLA) [Project🌐](https://yangs03.github.io/InstructVLA_Home/) [Model🤗](https://huggingface.co/datasets/ShuaiYang03/VLA_Instruction_Tuning)
  - An end-to-end VLA model, InstructVLA, that introduces a novel training paradigm called Vision-Language-Action Instruction Tuning (VLA-IT) to preserve the flexible reasoning of VLMs while delivering high-performance robotic manipulation. | Task: Robotic Manipulation

* 25.07 [MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis](https://manipulate-in-dream.github.io/) | [Paper📑](https://www.arxiv.org/pdf/2506.18897) [Code🖥️](https://github.com/manipulate-in-dream/MinD) [Project🌐](https://manipulate-in-dream.github.io/)
  - A dual-system world model, MinD, that enables real-time, risk-aware planning by conditioning a high-frequency action policy on single-step latent predictions from a low-frequency video generation model. | Task: Robotic Manipulation
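Several of the generalist policies above (π₀, π0.5, FLOWER) decode continuous action chunks with flow matching rather than autoregressive action tokens. As orientation only, here is a minimal numpy sketch of the conditional flow-matching objective those policies train on; the `toy_policy` network, its dimensions, and all parameter names are illustrative assumptions, not taken from any listed paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_policy(params, obs, x_t, t):
    """Stand-in for a VLA action head: one hidden ReLU layer (hypothetical)."""
    h = np.maximum(np.concatenate([obs, x_t, t], axis=-1) @ params["W1"], 0.0)
    return h @ params["W2"]

def flow_matching_loss(params, obs, actions):
    """Conditional flow matching: regress the velocity field that transports
    Gaussian noise to the expert action chunk along a straight-line path."""
    noise = rng.standard_normal(actions.shape)
    t = rng.random((actions.shape[0], 1))    # time in [0, 1]
    x_t = (1 - t) * noise + t * actions      # linear probability path
    target_v = actions - noise               # its constant velocity
    pred_v = toy_policy(params, obs, x_t, t)
    return np.mean((pred_v - target_v) ** 2)

obs_dim, act_dim = 8, 4                      # toy sizes, not from the papers
params = {
    "W1": rng.standard_normal((obs_dim + act_dim + 1, 32)) * 0.1,
    "W2": rng.standard_normal((32, act_dim)) * 0.1,
}
loss = flow_matching_loss(params, rng.standard_normal((16, obs_dim)),
                          rng.standard_normal((16, act_dim)))
```

At inference, such policies integrate the learned velocity field from noise to an action chunk in a few steps, which is what enables fast closed-loop control.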


### Others

#### Image MLLM
* 26.03 [UniG2U-Bench: Towards Comprehensive Benchmarking of Unified Generation and Understanding MLLMs](https://arxiv.org/abs/2603.03241) | [Paper📑](https://arxiv.org/abs/2603.03241)
  - Comprehensive benchmark evaluating MLLMs on both generation and understanding capabilities in a unified framework. | Task: Multimodal Evaluation
* 26.03 [Towards Multimodal Lifelong Understanding](https://arxiv.org/abs/2603.05484) | [Paper📑](https://arxiv.org/abs/2603.05484)
  - MM-Lifelong dataset and ReMA agent for multimodal lifelong understanding, enabling continual learning across modalities. | Task: Lifelong Learning
* 26.02 [VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?](https://arxiv.org/abs/2602.04802) | [Paper📑](https://arxiv.org/abs/2602.04802)
  - Benchmark testing whether VLMs truly understand text rendered visually in images as well as plain text, revealing a significant comprehension gap. | Task: Reasoning
* 26.02 [From Perception to Action: An Interactive Benchmark for Vision Reasoning](https://arxiv.org/abs/2602.21015) | [Paper📑](https://arxiv.org/abs/2602.21015) [Code🖥️](https://github.com/Social-AI-Studio/CHAIN)
  - CHAIN 3D physics-driven interactive benchmark evaluating whether VLMs understand causal constraints and execute structured action sequences in mechanical puzzles. | Task: Reasoning
* 26.02 [SAM 3D Body: Robust Full-Body Human Mesh Recovery](https://arxiv.org/abs/2602.15989) | [Paper📑](https://arxiv.org/abs/2602.15989) [Code🖥️](https://github.com/facebookresearch/sam-3d-body)
  - Promptable model for single-image 3D human mesh recovery using the Momentum Human Rig (MHR) parametric representation, supporting 2D keypoint/mask prompts with strong generalization. | Task: 3D Human Pose Estimation
* 26.01 [CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation](https://arxiv.org/abs/2601.10061) | [Paper📑](https://arxiv.org/abs/2601.10061)
  - Uses video generation models as visual reasoners for text-to-image generation, showing temporal modeling transfers to improved spatial reasoning. | Task: Image Generation
* 26.01 [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://arxiv.org/abs/2601.21639) | [Paper📑](https://arxiv.org/abs/2601.21639)
  - Holistic OCR framework within end-to-end vision-language models for comprehensive text understanding in images. | Task: OCR & Document Understanding
* 25.12 [GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation](https://arxiv.org/abs/2512.17495) | [Paper📑](https://arxiv.org/abs/2512.17495)
  - Exposes and evaluates visual grounding gaps in MLLMs across multiple dimensions. | Task: Visual Grounding
* 25.11 [Monet: Reasoning in Latent Visual Space Beyond Images and Language](https://arxiv.org/abs/2511.21395) | [Paper📑](https://arxiv.org/abs/2511.21395)
  - Enables vision-language reasoning in latent visual space, going beyond standard image-text paradigms. | Task: Reasoning
* 25.10 [SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs](https://arxiv.org/abs/2510.25092) | [Paper📑](https://arxiv.org/abs/2510.25092)
  - Enables multimodal reasoning in text-only LLMs through agentic information flow. | Task: Reasoning
* 25.04 [InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners](https://arxiv.org/pdf/2504.14239) | [Paper📑](https://arxiv.org/pdf/2504.14239) [Code🖥️](https://github.com/Reallm-Labs/InfiGUI-R1)
  - An MLLM-based GUI agent designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. | Task: UI
* 25.04 [GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents](https://arxiv.org/pdf/2504.10458) | [Paper📑](https://arxiv.org/pdf/2504.10458)
  - Enhances GUI agents through RL with unified action space modeling, achieving superior cross-platform performance using only 0.02% of the data required by previous methods. | Task: UI
* 25.03 [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://arxiv.org/pdf/2503.21620) | [Paper📑](https://arxiv.org/pdf/2503.21620)
  - Introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms like GRPO. | Task: UI 
* 25.03 [VLM-R1: A stable and generalizable R1-style Large Vision-Language Model](https://github.com/om-ai-lab/VLM-R1/tree/main?tab=readme-ov-file) | [Code🖥️](https://github.com/om-ai-lab/VLM-R1/tree/main?tab=readme-ov-file) [Dataset🤗](https://huggingface.co/datasets/omlab/VLM-R1) [Model🤗](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps/tree/main)
  - A reproduced R1-style VLM. | Task: Referring Expression Comprehension
* 25.02 [MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning](https://arxiv.org/pdf/2502.19634) | [Paper📑](https://arxiv.org/pdf/2502.19634)
  - An MLLM trained with GRPO for medical image VQA. | Task: Medical Image VQA
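Several R1-style entries above (UI-R1, GUI-R1, MedVLM-R1) optimize with GRPO on rule-based, verifiable rewards. As a rough illustration, the group-relative advantage at the core of GRPO can be sketched as follows; the function name and the 0/1 reward example are illustrative assumptions, not from any listed paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    verifiable reward against the mean/std of its own rollout group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:            # all rewards equal -> no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Example: 4 rollouts for one prompt, rule-based 0/1 correctness rewards
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1, -1, -1, 1]
```

These advantages then weight a clipped policy-gradient update, so no separate value model is needed, which is what makes rule-based rewards (format checks, exact-match answers) practical for MLLM fine-tuning.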
#### Video MLLM
* 25.03 [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning](https://arxiv.org/abs/2503.05379) | [Paper📑](https://arxiv.org/abs/2503.05379) [Code🖥️](https://github.com/HumanMLLM/R1-Omni) [Model🤗](https://huggingface.co/StarJiaxing/R1-Omni-0.5B/tree/main)
  - Improves reasoning capability, emotion recognition accuracy, and generalization ability with RLVR. | Task: Emotion Recognition

#### Audio MLLM
* 26.01 [The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization](https://arxiv.org/abs/2601.03227) | [Paper📑](https://arxiv.org/abs/2601.03227)
  - Benchmark for audio-language models on spatial audio geo-localization reasoning tasks. | Task: Audio Reasoning
* 25.02 [ADIFF: Explaining audio difference using natural language](https://arxiv.org/abs/2502.04476)  [Code🖥️](https://github.com/soham97/ADIFF)  [Model](https://zenodo.org/records/14706090)
* 24.09 [What Are They Doing? Joint Audio-Speech Co-Reasoning](https://arxiv.org/abs/2409.14526)
* 24.09 [Chain-of-Thought Prompting for Speech Translation](https://arxiv.org/abs/2409.11538)

#### Omni MLLM
* 26.01 [FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs](https://arxiv.org/abs/2601.13836) | [Paper📑](https://arxiv.org/abs/2601.13836)
  - Benchmark evaluating multimodal LLMs' ability to forecast future events from omni-modal context including temporal reasoning. | Task: Omni Reasoning
* 25.05 [AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding](https://arxiv.org/abs/2505.20862)
* 25.03 [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning](https://arxiv.org/abs/2503.05379)
* 23.11 [X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-Modal Reasoning](https://arxiv.org/abs/2311.18799)


<a name="benchmarks"></a>
## Benchmarks 📊

| Date  | Project                                                      | Task                                          | Links                                                        |
| ----- | ------------------------------------------------------------ | --------------------------------------------- | ------------------------------------------------------------ |
| 26.03 | MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning | Multi-image Reasoning | [[📑 Paper]](https://arxiv.org/abs/2603.02024) [[🌐 Project]](https://mmr-life.github.io/) |
| 26.03 | RIVER: A Benchmark for Real-World Video Reasoning in Long and Short Contexts | Video Temporal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2603.03876) |
| 26.03 | UniG2U-Bench: Comprehensive Benchmark for Unified Generation and Understanding MLLMs | Multimodal Evaluation | [[📑 Paper]](https://arxiv.org/abs/2603.03241) |
| 26.03 | AgentVista: Generalizable Multi-Task Agent with Diverse Visual Manipulation | Multi-Task Agent | [[📑 Paper]](https://arxiv.org/abs/2603.04016) |
| 26.02 | A Very Big Video Reasoning Suite (VBVR): 1M+ video clips across 200 reasoning tasks | Video Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.20159) [[🤗 Model]](https://huggingface.co/Video-Reason/VBVR-Wan2.2) [[🤗 Data]](https://huggingface.co/datasets/Video-Reason/VBVR-Dataset) |
| 26.02 | OmniGAIA: Omni-Modal AI Agent Benchmark with hindsight-guided exploration | Omni-Modal Agent Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.22897) [[💻 Code]](https://github.com/RUC-NLPIR/OmniGAIA) [[🤗 Data]](https://huggingface.co/datasets/RUC-NLPIR/OmniGAIA) |
| 26.02 | SpatiaLab: Wild Spatial Reasoning benchmark across 6 VQA categories | Spatial Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.03916) [[💻 Code]](https://github.com/SpatiaLab-Reasoning/SpatiaLab) [[🤗 Data]](https://huggingface.co/datasets/ciol-research/SpatiaLab) |
| 26.02 | MuRGAt: Multimodal Fact-Level Attribution benchmark for verifiable reasoning | Multimodal Attribution | [[📑 Paper]](https://arxiv.org/abs/2602.11509) [[💻 Code]](https://github.com/meetdavidwan/murgat) |
| 26.02 | DeepVision-103K: Verifiable multimodal math dataset for RLVR training | Math Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.16742) [[💻 Code]](https://github.com/SKYLENAGE-AI/DeepVision-103K) [[🤗 Data]](https://huggingface.co/datasets/skylenage/DeepVision-103K) |
| 26.02 | UniVBench: Unified evaluation for video foundation models across understanding, generation, editing | Video Foundation Model Evaluation | [[📑 Paper]](https://arxiv.org/abs/2602.21835) [[💻 Code]](https://github.com/JianhuiWei7/UniVBench) |
| 26.02 | RISE-Video: Benchmark for video generators decoding implicit world rules | Video Generation Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.05986) [[💻 Code]](https://github.com/VisionXLab/Rise-Video) [[🤗 Data]](https://huggingface.co/datasets/VisionXLab/RISE-Video) |
| 26.02 | SAW-Bench: Egocentric Situated Awareness evaluation with 786 smart-glass videos and 2,071+ QA pairs | Spatial Reasoning | [[📑 Paper]](https://arxiv.org/abs/2602.16682) |
| 26.02 | BrowseComp-V3: 300-question visual benchmark for complex multi-hop multimodal web search | Multimodal Browsing | [[📑 Paper]](https://arxiv.org/abs/2602.12876) |
| 26.02 | BiManiBench: Hierarchical benchmark for bimanual coordination evaluation in MLLMs | Bimanual Robotics | [[📑 Paper]](https://arxiv.org/abs/2602.08392) [[💻 Code]](https://github.com/bimanibench/BiManiBench) |
| 26.01 | MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods | Multimodal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.21821) [[🤗 Model]](https://huggingface.co/OpenDataArena/MMFineReason-8B) [[🤗 Data]](https://huggingface.co/datasets/OpenDataArena/MMFineReason-1.8M-Qwen3-VL-235B-Thinking) |
| 26.01 | ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis | Chart Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.13606) [[💻 Code]](https://github.com/starriver030515/ChartVerse) [[🤗 Model]](https://huggingface.co/opendatalab/ChartVerse-8B) [[🤗 Data]](https://huggingface.co/datasets/opendatalab/ChartVerse-SFT-1.8M) |
| 26.01 | VideoLoom: Joint Spatial-Temporal Understanding with LoomBench | Spatial-Temporal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.07290) [[💻 Code]](https://github.com/JPShi/VideoLoom) [[🤗 Model]](https://huggingface.co/JPShi/VideoLoom-8B) |
| 26.01 | PROGRESSLM: Towards Progress Reasoning in Vision-Language Models | Task Progress Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.15224) [[💻 Code]](https://github.com/ProgressLM/ProgressLM) [[🤗 Data]](https://huggingface.co/datasets/Raymond-Qiancx/ProgressLM-Dataset) |
| 26.01 | FutureOmni: Evaluating Future Forecasting from Omni-Modal Context | Omni-Modal Temporal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.13836) |
| 26.01 | Afri-MCQA: Multimodal Cultural Question Answering for African Languages | Multilingual Multimodal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.05699) |
| 26.01 | AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark | Cultural Multimodal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2601.17645) |
| 25.12 | HERBench: Multi-Evidence Integration in Video Question Answering | Video Reasoning | [[📑 Paper]](https://arxiv.org/abs/2512.14870) |
| 25.12 | SVBench: Evaluation of Video Generation Models on Social Reasoning | Video Social Reasoning | [[📑 Paper]](https://arxiv.org/abs/2512.21507) |
| 25.12 | IF-Bench: Benchmarking MLLMs for Infrared Images | Infrared Image Understanding | [[📑 Paper]](https://arxiv.org/abs/2512.09663) |
| 25.12 | VABench: Comprehensive Benchmark for Audio-Video Generation | Audio-Video Generation | [[📑 Paper]](https://arxiv.org/abs/2512.09299) |
| 25.11 | MME-CC: Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity | Cognitive Capacity | [[📑 Paper]](https://arxiv.org/abs/2511.03146) |
| 25.11 | GGBench: Geometric Generative Reasoning Benchmark for Unified Multimodal Models | Geometric Reasoning | [[📑 Paper]](https://arxiv.org/abs/2511.11134) |
| 25.11 | WEAVE: Benchmarking In-context Interleaved Comprehension and Generation | Multimodal Comprehension & Generation | [[📑 Paper]](https://arxiv.org/abs/2511.11434) |
| 25.10 | Uni-MMMU: Massive Multi-discipline Multimodal Unified Benchmark | Multimodal Multi-discipline Reasoning | [[📑 Paper]](https://arxiv.org/abs/2510.13759) |
| 25.10 | PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs | Physical Tool Understanding | [[📑 Paper]](https://arxiv.org/abs/2510.09507) |
| 25.10 | BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Capabilities | Embodied AI Capabilities | [[📑 Paper]](https://arxiv.org/abs/2510.08759) |
| 25.10 | OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Long-context Video-Audio Understanding & Reasoning | [[📑 Paper]](https://arxiv.org/abs/2510.10689v1)  [[💻 Code]](https://github.com/NJU-LINK/OmniVideoBench) [[🌐 Project]](https://omnivideobench.github.io/omnivideobench_home/) [[🤗 Data]](https://huggingface.co/datasets/NJU-LINK/OmniVideoBench)|
| 25.10 | XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models | Capability Balancing among Different Modalities | [[📑 Paper]](https://arxiv.org/abs/2510.15148)  [[💻 Code]](https://github.com/XingruiWang/XModBench) [[🌐 Project]](https://xingruiwang.github.io/projects/XModBench/) |
| 25.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | Temporal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2510.25332) |
| 25.10 | Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark | Common Sense Omni Reasoning | [[📑 Paper]](https://dl.acm.org/doi/10.1145/3746027.3758261) |
| 25.09 | MARS2 2025 Challenge on Multimodal Reasoning | Multimodal Reasoning Challenge | [[📑 Paper]](https://arxiv.org/abs/2509.14142) |
| 25.09 | Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images | Table Reasoning | [[📑 Paper]](https://arxiv.org/abs/2509.07966) |
| 25.09 | AHELM: A Holistic Evaluation of Audio-Language Models | Audio-Language Understanding | [[📑 Paper]](https://arxiv.org/abs/2508.21376) |
| 25.09 | MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark | Complex, Multi-scene, & Dynamically Evolving Speech & Audio Reasoning | [[📑 Paper]](https://arxiv.org/abs/2509.22461)  [[💻 Code]](https://github.com/luckyerr/MDAR) |
| 25.09 | MiMo-Audio-Eval Toolkit | Speech/Sound/Music Reasoning | [[💻 Code]](https://github.com/XiaomiMiMo/MiMo-Audio-Eval?tab=readme-ov-file) |
| 25.08 | SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models | Speech  Reasoning | [[📑 Paper]](https://www.arxiv.org/abs/2508.02018)  [[💻 Code]](https://github.com/Yanda95/SpeechR) [[Data]](https://drive.google.com/file/d/1BH2r2idILwUHX0NKsXz6GsSXdO0qWly8/view) |
| 25.08 | MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence | Long-form, Spatial, and Multi-audio Reasoning on Speech/Music/Sound | [[📑 Paper]](https://arxiv.org/abs/2508.13992)  [[🤗 Data]](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)  |
| 25.08 | R²-AVSBench: Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation | Segmentation Reasoning | [[📑 Paper]](https://arxiv.org/abs/2508.04418) [[🤗 Data]](https://drive.google.com/drive/folders/1Qz7MxBs7IpxgcTH8CaUsU3i9d366gRhM)|
| 25.07 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Video Reasoning and Understanding | [[📑 Paper]](https://arxiv.org/abs/2507.15028) [[🌐 Project]](https://zhangyuanhan-ai.github.io/video-tt/) [[🤗 Data]](https://huggingface.co/datasets/lmms-lab/video-tt) |
| 25.06 | FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation | Financial Multi-Modal Reasoning | [[📑 Paper]](https://github.com/luo-junyu/FinMME) [[💻 Code]](https://github.com/luo-junyu/FinMME) [[🤗 Data]](https://huggingface.co/datasets/luojunyu/FinMME) |
| 25.06 | MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos | Video Reasoning | [[📑 Paper]](https://arxiv.org/abs/2506.04141) [[💻 Code]](https://github.com/GaryStack/MMR-V) [[🌐 Project]](https://mmr-v.github.io/home_page.html) [[🤗 Data]](https://huggingface.co/datasets/JokerJan/MMR-VBench) |
| 25.06 | OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Spatial Reasoning | [[📑 Paper]](https://arxiv.org/abs/2506.03135) [[💻 Code]](https://github.com/qizekun/OmniSpatial) [[🌐 Project]](https://qizekun.github.io/omnispatial/) [[🤗 Data]](https://huggingface.co/qizekun/datasets/OmniSpatial) |
| 25.06 | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark | Phonatics, Prosody, Rhetoric, Syntactics, Semantics, and Paralinguistics in Speech Understanding & Reasoning | [[📑 Paper]](https://arxiv.org/abs/2506.04779) [[💻 Code]](https://github.com/dingdongwang/mmsu_bench) [[🤗 Data]](https://huggingface.co/datasets/ddwang2000/MMSU) |
| 25.05 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Video&Audio Reasoning | [[📑 Paper]](https://arxiv.org/abs/2505.17862) [[💻 Code]](https://github.com/Lliar-liar/Daily-Omni)  [[🌐 Project]](https://lliar-liar.github.io/Daily-Omni/) [[🤗 Data]](https://huggingface.co/datasets/liarliar/Daily-Omni) |
| 25.05 | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix | Multi-step Audio Reasoning | [[📑 Paper]](https://arxiv.org/abs/2505.13032) [[💻 Code]](https://github.com/ddlBoJack/MMAR) [[🎥 Demo]](https://youtube.com/watch?v=Dab13opIGqU) [[🤗 Data]](https://huggingface.co/datasets/BoJack/MMAR) |
| 25.05 | On Path to Multimodal Generalist: General-Level and General-Bench | Multimodal Generation                         | [[🌐 Project]](https://generalist.top/) [[📑 Paper]](https://arxiv.org/abs/2505.04620) [[🤗 Data]](https://huggingface.co/General-Level) |
| 25.04 | VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | Visual Reasoning                              | [[🌐 Project]](https://visulogic-benchmark.github.io/VisuLogic/) [[📑 Paper]](http://arxiv.org/abs/2504.15279) [[💻 Code]](https://github.com/VisuLogic-Benchmark/VisuLogic-Eval) [[🤗 Data]](https://huggingface.co/datasets/VisuLogic/VisuLogic) |
| 25.04 | IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Image-Grounded Video Perception and Reasoning | [[📑 Paper]](https://arxiv.org/pdf/2504.15415) [[💻 Code]](https://github.com/multimodal-art-projection/IV-Bench) |
| 25.04 | Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Reasoning-Informed viSual Editing             | [[📑 Paper]](https://arxiv.org/abs/2504.02826) [[💻 Code]](https://github.com/PhoenixZ810/RISEBench) |
| 25.04 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following | Music Information Retrieval & Knowledge       | [[📑 Paper]](https://arxiv.org/abs/2506.12285)  [[💻 Code]](https://github.com/nicolaus625/CMI-bench)    |
| 25.03 | MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX | Common Sense Omni Reasoning | [[📑 Paper]](https://arxiv.org/abs/2503.21699) [[🌐 Project]](https://maverix-benchmark.github.io/)| 
| 25.03 | V-STaR : Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Spatio-temporal Reasoning                     | [[🌐 Project]](https://v-star-bench.github.io/) [[📑 Paper]](https://arxiv.org/abs/2311.17982) [[🤗 Data]](https://huggingface.co/datasets/V-STaR-Bench/V-STaR) |
| 25.03 | MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs | 3D Spatial Understanding | [[📑 Paper]](https://arxiv.org/pdf/2503.13111) |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning | 3D-CoT                                        | [[📑 Paper]](https://arxiv.org/pdf/2503.06232) [[🤗 Data]](https://huggingface.co/datasets/Battam/3D-CoT) |
| 25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | MM-IQ                                         | [[📑 Paper]](https://arxiv.org/pdf/2502.00698) [[💻 Code]](https://github.com/AceCHQ/MMIQ) |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment   | MM-RLHF-RewardBench, MM-RLHF-SafetyBench      | [[📑 Paper]](https://arxiv.org/abs/2502.10391)                |
| 25.02 | ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models | ZeroBench | [[🌐 Project]](https://zerobench.github.io/) [[🤗 Data]](https://huggingface.co/datasets/jonathan-roberts1/zerobench) [[💻 Code]](https://github.com/jonathan-roberts1/zerobench/) |
| 25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency | MME-CoT                                       | [[📑 Paper]](https://arxiv.org/pdf/2502.09621) [[💻 Code]](https://github.com/CaraJ7/MME-CoT) |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | MM-AlignBench                                 | [[📑 Paper]](https://arxiv.org/abs/2502.18411) [[💻 Code]](https://github.com/PhoenixZ810/OmniAlign-V) |
| 25.01 | AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs | Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency in Visual&Audio | [[📑 Paper]](https://arxiv.org/abs/2501.02135) |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs  | VRCBench                                      | [[📑 Paper]](https://arxiv.org/abs/2501.06186) [[💻 Code]](https://github.com/mbzuai-oryx/LlamaV-o1) |
| 24.12 | Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method | VideoChat-Online | [[📑 Paper]](https://arxiv.org/abs/2501.00584) [[💻 Code]](https://github.com/MCG-NJU/VideoChat-Online) |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | VLRewardBench                                 | [[📑 Paper]](https://arxiv.org/abs/2411.17451)                |
| 24.11 | Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | MH-VidQA | [[📑 Paper]](https://arxiv.org/pdf/2408.14469) [[💻 Code]](https://github.com/qirui-chen/MultiHop-EgoQA) |
| 24.10 | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | Video&Audio Reasoning | [[📑 Paper]](https://arxiv.org/abs/2410.12219) | 
| 24.10 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark | Audio Understanding & Reasoning | [[🌐 Project]](https://sakshi113.github.io/mmau_homepage/) [[📑 Paper]](https://arxiv.org/html/2410.19168v1) [[💻 Code]](https://github.com/Sakshi113/mmau/tree/main) [[🤗 Data]](https://huggingface.co/datasets/apple/mmau) |
| 24.09 | MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | Video Causal Reasoning | [[📑 Paper]](https://arxiv.org/abs/2409.17647) [[💻 Code]](https://github.com/tychen-SJTU/MECD-Benchmark) [[🤗 Data]](https://huggingface.co/datasets/tychen-sjtu/MECD) |
| 24.09 | OmniBench: Towards The Future of Universal Omni-Language Models | Reasoning with Image & Speech/Sound/Music | [[📑 Paper]](https://arxiv.org/abs/2409.15272) [[💻 Code]](https://github.com/multimodal-art-projection/OmniBench) [[🌐 Project]](https://m-a-p.ai/OmniBench/) [[🤗 Data]](https://huggingface.co/datasets/m-a-p/OmniBench) |
| 24.08 | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Music Knowledge & Reasoning | [[🌐 Project]](https://mulab-mir.github.io/muchomusic/) [[📑 Paper]](https://zenodo.org/records/14877459) [[💻 Code]](https://github.com/mulab-mir/muchomusic) [[Data]](https://zenodo.org/records/12709974) |
| 24.07 | REXTIME: A Benchmark Suite for Reasoning-Across-Time in Videos | REXTIME | [[📑 Paper]](https://arxiv.org/abs/2406.19392) [[💻 Code]](https://github.com/ReXTime/ReXTime) |
| 24.06 | AudioBench: A Universal Benchmark for Audio Large Language Models | Speech & Sound Understanding | [[📑 Paper]](https://arxiv.org/pdf/2406.16020) [[💻 Code]](https://github.com/AudioLLMs/AudioBench) |
| 24.06 | ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation | ChartMimic | [[🌐 Project]](https://chartmimic.github.io/) [[📑 Paper]](https://arxiv.org/abs/2406.09961) [[💻 Code]](https://github.com/ChartMimic/ChartMimic) |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | M3CoT                                         | [[📑 Paper]](https://arxiv.org/html/2405.16473v1)             |
| 24.02 | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Speech & Sound Understanding                  | [[📑 Paper]](h