Repository: bytedance/Dolphin
Branch: master
Commit: fcbf0334a9ea
Files: 12
Total size: 85.5 KB

Directory structure:
gitextract_yl96yq_r/

├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── README_CN.md
├── demo_element.py
├── demo_layout.py
├── demo_page.py
├── pyproject.toml
├── requirements.txt
└── utils/
    ├── markdown_utils.py
    └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
coverage.xml

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
.idea/
*.iml

# VS Code
.vscode/
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json

# macOS
.DS_Store

# Windows
Thumbs.db
ehthumbs.db
Desktop.ini

fusion_result.json
kernel_meta/


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  # 1. isort - automatically sort Python imports
  - repo: https://github.com/pycqa/isort
    rev: 6.0.1  # pinned version
    hooks:
      - id: isort
        name: isort (python)
        args: [--profile=black]  # Black-compatible configuration
        language: python

  # 2. Black - automatically format Python code
  - repo: https://github.com/psf/black
    rev: 25.1.0  # pinned version
    hooks:
      - id: black
        language: python

  # 3. flake8 - Python static checks
  - repo: https://github.com/pycqa/flake8
    rev: 7.2.0
    hooks:
      - id: flake8
        args: [--max-line-length=120, --ignore=E203]  # set the maximum line length to 120
        additional_dependencies: [flake8-bugbear==24.12.12]  # optional: extra checks

  # 4. pre-commit-hooks - general-purpose Git hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace  # remove trailing whitespace
      - id: end-of-file-fixer    # ensure files end with a newline
      - id: check-yaml           # validate YAML syntax
      - id: check-added-large-files  # block commits of large files
        args: ["--maxkb=512"]


================================================
FILE: LICENSE
================================================
Qwen RESEARCH LICENSE AGREEMENT

Qwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024

By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.

1. Definitions
    a. This Qwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
    b. "We" (or "Us") shall mean Alibaba Cloud.
    c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
    d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
    e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
    f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
    g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
    h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
    i. "Non-Commercial" shall mean for research or evaluation purposes only.

2. Grant of Rights
    a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY. 
    b. If you are commercially using the Materials, you shall request a license from us.

3. Redistribution
You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
    a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
    b. You shall cause any modified files to carry prominent notices stating that you changed the files;
    c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
    d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.

4. Rules of use
    a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
    b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.

5. Intellectual Property
    a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
    b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
    c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.

6. Disclaimer of Warranty and Limitation of Liability
    a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
    b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
    c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
    d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.

7. Survival and Termination.
    a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
    b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement.

8. Governing Law and Jurisdiction.
    a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
    b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.

9. Other Terms and Conditions.
    a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from us, if you use the Materials in ways not expressly agreed to in this Agreement. 
    b. We shall not be bound by any additional or different terms or conditions communicated by you unless expressly agreed.


================================================
FILE: README.md
================================================
<div align="center">
  <img src="./assets/dolphin.png" width="300">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2505.14059">
    <img src="https://img.shields.io/badge/Paper-arXiv-red">
  </a>
  <a href="https://huggingface.co/ByteDance/Dolphin-v2">
    <img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
  </a>
  <a href="https://github.com/bytedance/Dolphin">
    <img src="https://img.shields.io/badge/Code-Github-green">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/License-MIT-lightgray">
  </a>
  <br>
</div>

<br>

<div align="center">
  <img src="./assets/demo.gif" width="800">
</div>

# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Dolphin-v2 is an enhanced universal document parsing model that substantially improves upon the original Dolphin. It handles both digital-born and photographed documents through a document-type-aware two-stage architecture with scalable anchor prompting.


## 📑 Overview

Document image parsing is challenging due to diverse document types and complexly intertwined elements such as text paragraphs, figures, formulas, tables, and code blocks. Dolphin-v2 addresses these challenges through a document-type-aware two-stage approach:

1. **🔍 Stage 1**: Document type classification (digital vs. photographed) + layout analysis with reading order prediction
2. **🧩 Stage 2**: Hybrid parsing strategy - holistic parsing for photographed documents, parallel element-wise parsing for digital documents

<div align="center">
  <img src="./assets/framework.png" width="680">
</div>

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
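
The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `classify_and_layout`, `parse_page`, and `parse_element` are hypothetical callables standing in for model calls, the `"digital"` and `"photographed"` labels are assumed names, and a thread pool stands in for the model's batched parallel decoding.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical alias: (element_type, element_crop) pairs in reading order.
Layout = List[Tuple[str, str]]


def parse_document(
    page: str,
    classify_and_layout: Callable[[str], Tuple[str, Layout]],
    parse_page: Callable[[str], str],
    parse_element: Callable[[str, str], str],
) -> List[str]:
    """Stage 1 classifies the page and predicts its layout; stage 2 parses it."""
    doc_type, layout = classify_and_layout(page)
    if doc_type == "photographed":
        # Holistic parsing: one pass over the whole page.
        return [parse_page(page)]
    # Digital documents: parse elements in parallel; map() keeps reading order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda item: parse_element(*item), layout))


# Usage with stub callables in place of real model calls.
def fake_stage1(page: str) -> Tuple[str, Layout]:
    return "digital", [("text", "crop_a"), ("table", "crop_b")]


parsed = parse_document(
    "page_1.png", fake_stage1, lambda p: "whole page", lambda t, c: f"{t}:{c}"
)
print(parsed)
```

The branch on document type is the point of the sketch: photographed pages skip element-wise parsing entirely, while digital pages fan out over the elements found in stage 1.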

<!-- ## 🚀 Demo
Try our demo on [Demo-Dolphin](https://huggingface.co/spaces/ByteDance/Dolphin). -->

## 📅 Changelog
- 🔥 **2025.12.12** Released *Dolphin-v2* model. Upgraded to 3B parameters with 21-element detection, attribute field extraction, dedicated formula/code parsing, and robust photographed document parsing. (Dolphin-1.5 moved to [v1.5 branch](https://github.com/bytedance/Dolphin/tree/v1.5))
- 🔥 **2025.10.16** Released *Dolphin-1.5* model. While maintaining the lightweight 0.3B architecture, this version achieves significant parsing improvements. (Dolphin 1.0 moved to [v1.0 branch](https://github.com/bytedance/Dolphin/tree/v1.0))
- 🔥 **2025.07.10** Released the *Fox-Page Benchmark*, a manually refined subset of the original [Fox dataset](https://github.com/ucaslcl/Fox). Download via: [Baidu Yun](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing).
- 🔥 **2025.06.30** Added [TensorRT-LLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.27** Added [vLLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.13** Added multi-page PDF document parsing capability.
- 🔥 **2025.05.21** Our demo is released at [link](http://115.190.42.15:8888/dolphin/). Check it out!
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.
- 🔥 **2025.05.16** Our paper has been accepted by ACL 2025. Paper link: [arXiv](https://arxiv.org/abs/2505.14059).

## 📈 Performance

<table style="width:90%; border-collapse: collapse; text-align: center;">
    <caption>Comprehensive evaluation of document parsing on OmniDocBench (v1.5)</caption>
    <thead>
        <tr>
            <th style="text-align: center !important;">Model</th>
            <th style="text-align: center !important;">Size</th>
            <th style="text-align: center !important;">Overall&#x2191;</th>
            <th style="text-align: center !important;">Text<sup>Edit</sup>&#x2193;</th>
            <th style="text-align: center !important;">Formula<sup>CDM</sup>&#x2191;</th>
            <th style="text-align: center !important;">Table<sup>TEDS</sup>&#x2191;</th>
            <th style="text-align: center !important;">Table<sup>TEDS-S</sup>&#x2191;</th>
            <th style="text-align: center !important;">Read Order<sup>Edit</sup>&#x2193;</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Dolphin</td>
            <td>0.3B</td>
            <td>74.67</td>
            <td>0.125</td>
            <td>67.85</td>
            <td>68.70</td>
            <td>77.77</td>
            <td>0.124</td>
        </tr>
        <tr>
            <td>Dolphin-1.5</td>
            <td>0.3B</td>
            <td>85.06</td>
            <td>0.085</td>
            <td>79.44</td>
            <td>84.25</td>
            <td>88.06</td>
            <td>0.071</td>
        </tr>
        <tr>
            <td>Dolphin-v2</td>
            <td>3B</td>
            <td><strong>89.78</strong></td>
            <td><strong>0.054</strong></td>
            <td><strong>87.63</strong></td>
            <td><strong>87.02</strong></td>
            <td><strong>90.48</strong></td>
            <td><strong>0.054</strong></td>
        </tr>
    </tbody>
</table>
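
To read the table as version-to-version changes, the short script below recomputes two deltas from the reported numbers. It is only a convenience sketch: the scores are copied from the table above, and the helper names `overall_gain` and `edit_reduction_pct` are made up for this example.

```python
# Scores copied from the OmniDocBench (v1.5) table above.
SCORES = {
    "Dolphin": {"overall": 74.67, "text_edit": 0.125},
    "Dolphin-1.5": {"overall": 85.06, "text_edit": 0.085},
    "Dolphin-v2": {"overall": 89.78, "text_edit": 0.054},
}


def overall_gain(old: str, new: str) -> float:
    """Absolute gain in the Overall score (higher is better)."""
    return round(SCORES[new]["overall"] - SCORES[old]["overall"], 2)


def edit_reduction_pct(old: str, new: str) -> float:
    """Relative reduction in text edit distance (lower is better), in percent."""
    before = SCORES[old]["text_edit"]
    after = SCORES[new]["text_edit"]
    return round(100 * (before - after) / before, 1)


print(overall_gain("Dolphin", "Dolphin-v2"))
print(edit_reduction_pct("Dolphin", "Dolphin-v2"))
```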

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models of *Dolphin-v2*:

   Visit our Hugging Face [model card](https://huggingface.co/ByteDance/Dolphin-v2), or download the model with:
   
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model
   # Or use the Hugging Face CLI
   pip install huggingface_hub
   huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
   ```
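
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the wrapper name `download_dolphin` is hypothetical and not part of this repository.

```python
def download_dolphin(
    local_dir: str = "./hf_model",
    repo_id: str = "ByteDance/Dolphin-v2",
) -> str:
    """Download the model snapshot and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    # Requires network access and `pip install huggingface_hub`.
    print(download_dolphin())
```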

## ⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document page into a structured JSON and Markdown format
- **Element-level Parsing**: Parse individual document elements (text, table, formula, code)


### 📄 Page-level Parsing

```bash
# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png 

# Process a single document pdf
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf 

# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs 

# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs \
    --max_batch_size 8
```
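
The `--max_batch_size` flag caps how many elements are decoded together. One plausible way to apply such a cap is to split the element list into fixed-size chunks, as sketched below. This is an assumption about the general technique, not a copy of the logic in `demo_page.py`; `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most max_batch_size items, in order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


# Five element crops with a batch size of 2 give batches of 2, 2, and 1.
for batch in iter_batches(["e0", "e1", "e2", "e3", "e4"], max_batch_size=2):
    print(batch)
```

A larger batch size typically trades GPU memory for throughput, so the right value depends on the hardware.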

### 🧩 Element-level Parsing

````bash
# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
    --input_path  \
    --element_type [table|formula|text|code]
````
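
Each element type maps to a fixed prompt and an output label. The table-driven sketch below uses the prompt strings and labels that appear in `demo_element.py`; the dictionary form and the `select_prompt` helper are a restructuring for illustration, since the script itself uses an `if`/`elif` chain.

```python
from typing import Dict, Tuple

# (prompt, label) pairs as they appear in demo_element.py.
ELEMENT_PROMPTS: Dict[str, Tuple[str, str]] = {
    "table": ("Parse the table in the image.", "tab"),
    "formula": ("Read formula in the image.", "equ"),
    "code": ("Read code in the image.", "code"),
    "text": ("Read text in the image.", "para"),
}


def select_prompt(element_type: str) -> Tuple[str, str]:
    """Return (prompt, label); unknown types fall back to plain text."""
    return ELEMENT_PROMPTS.get(element_type, ELEMENT_PROMPTS["text"])


prompt, label = select_prompt("formula")
print(prompt, label)
```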

### 🎨 Layout Parsing
````bash
# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single PDF document
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs
````


## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration


## 📮 Notice
**Call for Bad Cases:** If you encounter cases where the model performs poorly, we would greatly appreciate it if you could share them in an issue. We are continuously working to optimize and improve the model.

## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)

## 📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=bytedance/Dolphin&type=Date)](https://www.star-history.com/#bytedance/Dolphin&Date)


================================================
FILE: README_CN.md
================================================
<div align="center">
  <img src="./assets/dolphin.png" width="300">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2505.14059">
    <img src="https://img.shields.io/badge/论文-arXiv-red">
  </a>
  <a href="https://huggingface.co/ByteDance/Dolphin-v2">
    <img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
  </a>
  <a href="https://github.com/bytedance/Dolphin">
    <img src="https://img.shields.io/badge/代码-Github-green">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/许可证-MIT-lightgray">
  </a>
  <br>
</div>

<br>

<div align="center">
  <img src="./assets/demo.gif" width="800">
</div>

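# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is an innovative multimodal document image parsing model (**0.3B**) that follows a two-stage analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.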

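## 📑 Overview

Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are complexly intertwined. Dolphin addresses these challenges with a two-stage approach:

1. **🔍 Stage 1**: Comprehensive page-level layout analysis by generating the element sequence in natural reading order
2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts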

<div align="center">
  <img src="./assets/framework.png" width="680">
</div>

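Dolphin achieves strong performance across diverse page-level and element-level parsing tasks, while its lightweight architecture and parallel parsing mechanism keep it efficient.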

## 📅 Changelog
- 🔥 **2025.12.12** Released *Dolphin-v2*! It supports 21-category element detection, attribute field extraction, dedicated code parsing, and photographed document parsing. (Dolphin-1.5 moved to [v1.5 branch](https://github.com/bytedance/Dolphin/tree/v1.5))
- 🔥 **2025.10.16** Released *Dolphin-1.5*! While keeping the lightweight 0.3B architecture, this version achieves significant parsing improvements. (Dolphin 1.0 moved to [v1.0 branch](https://github.com/bytedance/Dolphin/tree/v1.0))
- 🔥 **2025.07.10** Released the *Fox-Page* benchmark, a manually corrected version of the original [Fox dataset](https://github.com/ucaslcl/Fox). Download via: [Baidu Yun](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing).
- 🔥 **2025.06.30** Added [TensorRT-LLM](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md) support for faster inference!
- 🔥 **2025.06.27** Added [vLLM](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md) support for faster inference!
- 🔥 **2025.06.13** Added multi-page PDF document parsing.
- 🔥 **2025.05.21** Our demo is released at [link](http://115.190.42.15:8888/dolphin/). Try it out!
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.
- 🔥 **2025.05.16** Our paper has been accepted by ACL 2025. Paper link: [arXiv](https://arxiv.org/abs/2505.14059).

## 📈 Performance

<table style="width:90%; border-collapse: collapse; text-align: center;">
    <caption>Evaluation results on the OmniDocBench (v1.5) benchmark</caption>
    <thead>
        <tr>
            <th style="text-align: center !important;">Model</th>
            <th style="text-align: center !important;">Size</th>
            <th style="text-align: center !important;">Overall&#x2191;</th>
            <th style="text-align: center !important;">Text<sup>Edit</sup>&#x2193;</th>
            <th style="text-align: center !important;">Formula<sup>CDM</sup>&#x2191;</th>
            <th style="text-align: center !important;">Table<sup>TEDS</sup>&#x2191;</th>
            <th style="text-align: center !important;">Table<sup>TEDS-S</sup>&#x2191;</th>
            <th style="text-align: center !important;">Read Order<sup>Edit</sup>&#x2193;</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Dolphin</td>
            <td>0.3B</td>
            <td>74.67</td>
            <td>0.125</td>
            <td>67.85</td>
            <td>68.70</td>
            <td>77.77</td>
            <td>0.124</td>
        </tr>
        <tr>
            <td>Dolphin-1.5</td>
            <td>0.3B</td>
            <td>85.06</td>
            <td>0.085</td>
            <td>79.44</td>
            <td>84.25</td>
            <td>88.06</td>
            <td>0.071</td>
        </tr>
        <tr>
            <td>Dolphin-v2</td>
            <td>3B</td>
            <td><strong>89.78</strong></td>
            <td><strong>0.054</strong></td>
            <td><strong>87.63</strong></td>
            <td><strong>87.02</strong></td>
            <td><strong>90.48</strong></td>
            <td><strong>0.054</strong></td>
        </tr>
    </tbody>
</table>

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models of *Dolphin-v2* using one of the following options:

   Visit our Hugging Face [model card](https://huggingface.co/ByteDance/Dolphin-v2), or download the model with:
   
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model
   # Or use the Hugging Face CLI
   pip install huggingface_hub
   huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
   ```

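## ⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document page into a structured JSON and Markdown format
- **Element-level Parsing**: Parse individual document elements (text, table, formula, code)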


### 📄 Page-level Parsing

```bash
# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single document PDF
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs

# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs \
    --max_batch_size 8
```

### 🧩 Element-level Parsing

````bash
# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
    --input_path  \
    --element_type [table|formula|text|code]
````

### 🎨 Element Localization and Reading Order Parsing
````bash
# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single document PDF
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs
````


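## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Strong performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration


## 📮 Notice
**Call for Bad Cases:** If you encounter cases where the model performs poorly, we would greatly appreciate it if you could share them in an issue. We are continuously working to optimize and improve the model.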


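## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work: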
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)


## 📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=bytedance/Dolphin&type=Date)](https://www.star-history.com/#bytedance/Dolphin&Date)


================================================
FILE: demo_element.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import argparse
import glob
import os

import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model
        
        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
        self.model.eval()
        
        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        if self.device == "cuda":
            self.model = self.model.bfloat16()
        else:
            self.model = self.model.float()
        
        # set tokenizer
        self.tokenizer = self.processor.tokenizer
        self.tokenizer.padding_side = "left"

    def chat(self, prompt, image):
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)
        
        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
        
        assert len(images) == len(prompts)
        
        # preprocess all images
        processed_images = [resize_img(img) for img in images]
        # generate all messages
        all_messages = []
        for img, question in zip(processed_images, prompts):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": img,
                        },
                        {"type": "text", "text": question}
                    ],
                }
            ]
            all_messages.append(messages)
        # prepare all texts
        texts = [
            self.processor.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=True
            )
            for msgs in all_messages
        ]
        # collect all image inputs
        all_image_inputs = []
        all_video_inputs = None
        for msgs in all_messages:
            image_inputs, video_inputs = process_vision_info(msgs)
            all_image_inputs.extend(image_inputs)
        # prepare model inputs
        inputs = self.processor(
            text=texts,
            images=all_image_inputs if all_image_inputs else None,
            videos=all_video_inputs if all_video_inputs else None,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(self.model.device)
        # inference
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=4096,
            # repetition_penalty=1.05
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] 
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        
        results = self.processor.batch_decode(
            generated_ids_trimmed, 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=False
        )
        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_element(image_path, model, element_type, save_dir=None):
    """Process a single element image (text, table, formula, or code)
    
    Args:
        image_path: Path to the element image
        model: DOLPHIN model instance
        element_type: Type of element ('text', 'table', 'formula', 'code')
        save_dir: Directory to save results (default: same as input directory)
        
    Returns:
        Parsed content of the element and recognition results
    """
    # Load and prepare image
    pil_image = Image.open(image_path).convert("RGB")
    # pil_image = crop_margin(pil_image)
    
    # Select appropriate prompt based on element type
    if element_type == "table":
        prompt = "Parse the table in the image."
        label = "tab"
    elif element_type == "formula":
        prompt = "Read formula in the image."
        label = "equ"
    elif element_type == "code":
        prompt = "Read code in the image."
        label = "code"
    else:  # Default to text
        prompt = "Read text in the image."
        label = "para"
    
    # Process the element
    result = model.chat(prompt, pil_image)
    
    # Create recognition result in the same format as the document parser
    recognition_results = [
        {
            "label": label,
            "text": result.strip(),
        }
    ]
    
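    # Save results, named after the input file without its extension
    # (splitext keeps dots inside the name, e.g. "fig.1.png" -> "fig.1")
    image_name = os.path.splitext(os.path.basename(image_path))[0]
    save_outputs(recognition_results, pil_image, image_name, save_dir)
    print(f"Results saved to {save_dir}")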
    
    return result, recognition_results
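

# Illustrative sketch, not part of the upstream demo: the if/elif chain in
# process_element() can also be read as a lookup table. ELEMENT_PROMPTS and
# select_prompt are hypothetical names; the prompt strings and labels are
# copied from process_element() above.
ELEMENT_PROMPTS = {
    "table": ("Parse the table in the image.", "tab"),
    "formula": ("Read formula in the image.", "equ"),
    "code": ("Read code in the image.", "code"),
    "text": ("Read text in the image.", "para"),
}


def select_prompt(element_type):
    """Return (prompt, label) for an element type, defaulting to plain text."""
    return ELEMENT_PROMPTS.get(element_type, ELEMENT_PROMPTS["text"])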


def main():
    parser = argparse.ArgumentParser(description="Element-level processing using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, required=True, help="Path to input image or directory of images")
    parser.add_argument(
        "--element_type",
        type=str,
        choices=["text", "table", "formula", "code"],
        default="text",
        help="Type of element to process (text, table, formula, code)",
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument("--print_results", action="store_true", help="Print recognition results to console")
    args = parser.parse_args()
    
    # Load Model
    model = DOLPHIN(args.model_path)
    
    # Set save directory
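    # abspath keeps dirname non-empty when input_path is a bare filename such as "page.png"
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(os.path.abspath(args.input_path))
    )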
    setup_output_dirs(save_dir)
    
    # Collect Images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
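        # De-duplicate: where glob matches case-insensitively, "*.jpg" and "*.JPG" return the same files
        image_files = sorted(set(image_files))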
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]
    
    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")
    
    # Process images one by one
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            result, recognition_result = process_element(
                image_path=image_path,
                model=model,
                element_type=args.element_type,
                save_dir=save_dir,
            )

            if args.print_results:
                print("\nRecognition result:")
                print(result)
                print("-" * 40)
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()


================================================
FILE: demo_layout.py
================================================
""" 
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import argparse
import glob
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model
        
        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
        self.model.eval()
        
        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        if self.device == "cuda":
            self.model = self.model.bfloat16()
        else:
            self.model = self.model.float()
        
        # set tokenizer
        self.tokenizer = self.processor.tokenizer
        self.tokenizer.padding_side = "left"

    def chat(self, prompt, image):
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)
        
        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
        
        assert len(images) == len(prompts)
        
        # preprocess all images
        processed_images = [resize_img(img) for img in images]
        # generate all messages
        all_messages = []
        for img, question in zip(processed_images, prompts):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": img,
                        },
                        {"type": "text", "text": question}
                    ],
                }
            ]
            all_messages.append(messages)

        # prepare all texts
        texts = [
            self.processor.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=True
            )
            for msgs in all_messages
        ]

        # collect all image inputs
        all_image_inputs = []
        all_video_inputs = None
        for msgs in all_messages:
            image_inputs, video_inputs = process_vision_info(msgs)
            all_image_inputs.extend(image_inputs)

        # prepare model inputs
        inputs = self.processor(
            text=texts,
            images=all_image_inputs if all_image_inputs else None,
            videos=all_video_inputs if all_video_inputs else None,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(self.model.device)

        # inference
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=4096,
            do_sample=False,
            temperature=None,
            # repetition_penalty=1.05
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] 
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        
        results = self.processor.batch_decode(
            generated_ids_trimmed, 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=False
        )

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results
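

# Illustrative sketch, not part of the upstream demo: how chat() strips the
# prompt from each generated sequence. It assumes, as with generate() on a
# left-padded batch, that every output row begins with its full input row.
# trim_prompt_tokens is a hypothetical name and uses plain lists, not tensors.
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]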


def process_layout(input_path, model, save_dir):
    """Process layout detection for image or PDF
    
    Args:
        input_path: Path to input image or PDF
        model: DOLPHIN model instance
        save_dir: Directory to save results
    """
    file_ext = os.path.splitext(input_path)[1].lower()
    
    if file_ext == '.pdf':
        # Convert PDF to images
        images = convert_pdf_to_images(input_path)
        if not images:
            raise RuntimeError(f"Failed to convert PDF {input_path} to images")
        
        # Process each page
        for page_idx, pil_image in enumerate(images):
            print(f"\nProcessing page {page_idx + 1}/{len(images)}")
            
            # Generate output name for this page
            base_name = os.path.splitext(os.path.basename(input_path))[0]
            page_name = f"{base_name}_page_{page_idx + 1:03d}"
            
            # Process layout for this page
            process_single_layout(pil_image, model, save_dir, page_name)
    
    else:
        # Process regular image file
        pil_image = Image.open(input_path).convert("RGB")
        base_name = os.path.splitext(os.path.basename(input_path))[0]
        process_single_layout(pil_image, model, save_dir, base_name)


def process_single_layout(pil_image, model, save_dir, image_name):
    """Process layout for a single image
    
    Args:
        pil_image: PIL Image object
        model: DOLPHIN model instance
        save_dir: Directory to save results
        image_name: Name for the output files
    """
    # Parse layout
    print("Parsing layout and reading order...")
    layout_results = model.chat("Parse the reading order of this document.", pil_image)

    # Parse the layout string
    layout_results_list = parse_layout_string(layout_results)
    if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")):
        layout_results_list = [([0, 0, *pil_image.size], 'distorted_page', [])]
    
    # map bbox to original image coordinates
    recognition_results = []
    reading_order = 0
    for bbox, label, tags in layout_results_list:
        x1, y1, x2, y2 = process_coordinates(bbox, pil_image)
        recognition_results.append(
            {
                "label": label,
                "bbox": [x1, y1, x2, y2],
                "text": "",  # empty for now
                "reading_order": reading_order,
                "tags": tags,
            }
        )
        reading_order += 1
    save_outputs(recognition_results, pil_image, image_name, save_dir)


def main():
    parser = argparse.ArgumentParser(description="Layout detection and visualization using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument(
        "--input_path", 
        type=str, 
        required=True, 
        help="Path to input image/PDF or directory of files"
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save results (default: same as input directory)",
    )
    args = parser.parse_args()
    
    # Load Model
    print("Loading model...")
    model = DOLPHIN(args.model_path)
    
    # Set save directory
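    # abspath keeps dirname non-empty when input_path is a bare filename such as "page.png"
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(os.path.abspath(args.input_path))
    )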
    
    # Create save directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)
    
    # Collect files
    if os.path.isdir(args.input_path):
        # Support both image and PDF files
        file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"]
        
        input_files = []
        for ext in file_extensions:
            input_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
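        # De-duplicate: where glob matches case-insensitively, "*.jpg" and "*.JPG" return the same files
        input_files = sorted(set(input_files))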
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        
        # Check if it's a supported file type
        file_ext = os.path.splitext(args.input_path)[1].lower()
        supported_exts = ['.jpg', '.jpeg', '.png', '.pdf']
        
        if file_ext not in supported_exts:
            raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}")
        
        input_files = [args.input_path]
    
    total_files = len(input_files)
    print(f"\nTotal files to process: {total_files}")
    
    # Process files
    for file_path in input_files:
        print(f"\n{'='*60}")
        print(f"Processing: {file_path}")
        print('='*60)
        
        try:
            process_layout(
                input_path=file_path,
                model=model,
                save_dir=save_dir,
            )
            print(f"\n✓ Processing completed for {file_path}")
            
        except Exception as e:
            print(f"\n✗ Error processing {file_path}: {str(e)}")
            continue
    
    print(f"\n{'='*60}")
    print(f"All processing completed. Results saved to {save_dir}")
    print('='*60)


if __name__ == "__main__":
    main()


================================================
FILE: demo_page.py
================================================
""" 
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import argparse
import glob
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model
        
        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
        self.model.eval()
        
        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        if self.device == "cuda":
            self.model = self.model.bfloat16()
        else:
            self.model = self.model.float()
        
        # set tokenizer
        self.tokenizer = self.processor.tokenizer
        self.tokenizer.padding_side = "left"

    def chat(self, prompt, image):
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)
        
        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
        
        assert len(images) == len(prompts)
        
        # preprocess all images
        processed_images = [resize_img(img) for img in images]
        # generate all messages
        all_messages = []
        for img, question in zip(processed_images, prompts):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": img,
                        },
                        {"type": "text", "text": question}
                    ],
                }
            ]
            all_messages.append(messages)

        # prepare all texts
        texts = [
            self.processor.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=True
            )
            for msgs in all_messages
        ]

        # collect all image inputs
        all_image_inputs = []
        all_video_inputs = None
        for msgs in all_messages:
            image_inputs, video_inputs = process_vision_info(msgs)
            all_image_inputs.extend(image_inputs)

        # prepare model inputs
        inputs = self.processor(
            text=texts,
            images=all_image_inputs if all_image_inputs else None,
            videos=all_video_inputs if all_video_inputs else None,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(self.model.device)

        # inference
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=4096,
            do_sample=False,
            temperature=None,
            # repetition_penalty=1.05
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] 
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        
        results = self.processor.batch_decode(
            generated_ids_trimmed, 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=False
        )

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_document(document_path, model, save_dir, max_batch_size=None):
    """Parse documents with two stages - Handles both images and PDFs"""
    file_ext = os.path.splitext(document_path)[1].lower()
    
    if file_ext == '.pdf':
        # Convert PDF to images
        images = convert_pdf_to_images(document_path)
        if not images:
            raise RuntimeError(f"Failed to convert PDF {document_path} to images")
        
        all_results = []
        
        # Process each page
        for page_idx, pil_image in enumerate(images):
            print(f"Processing page {page_idx + 1}/{len(images)}")
            
            # Generate output name for this page
            base_name = os.path.splitext(os.path.basename(document_path))[0]
            page_name = f"{base_name}_page_{page_idx + 1:03d}"
            
            # Process this page (don't save individual page results)
            json_path, recognition_results = process_single_image(
                pil_image, model, save_dir, page_name, max_batch_size, save_individual=False
            )
            
            # Add page information to results
            page_results = {
                "page_number": page_idx + 1,
                "elements": recognition_results
            }
            all_results.append(page_results)
        
        # Save combined results for multi-page PDF
        combined_json_path = save_combined_pdf_results(all_results, document_path, save_dir)
        
        return combined_json_path, all_results
    
    else:
        # Process regular image file
        pil_image = Image.open(document_path).convert("RGB")
        base_name = os.path.splitext(os.path.basename(document_path))[0]
        return process_single_image(pil_image, model, save_dir, base_name, max_batch_size)


def process_single_image(image, model, save_dir, image_name, max_batch_size=None, save_individual=True):
    """Process a single image (either from file or converted from PDF page)
    
    Args:
        image: PIL Image object
        model: DOLPHIN model instance
        save_dir: Directory to save results
        image_name: Name for the output file
        max_batch_size: Maximum batch size for processing
        save_individual: Whether to save individual results (False for PDF pages)
        
    Returns:
        Tuple of (json_path, recognition_results)
    """
    # Stage 1: Page-level layout and reading order parsing
    layout_output = model.chat("Parse the reading order of this document.", image)
    # print(layout_output)

    # Stage 2: Element-level content parsing
    recognition_results = process_elements(layout_output, image, model, max_batch_size, save_dir, image_name)

    # Save outputs only if requested (skip for PDF pages)
    json_path = None
    if save_individual:
        # Persist the recognition results for this image
        json_path = save_outputs(recognition_results, image, image_name, save_dir)

    return json_path, recognition_results


def process_elements(layout_results, image, model, max_batch_size, save_dir=None, image_name=None):
    """Parse all document elements with parallel decoding"""
    layout_results_list = parse_layout_string(layout_results)
    if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")):
        layout_results_list = [([0, 0, *image.size], 'distorted_page', [])]
    # Check for bbox overlap - if too many overlaps, treat as distorted page
    elif len(layout_results_list) > 1 and check_bbox_overlap(layout_results_list, image):
        print("Falling back to distorted_page mode due to high bbox overlap")
        layout_results_list = [([0, 0, *image.size], 'distorted_page', [])]
        
    tab_elements = []      
    equ_elements = []     
    code_elements = []    
    text_elements = []     
    figure_results = []    
    reading_order = 0

    # Collect elements and group
    for bbox, label, tags in layout_results_list:
        try:
            if label == "distorted_page":
                x1, y1, x2, y2 = 0, 0, *image.size
                pil_crop = image
            else:
                # get coordinates in the original image
                x1, y1, x2, y2 = process_coordinates(bbox, image)
                # crop the image
                pil_crop = image.crop((x1, y1, x2, y2))

            if pil_crop.size[0] > 3 and pil_crop.size[1] > 3:
                if label == "fig":
                    figure_filename = save_figure_to_local(pil_crop, save_dir, image_name, reading_order)
                    figure_results.append({
                        "label": label,
                        "text": f"![Figure](figures/{figure_filename})",
                        "figure_path": f"figures/{figure_filename}",
                        "bbox": [x1, y1, x2, y2],
                        "reading_order": reading_order,
                        "tags": tags,
                    })
                else:
                    # Prepare element information
                    element_info = {
                        "crop": pil_crop,
                        "label": label,
                        "bbox": [x1, y1, x2, y2],
                        "reading_order": reading_order,
                        "tags": tags,
                    }
                    
                    if label == "tab":
                        tab_elements.append(element_info)
                    elif label == "equ":
                        equ_elements.append(element_info)
                    elif label == "code":
                        code_elements.append(element_info)
                    else:
                        text_elements.append(element_info)

            reading_order += 1

        except Exception as e:
            print(f"Error processing bbox with label {label}: {str(e)}")
            continue

    recognition_results = figure_results.copy()
    
    if tab_elements:
        results = process_element_batch(tab_elements, model, "Parse the table in the image.", max_batch_size)
        recognition_results.extend(results)
    
    if equ_elements:
        results = process_element_batch(equ_elements, model, "Read formula in the image.", max_batch_size)
        recognition_results.extend(results)
    
    if code_elements:
        results = process_element_batch(code_elements, model, "Read code in the image.", max_batch_size)
        recognition_results.extend(results)
    
    if text_elements:
        results = process_element_batch(text_elements, model, "Read text in the image.", max_batch_size)
        recognition_results.extend(results)

    recognition_results.sort(key=lambda x: x.get("reading_order", 0))

    return recognition_results


def process_element_batch(elements, model, prompt, max_batch_size=None):
    """Process elements of the same type in batches"""
    results = []
    if not elements:
        # Nothing to process; also avoids range() with a zero step below
        return results
    
    # Determine batch size
    batch_size = len(elements)
    if max_batch_size is not None and max_batch_size > 0:
        batch_size = min(batch_size, max_batch_size)
    
    # Process in batches
    for i in range(0, len(elements), batch_size):
        batch_elements = elements[i:i+batch_size]
        crops_list = [elem["crop"] for elem in batch_elements]
        
        # Use the same prompt for all elements in the batch
        prompts_list = [prompt] * len(crops_list)
        
        # Batch inference
        batch_results = model.chat(prompts_list, crops_list)
        
        # Add results
        for j, result in enumerate(batch_results):
            elem = batch_elements[j]
            results.append({
                "label": elem["label"],
                "bbox": elem["bbox"],
                "text": result.strip(),
                "reading_order": elem["reading_order"],
                "tags": elem["tags"],
            })
    
    return results
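

# Illustrative sketch, not part of the upstream demo: the slicing used by
# process_element_batch(), isolated from the model call. split_into_batches
# is a hypothetical name; a max_batch_size of None or <= 0 is assumed to mean
# "one batch with everything", mirroring the logic above.
def split_into_batches(items, max_batch_size=None):
    """Split items into consecutive batches of at most max_batch_size."""
    if not items:
        return []
    batch_size = len(items)
    if max_batch_size is not None and max_batch_size > 0:
        batch_size = min(batch_size, max_batch_size)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]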


def main():
    parser = argparse.ArgumentParser(description="Document parsing based on DOLPHIN")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, default="./demo", help="Path to input image/PDF or directory of files")
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument(
        "--max_batch_size",
        type=int,
        default=4,
        help="Maximum number of document elements to parse in a single batch (default: 4)",
    )
    args = parser.parse_args()

    # Load Model
    model = DOLPHIN(args.model_path)

    # Collect Document Files (images and PDFs)
    if os.path.isdir(args.input_path):
        # Support both image and PDF files
        file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"]
        
        document_files = []
        for ext in file_extensions:
            document_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
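        # De-duplicate: where glob matches case-insensitively, "*.jpg" and "*.JPG" return the same files
        document_files = sorted(set(document_files))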
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        
        # Check if it's a supported file type
        file_ext = os.path.splitext(args.input_path)[1].lower()
        supported_exts = ['.jpg', '.jpeg', '.png', '.pdf']
        
        if file_ext not in supported_exts:
            raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}")
        
        document_files = [args.input_path]

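    # abspath keeps dirname non-empty when input_path is a bare filename such as "page.png"
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(os.path.abspath(args.input_path))
    )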
    setup_output_dirs(save_dir)

    total_samples = len(document_files)
    print(f"\nTotal files to process: {total_samples}")

    # Process All Document Files
    for file_path in document_files:
        print(f"\nProcessing {file_path}")
        try:
            json_path, recognition_results = process_document(
                document_path=file_path,
                model=model,
                save_dir=save_dir,
                max_batch_size=args.max_batch_size,
            )

            print(f"Processing completed. Results saved to {save_dir}")

        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()


================================================
FILE: pyproject.toml
================================================
[tool.black]
line-length = 120
include = '\.pyi?$'
exclude = '''
/(
    \.git
  | \.hg
  | \.mypy_cache
  | \.tox
  | \.venv
  | _build
  | buck-out
  | build
  | dist
)/
'''


================================================
FILE: requirements.txt
================================================
datasets==3.6.0
torch==2.6.0
torchvision==0.21.0
transformers==4.51.0
deepspeed==0.16.4
triton==3.2.0
accelerate==1.4.0
torchcodec==0.2
decord==0.6.0
Levenshtein==0.27.1
qwen_vl_utils
matplotlib
jieba
opencv-python
bs4
albumentations==1.4.0
pymupdf==1.26


================================================
FILE: utils/markdown_utils.py
================================================
""" 
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import re
import base64
from typing import List, Dict, Any, Optional


def extract_table_from_html(html_string):
    """Extract and clean table tags from HTML string"""
    try:
        table_pattern = re.compile(r'<table.*?>.*?</table>', re.DOTALL)
        tables = table_pattern.findall(html_string)
        tables = [re.sub(r'<table[^>]*>', '<table>', table) for table in tables]
        return '\n'.join(tables)
    except Exception as e:
        print(f"extract_table_from_html error: {str(e)}")
        return f"<table><tr><td>Error extracting table: {str(e)}</td></tr></table>"


class MarkdownConverter:
    """Convert structured recognition results to Markdown format"""
    
    def __init__(self):
        # Define heading levels for different section types
        self.heading_levels = {
            'sec_0': '#',
            'sec_1': '##',
            'sec_2': '###',
            'sec_3': '###',
            'sec_4': '###',
            'sec_5': '###',
        }
        
        # Define which labels need special handling
        self.special_labels = {
            'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5',
            'list', 'equ', 'tab', 'fig'
        }

        # Define replacements for special formulas.
        # Raw strings avoid invalid-escape warnings such as "\m" or "\q".
        self.replace_dict = {
            r'\bm': r'\mathbf ',
            r'\eqno': r'\quad ',
            r'\quad': r'\quad ',
            r'\leq': r'\leq ',
            r'\pm': r'\pm ',
            r'\varmathbb': r'\mathbb ',
            r'\in fty': r'\infty',
            r'\mu': r'\mu ',
            r'\cdot': r'\cdot ',
            r'\langle': r'\langle ',
        }
    
    def try_remove_newline(self, text: str) -> str:
        """
        Join hard-wrapped lines into paragraphs: drop end-of-line hyphenation,
        merge consecutive lines (no space between Chinese characters), and
        keep blank lines as paragraph breaks.
        """
        try:
            # Preprocess text to handle line breaks
            text = text.strip()
            text = text.replace('-\n', '')
            
            # Handle Chinese text line breaks
            def is_chinese(char):
                return '\u4e00' <= char <= '\u9fff'

            lines = text.split('\n')
            processed_lines = []
            
            # Process all lines except the last one
            for i in range(len(lines)-1):
                current_line = lines[i].strip()
                next_line = lines[i+1].strip()
                
                # Always add the current line, but determine if we need a newline
                if current_line:  # If current line is not empty
                    if next_line:  # If next line is not empty
                        # For Chinese text handling
                        if is_chinese(current_line[-1]) and is_chinese(next_line[0]):
                            processed_lines.append(current_line)
                        else:
                            processed_lines.append(current_line + ' ')
                    else:
                        # Next line is empty, add current line with newline
                        processed_lines.append(current_line + '\n')
                else:
                    # Current line is empty, add an empty line
                    processed_lines.append('\n')
            
            # Add the last line
            if lines and lines[-1].strip():
                processed_lines.append(lines[-1].strip())
            
            text = ''.join(processed_lines)
            return text
        
        except Exception as e:
            print(f"try_remove_newline error: {str(e)}")
            return text  # Return original text on error

    def _handle_text(self, text: str) -> str:
        """
        Process regular text content, preserving paragraph structure
        """
        try:
            if not text:
                return ""
            
            # Process formulas in text before handling other text processing
            text = self._process_formulas_in_text(text)
            text = self.try_remove_newline(text)
            return text
        except Exception as e:
            print(f"_handle_text error: {str(e)}")
            return text  # Return original text on error
    
    def _process_formulas_in_text(self, text: str) -> str:
        """
        Normalize LaTeX commands in text that may contain formulas.
        - Replace \\upmu with \\mu
        - Apply every substitution in self.replace_dict
        """
        try:
            text = text.replace(r'\upmu', r'\mu')
            for key, value in self.replace_dict.items():
                text = text.replace(key, value)
            return text
        
        except Exception as e:
            print(f"_process_formulas_in_text error: {str(e)}")
            return text  # Return original text on error
    
    def _remove_newline_in_heading(self, text: str) -> str:
        """
        Remove newline in heading
        """
        try:
            # Handle Chinese text line breaks
            def is_chinese(char):
                return '\u4e00' <= char <= '\u9fff'
            
            # Check if the text contains Chinese characters
            if any(is_chinese(char) for char in text):
                return text.replace('\n', '')
            else:
                return text.replace('\n', ' ')

        except Exception as e:
            print(f"_remove_newline_in_heading error: {str(e)}")
            return text
    
    def _handle_heading(self, text: str, label: str) -> str:
        """
        Convert section headings to appropriate markdown format
        """
        try:
            level = self.heading_levels.get(label, '#')
            text = text.strip()
            text = self._remove_newline_in_heading(text)
            text = self._handle_text(text)
            return f"{level} {text}\n\n"
        
        except Exception as e:
            print(f"_handle_heading error: {str(e)}")
            return f"# Error processing heading: {text}\n\n"
    
    def _handle_list_item(self, text: str) -> str:
        """
        Convert list items to markdown list format
        """
        try:
            return f"- {text.strip()}\n"
        except Exception as e:
            print(f"_handle_list_item error: {str(e)}")
            return f"- Error processing list item: {text}\n"
    
    def _handle_figure(self, text: str, section_count: int) -> str:
        """
        Handle figure content
        """
        try:
            # Check if it's a file path starting with "figures/"
            if text.startswith("figures/"):
                # Convert to relative path from markdown directory to figures directory
                relative_path = f"../{text}"
                return f"![Figure {section_count}]({relative_path})\n\n"

            # Check if it's already a markdown format image link
            if text.startswith("!["):
                # Already in markdown format, return directly
                return f"{text}\n\n"

            # If it's still base64 format, maintain original logic
            if text.startswith("data:image/"):
                return f"![Figure {section_count}]({text})\n\n"
            elif ";" in text and "," in text:
                return f"![Figure {section_count}]({text})\n\n"
            else:
                # Assume it's raw base64, convert to data URI
                img_format = "png"
                data_uri = f"data:image/{img_format};base64,{text}"
                return f"![Figure {section_count}]({data_uri})\n\n"
                
        except Exception as e:
            print(f"_handle_figure error: {str(e)}")
            return f"*[Error processing figure: {str(e)}]*\n\n"

    def _handle_table(self, text: str) -> str:
        """
        Convert table content to markdown format
        """
        try:
            markdown_content = []
            markdown_table = extract_table_from_html(text)
            markdown_content.append(markdown_table + "\n")
            return '\n'.join(markdown_content) + '\n\n'
        
        except Exception as e:
            print(f"_handle_table error: {str(e)}")
            return f"*[Error processing table: {str(e)}]*\n\n"

    def _handle_formula(self, text: str) -> str:
        """
        Handle formula-specific content
        """
        try:
            text = text.strip('$').rstrip("\\ ").replace(r'\upmu', r'\mu')
            for key, value in self.replace_dict.items():
                text = text.replace(key, value)
            processed_text = '$$' + text + '$$'
            return f"{processed_text}\n\n"
        
        except Exception as e:
            print(f"_handle_formula error: {str(e)}")
            return f"*[Error processing formula: {str(e)}]*\n\n"

    def convert(self, recognition_results: List[Dict[str, Any]]) -> str:
        """
        Convert recognition results to markdown format
        """
        try:
            markdown_content = []
            
            for section_count, result in enumerate(recognition_results):
                try:
                    label = result.get('label', '')
                    text = result.get('text', '').strip()
                    
                    # Skip empty text
                    if not text:
                        continue
                        
                    # Handle different content types
                    if label in {'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5'}:
                        markdown_content.append(self._handle_heading(text, label))
                    elif label == 'fig':
                        markdown_content.append(self._handle_figure(text, section_count))
                    elif label == 'tab':
                        markdown_content.append(self._handle_table(text))
                    elif label == 'equ':
                        markdown_content.append(self._handle_formula(text))
                    elif label == 'list':
                        markdown_content.append(self._handle_list_item(text))
                    elif label == 'code':
                        markdown_content.append(f"```bash\n{text}\n```\n\n")
                    else:
                        # Handle regular text (paragraphs, etc.)
                        processed_text = self._handle_text(text)
                        markdown_content.append(f"{processed_text}\n\n")
                        # TODO: handle distorted pages

                except Exception as e:
                    print(f"Error processing item {section_count}: {str(e)}")
                    # Add a placeholder for the failed item
                    markdown_content.append("*[Error processing content]*\n\n")
            
            # Join all content and apply post-processing
            result = ''.join(markdown_content)
            return result
        
        except Exception as e:
            print(f"convert error: {str(e)}")
            return f"Error generating markdown content: {str(e)}"


================================================
FILE: utils/utils.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import io
import json
import os
import re
from dataclasses import dataclass
from typing import List, Tuple

import cv2
import numpy as np
import pymupdf
from PIL import Image
from qwen_vl_utils import smart_resize
from utils.markdown_utils import MarkdownConverter


def save_figure_to_local(pil_crop, save_dir, image_name, reading_order):
    """Save cropped figure to local file system

    Args:
        pil_crop: PIL Image object of the cropped figure
        save_dir: Base directory to save results
        image_name: Name of the source image/document
        reading_order: Reading order of the figure in the document

    Returns:
        str: Filename of the saved figure
    """
    try:
        # Create figures directory if it doesn't exist
        figures_dir = os.path.join(save_dir, "markdown", "figures")
        os.makedirs(figures_dir, exist_ok=True)

        # Generate figure filename
        figure_filename = f"{image_name}_figure_{reading_order:03d}.png"
        figure_path = os.path.join(figures_dir, figure_filename)

        # Save the figure (PNG is lossless, so no quality setting applies)
        pil_crop.save(figure_path, format="PNG")

        # print(f"Saved figure: {figure_filename}")
        return figure_filename

    except Exception as e:
        print(f"Error saving figure: {str(e)}")
        # Return a fallback filename
        return f"{image_name}_figure_{reading_order:03d}_error.png"


def convert_pdf_to_images(pdf_path, target_size=896):
    """Convert PDF pages to images

    Args:
        pdf_path: Path to PDF file
        target_size: Target size for the longest dimension

    Returns:
        List of PIL Images
    """
    images = []
    try:
        doc = pymupdf.open(pdf_path)

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Calculate scale to make longest dimension equal to target_size
            rect = page.rect
            scale = target_size / max(rect.width, rect.height)

            # Render page as image
            mat = pymupdf.Matrix(scale, scale)
            pix = page.get_pixmap(matrix=mat)

            # Convert to PIL Image
            img_data = pix.tobytes("png")
            pil_image = Image.open(io.BytesIO(img_data))
            images.append(pil_image)

        doc.close()
        print(f"Successfully converted {len(images)} pages from PDF")
        return images

    except Exception as e:
        print(f"Error converting PDF to images: {str(e)}")
        return []


def save_combined_pdf_results(all_page_results, pdf_path, save_dir):
    """Save combined results for multi-page PDF with both JSON and Markdown

    Args:
        all_page_results: List of results for all pages
        pdf_path: Path to original PDF file
        save_dir: Directory to save results

    Returns:
        Path to saved combined JSON file
    """
    # Create output filename based on PDF name
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]

    # Prepare combined results
    combined_results = {"source_file": pdf_path, "total_pages": len(all_page_results), "pages": all_page_results}

    # Save combined JSON results
    json_filename = f"{base_name}.json"
    json_path = os.path.join(save_dir, "recognition_json", json_filename)
    os.makedirs(os.path.dirname(json_path), exist_ok=True)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(combined_results, f, indent=2, ensure_ascii=False)

    # Generate and save combined markdown
    try:
        markdown_converter = MarkdownConverter()

        # Combine all page results into a single list for markdown conversion
        all_elements = []
        for page_data in all_page_results:
            page_elements = page_data.get("elements", [])
            if page_elements:
                # Add page separator if not the first page
                if all_elements:
                    all_elements.append(
                        {"label": "page_separator", "text": "\n\n---\n\n", "reading_order": len(all_elements)}
                    )
                all_elements.extend(page_elements)

        # Generate markdown content
        markdown_content = markdown_converter.convert(all_elements)

        # Save markdown file
        markdown_filename = f"{base_name}.md"
        markdown_path = os.path.join(save_dir, "markdown", markdown_filename)
        os.makedirs(os.path.dirname(markdown_path), exist_ok=True)

        with open(markdown_path, "w", encoding="utf-8") as f:
            f.write(markdown_content)

        # print(f"Combined markdown saved to: {markdown_path}")

    except ImportError:
        print("MarkdownConverter not available, skipping markdown generation")
    except Exception as e:
        print(f"Error generating markdown: {e}")

    # print(f"Combined JSON results saved to: {json_path}")
    return json_path


def extract_labels_from_string(text):
    """
    Extract the non-coordinate bracketed labels from a layout segment.

    Example: '[202,217,921,325][para][author]' -> ['para', 'author']
    """
    all_matches = re.findall(r'\[([^\]]+)\]', text)
    
    labels = []
    for match in all_matches:
        # Skip the bbox group; accept the same integer/decimal form as parse_layout_string
        if not re.match(r'^\d*\.?\d+(,\d*\.?\d+){3}$', match):
            labels.append(match)
    
    return labels


def parse_layout_string(bbox_str):
    """
    Dolphin-V1.5 layout string parsing function
    Parse layout string to extract bbox and category information
    Supports multiple formats:
    1. Original format: [x1,y1,x2,y2] label
    2. New format: [x1,y1,x2,y2][label][PAIR_SEP] or [x1,y1,x2,y2][label][meta_info][PAIR_SEP]
    """
    parsed_results = []
    
    segments = bbox_str.split('[PAIR_SEP]')
    new_segments = []
    for seg in segments:
        new_segments.extend(seg.split('[RELATION_SEP]'))
    segments = new_segments
    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue
        
        coord_pattern = r'\[(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+)\]'
        coord_match = re.search(coord_pattern, segment)
        label_matches = extract_labels_from_string(segment)
        
        if coord_match and label_matches:
            coords = [float(coord_match.group(i)) for i in range(1, 5)]
            label = label_matches[0].strip()
            parsed_results.append((coords, label, label_matches[1:]))  # label_matches[1:] are the tags
    
    return parsed_results


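def process_coordinates(coords, pil_image):
    """Map a bbox from the model's resized input space back to original image pixels

    Args:
        coords: [x1, y1, x2, y2] in the resized coordinate space
        pil_image: Original PIL Image

    Returns:
        Tuple (x1, y1, x2, y2) of ints clamped to the original image bounds
    """
    original_w, original_h = pil_image.size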
    # use the same resize logic as the model
    resized_pil = resize_img(pil_image)
    resized_image = np.array(resized_pil)
    resized_h, resized_w = resized_image.shape[:2]
    resized_h, resized_w = smart_resize(resized_h, resized_w, factor=28, min_pixels=784, max_pixels=2560000)

    w_ratio, h_ratio = original_w / resized_w, original_h / resized_h
    x1 = int(coords[0] * w_ratio)
    y1 = int(coords[1] * h_ratio)
    x2 = int(coords[2] * w_ratio)
    y2 = int(coords[3] * h_ratio)

    x1 = max(0, min(x1, original_w - 1))
    y1 = max(0, min(y1, original_h - 1))
    x2 = max(x1 + 1, min(x2, original_w))
    y2 = max(y1 + 1, min(y2, original_h))
    return x1, y1, x2, y2


def setup_output_dirs(save_dir):
    """Create necessary output directories"""
    os.makedirs(save_dir, exist_ok=True)
    os.makedirs(os.path.join(save_dir, "markdown"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "output_json"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "markdown", "figures"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "layout_visualization"), exist_ok=True)


def save_outputs(recognition_results, image, image_name, save_dir):
    """Save JSON and markdown outputs"""

    # Save JSON file
    json_path = os.path.join(save_dir, "output_json", f"{image_name}.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(recognition_results, f, ensure_ascii=False, indent=2)

    # Generate and save markdown file
    markdown_converter = MarkdownConverter()
    markdown_content = markdown_converter.convert(recognition_results)
    markdown_path = os.path.join(save_dir, "markdown", f"{image_name}.md")
    with open(markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown_content)

    # visualize layout
    # Save visualization (pass original PIL image for coordinate mapping)
    vis_path = os.path.join(save_dir, "layout_visualization", f"{image_name}_layout.png")

    visualize_layout(image, recognition_results, vis_path)
    return json_path


def crop_margin(img: Image.Image) -> Image.Image:
    """Crop margins from image"""
    try:
        width, height = img.size
        if width == 0 or height == 0:
            print("Warning: Image has zero width or height")
            return img

        data = np.array(img.convert("L"))
        data = data.astype(np.uint8)
        max_val = data.max()
        min_val = data.min()
        if max_val == min_val:
            return img
        data = (data - min_val) / (max_val - min_val) * 255
        gray = 255 * (data < 200).astype(np.uint8)

        coords = cv2.findNonZero(gray)  # Find all non-zero points (text)
        if coords is None:
            return img
        a, b, w, h = cv2.boundingRect(coords)  # Find minimum spanning bounding box

        # Ensure crop coordinates are within image bounds
        a = max(0, a)
        b = max(0, b)
        w = min(w, width - a)
        h = min(h, height - b)

        # Only crop if we have a valid region
        if w > 0 and h > 0:
            return img.crop((a, b, a + w, b + h))
        return img
    except Exception as e:
        print(f"crop_margin error: {str(e)}")
        return img  # Return original image on error

def visualize_layout(image_path, layout_results, save_path, alpha=0.3):
    """Visualize layout detection results on the image
    
    Args:
        image_path: Path to the input image, or an already loaded PIL Image
        layout_results: List of dicts with 'bbox', 'label', 'reading_order' and 'tags' keys
        save_path: Path to save the visualization
        alpha: Transparency of the overlay (0-1, lower = more transparent)
    """
    # Read image
    if isinstance(image_path, str):
        image = cv2.imread(image_path)
    else:
        # If it's already a PIL Image
        image = cv2.cvtColor(np.array(image_path), cv2.COLOR_RGB2BGR)
    
    if image is None:
        raise ValueError(f"Failed to load image from {image_path}")
    
    # Assign colors to all elements at once
    element_colors = assign_colors_to_elements(len(layout_results))
    
    # Create overlay
    overlay = image.copy()
    
    # Draw each layout element
    for idx, layout_res in enumerate(layout_results):
        if "bbox" not in layout_res:
            # Skip elements without a box instead of aborting the whole visualization
            continue
        bbox, label, reading_order, tags = layout_res["bbox"], layout_res["label"], layout_res["reading_order"], layout_res["tags"]
       
        x1,y1,x2,y2 = bbox
        
        # Get color for this element (assigned by order, not by label)
        color = element_colors[idx]
        
        # Draw filled rectangle with transparency
        cv2.rectangle(overlay, (x1,y1), (x2,y2), color, -1)
        
        # Draw border
        cv2.rectangle(image, (x1,y1), (x2,y2), color, 3)
        
        # Add label text with background at the top-left corner (outside the box)
        label_text = f"{reading_order}: {label} | {tags}"
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 0.5
        thickness = 1
        
        # Get text size
        (text_width, text_height), baseline = cv2.getTextSize(
            label_text, font, font_scale, thickness
        )
        
        # Position text above the box (outside)
        text_x = x1
        text_y = y1 - 5  # 5 pixels above the box
        
        # If text would go outside the image at the top, put it inside the box instead
        if text_y - text_height < 0:
            text_y = y1 + text_height + 5
        
        # Draw text background
        cv2.rectangle(
            image,
            (text_x - 2, text_y - text_height - 2),
            (text_x + text_width + 2, text_y + baseline + 2),
            (255, 255, 255),
            -1
        )
        
        # Draw text
        cv2.putText(
            image,
            label_text,
            (text_x, text_y),
            font,
            font_scale,
            (0, 0, 0),
            thickness
        )
    
    # Blend the overlay with the original image
    result = cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0)
    
    # Save the result
    cv2.imwrite(save_path, result)
    # print(f"Layout visualization saved to {save_path}")


def get_color_palette():
    """Get a visually pleasing color palette for layout visualization
    
    Returns:
        List of light 3-tuples that OpenCV reads as BGR; the names in the
        comments below describe each tuple read as RGB
    """
    # Carefully selected color palette with good visual distinction
    # Colors are chosen to be light, pleasant, and distinguishable
    color_palette = [
        (200, 255, 255),  # Light cyan
        (255, 200, 255),  # Light magenta
        (255, 255, 200),  # Light yellow
        (200, 255, 200),  # Light green
        (255, 220, 200),  # Light orange
        (220, 200, 255),  # Light purple
        (200, 240, 255),  # Light sky blue
        (255, 240, 220),  # Light peach
        (220, 255, 240),  # Light mint
        (255, 220, 240),  # Light pink
        (240, 255, 200),  # Light lime
        (240, 220, 255),  # Light lavender
        (200, 255, 240),  # Light turquoise
        (255, 240, 200),  # Light apricot
        (220, 240, 255),  # Light periwinkle
        (255, 200, 220),  # Light rose
        (220, 255, 220),  # Light jade
        (255, 230, 200),  # Light salmon
        (210, 230, 255),  # Light cornflower
        (255, 210, 230),  # Light carnation
    ]
    return color_palette


def assign_colors_to_elements(num_elements):
    """Assign colors to elements in order
    
    Args:
        num_elements: Number of elements to assign colors to
        
    Returns:
        List of color tuples, one for each element
    """
    palette = get_color_palette()
    colors = []
    
    for i in range(num_elements):
        # Cycle through the palette if we have more elements than colors
        color_idx = i % len(palette)
        colors.append(palette[color_idx])
    
    return colors

def resize_img(image, max_size=1600, min_size=28):
    """Downscale images whose longest side exceeds max_size, then upscale images
    whose shortest side is below min_size (aspect ratio preserved)."""
    width, height = image.size
    if max(width, height) < max_size and min(width, height) >= min_size:
        return image
    
    if max(width, height) > max_size:
        if width > height:
            new_width = max_size
            new_height = int(height * (max_size / width))
        else:
            new_height = max_size
            new_width = int(width * (max_size / height))
        image = image.resize((new_width, new_height))
        width, height = image.size
    
    if min(width, height) < min_size:
        if width < height:
            new_width = min_size
            new_height = int(height * (min_size / width))
        else:
            new_height = min_size
            new_width = int(width * (min_size / height))
        image = image.resize((new_width, new_height))

    return image


def calculate_iou_matrix(boxes):
    """Vectorized IoU matrix calculation [N, N]
    
    Args:
        boxes: List of bounding boxes in [x1, y1, x2, y2] format
        
    Returns:
        numpy.ndarray: IoU matrix of shape [N, N]
    """
    boxes = np.array(boxes)  # [N, 4]
    
    # Calculate areas
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # [N]
    
    # Broadcast to calculate intersection
    lt = np.maximum(boxes[:, None, :2], boxes[None, :, :2])  # [N, N, 2]
    rb = np.minimum(boxes[:, None, 2:], boxes[None, :, 2:])  # [N, N, 2]
    
    wh = np.clip(rb - lt, 0, None)  # [N, N, 2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N, N]
    
    # Calculate IoU
    union = areas[:, None] + areas[None, :] - inter
    iou = inter / np.clip(union, 1e-6, None)
    
    return iou


def check_bbox_overlap(layout_results_list, image, iou_threshold=0.1, overlap_box_ratio=0.25):
    """Check if bounding boxes have significant overlaps, indicating a distorted/photographed document
    
    If more than overlap_box_ratio of the boxes overlap at least one other box
    (IoU > iou_threshold), treat the page as a photographed document.
    
    Args:
        layout_results_list: List of (bbox, label, tags) tuples
        image: PIL Image object
        iou_threshold: IoU threshold to consider two boxes as overlapping (default: 0.1)
        overlap_box_ratio: Ratio threshold of boxes with overlaps (default: 0.25, i.e., 25%)
    
    Returns:
        bool: True if significant overlap detected (should treat as distorted_page)
    """
    if len(layout_results_list) <= 1:
        return False
    
    # Convert to absolute coordinates
    bboxes = []
    for bbox, label, tags in layout_results_list:
        x1, y1, x2, y2 = process_coordinates(bbox, image)
        bboxes.append([x1, y1, x2, y2])
    
    # Vectorized IoU matrix calculation
    iou_matrix = calculate_iou_matrix(bboxes)
    
    # Check if each box has overlap with any other box (excluding itself)
    overlap_mask = iou_matrix > iou_threshold
    np.fill_diagonal(overlap_mask, False)  # Exclude self
    has_overlap = overlap_mask.any(axis=1)  # Whether each box has overlap
    
    # Count boxes with overlaps
    overlap_count = has_overlap.sum()
    total_boxes = len(bboxes)
    overlap_ratio = overlap_count / total_boxes
    
    # print(f"Overlap detection: {overlap_count}/{total_boxes} boxes have overlaps (ratio: {overlap_ratio:.2%})")
    
    # If the share of overlapping boxes exceeds overlap_box_ratio, treat as photographed document
    if overlap_ratio > overlap_box_ratio:
        print(f"⚠️ High overlap detected ({overlap_ratio:.2%} > {overlap_box_ratio:.2%}), treating as distorted/photographed document")
        return True
    
    return False

if __name__ == "__main__":
    bbox_str = "[210,136,910,172][sec_0][PAIR_SEP][202,217,921,325][para][author][PAIR_SEP][520,341,604,367][para][PAIR_SEP][290,404,384,432][sec_1][paper_abstract][PAIR_SEP][156,448,520,723][para][paper_abstract][PAIR_SEP][125,740,290,768][sec_1][PAIR_SEP][125,781,552,1143][para][PAIR_SEP][125,1144,552,1400][para][RELATION_SEP][573,406,1000,561][para][PAIR_SEP][573,581,1001,943][para][PAIR_SEP][573,962,1001,1222][para][PAIR_SEP][573,1241,1001,1475][para][PAIR_SEP][126,1410,551,1470][fnote][PAIR_SEP][21,499,63,1163][watermark][meta_num]"
    print(parse_layout_string(bbox_str))