Repository: bytedance/Dolphin
Branch: master
Commit: fcbf0334a9ea
Files: 12
Total size: 85.5 KB
Directory structure:
gitextract_yl96yq_r/
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── README_CN.md
├── demo_element.py
├── demo_layout.py
├── demo_page.py
├── pyproject.toml
├── requirements.txt
└── utils/
├── markdown_utils.py
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
coverage.xml
*.mo
*.pot
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
.idea/
*.iml
# VS Code
.vscode/
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
# macOS
.DS_Store
# Windows
Thumbs.db
ehthumbs.db
Desktop.ini
fusion_result.json
kernel_meta/
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
# 1. isort - 自动排序 Python imports
- repo: https://github.com/pycqa/isort
rev: 6.0.1 # 使用固定版本号
hooks:
- id: isort
name: isort (python)
args: [--profile=black] # 与 Black 兼容的配置
language: python
# 2. Black - 自动格式化 Python 代码
- repo: https://github.com/psf/black
rev: 25.1.0 # 使用固定版本号
hooks:
- id: black
language: python
# 3. flake8 - Python 静态检查
- repo: https://github.com/pycqa/flake8
rev: 7.2.0
hooks:
- id: flake8
args: [--max-line-length=120, --ignore=E203] # 设置行长度为 120
additional_dependencies: [flake8-bugbear==24.12.12] # 可选:增强检查
# 4. pre-commit-hooks - 通用 Git 钩子
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: trailing-whitespace # 删除行尾空格
- id: end-of-file-fixer # 确保文件以换行符结束
- id: check-yaml # 验证 YAML 文件语法
- id: check-added-large-files # 阻止大文件提交
args: ["--maxkb=512"]
================================================
FILE: LICENSE
================================================
Qwen RESEARCH LICENSE AGREEMENT
Qwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024
By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
1. Definitions
a. This Qwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We" (or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
i. "Non-Commercial" shall mean for research or evaluation purposes only.
2. Grant of Rights
a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY.
b. If you are commercially using the Materials, you shall request a license from us.
3. Redistribution
You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
b. You shall cause any modified files to carry prominent notices stating that you changed the files;
c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
4. Rules of use
a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.
5. Intellectual Property
a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
6. Disclaimer of Warranty and Limitation of Liability
a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
7. Survival and Termination.
a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement.
8. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
9. Other Terms and Conditions.
a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from us, if you use the Materials in ways not expressly agreed to in this Agreement.
b. We shall not be bound by any additional or different terms or conditions communicated by you unless expressly agreed.
================================================
FILE: README.md
================================================
<div align="center">
<img src="./assets/dolphin.png" width="300">
</div>
<div align="center">
<a href="https://arxiv.org/abs/2505.14059">
<img src="https://img.shields.io/badge/Paper-arXiv-red">
</a>
<a href="https://huggingface.co/ByteDance/Dolphin-v2">
<img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
</a>
<a href="https://github.com/bytedance/Dolphin">
<img src="https://img.shields.io/badge/Code-Github-green">
</a>
<a href="https://opensource.org/licenses/MIT">
<img src="https://img.shields.io/badge/License-MIT-lightgray">
</a>
<br>
</div>
<br>
<div align="center">
<img src="./assets/demo.gif" width="800">
</div>
# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Dolphin-v2 is an enhanced universal document parsing model that substantially improves upon the original Dolphin. It seamlessly handles any document type—whether digital-born or photographed—through a document-type-aware two-stage architecture with scalable anchor prompting.
## 📑 Overview
Document image parsing is challenging due to diverse document types and complexly intertwined elements such as text paragraphs, figures, formulas, tables, and code blocks. Dolphin-v2 addresses these challenges through a document-type-aware two-stage approach:
1. **🔍 Stage 1**: Document type classification (digital vs. photographed) + layout analysis with reading order prediction
2. **🧩 Stage 2**: Hybrid parsing strategy - holistic parsing for photographed documents, parallel element-wise parsing for digital documents
<div align="center">
<img src="./assets/framework.png" width="680">
</div>
Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
<!-- ## 🚀 Demo
Try our demo on [Demo-Dolphin](https://huggingface.co/spaces/ByteDance/Dolphin). -->
## 📅 Changelog
- 🔥 **2025.12.12** Released *Dolphin-v2* model. Upgraded to 3B parameters with 21-element detection, attribute field extraction, dedicated formula/code parsing, and robust photographed document parsing. (Dolphin-1.5 moved to [v1.5 branch](https://github.com/bytedance/Dolphin/tree/v1.5))
- 🔥 **2025.10.16** Released *Dolphin-1.5* model. While maintaining the lightweight 0.3B architecture, this version achieves significant parsing improvements. (Dolphin 1.0 moved to [v1.0 branch](https://github.com/bytedance/Dolphin/tree/v1.0))
- 🔥 **2025.07.10** Released the *Fox-Page Benchmark*, a manually refined subset of the original [Fox dataset](https://github.com/ucaslcl/Fox). Download via: [Baidu Yun](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing).
- 🔥 **2025.06.30** Added [TensorRT-LLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.27** Added [vLLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.13** Added multi-page PDF document parsing capability.
- 🔥 **2025.05.21** Our demo is released at [link](http://115.190.42.15:8888/dolphin/). Check it out!
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.
- 🔥 **2025.05.16** Our paper has been accepted by ACL 2025. Paper link: [arXiv](https://arxiv.org/abs/2505.14059).
## 📈 Performance
<table style="width:90%; border-collapse: collapse; text-align: center;">
<caption>Comprehensive evaluation of document parsing on OmniDocBench (v1.5)</caption>
<thead>
<tr>
<th style="text-align: center !important;">Model</th>
<th style="text-align: center !important;">Size</th>
<th style="text-align: center !important;">Overall↑</th>
<th style="text-align: center !important;">Text<sup>Edit</sup>↓</th>
<th style="text-align: center !important;">Formula<sup>CDM</sup>↑</th>
<th style="text-align: center !important;">Table<sup>TEDS</sup>↑</th>
<th style="text-align: center !important;">Table<sup>TEDS-S</sup>↑</th>
<th style="text-align: center !important;">Read Order<sup>Edit</sup>↓</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dolphin</td>
<td>0.3B</td>
<td>74.67</td>
<td>0.125</td>
<td>67.85</td>
<td>68.70</td>
<td>77.77</td>
<td>0.124</td>
</tr>
<tr>
<td>Dolphin-1.5</td>
<td>0.3B</td>
<td>85.06</td>
<td>0.085</td>
<td>79.44</td>
<td>84.25</td>
<td>88.06</td>
<td>0.071</td>
</tr>
<tr>
<td>Dolphin-v2</td>
<td>3B</td>
<td><strong>89.78</strong></td>
<td><strong>0.054</strong></td>
<td><strong>87.63</strong></td>
<td><strong>87.02</strong></td>
<td><strong>90.48</strong></td>
<td><strong>0.054</strong></td>
</tr>
</tbody>
</table>
## 🛠️ Installation
1. Clone the repository:
```bash
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
```
2. Install the dependencies:
```bash
pip install -r requirements.txt
```
3. Download the pre-trained models of *Dolphin-v2*:
Visit our Huggingface [model card](https://huggingface.co/ByteDance/Dolphin-v2), or download model by:
```bash
# Download the model from Hugging Face Hub
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model
# Or use the Hugging Face CLI
pip install huggingface_hub
huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
```
## ⚡ Inference
Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document page into a structured JSON and Markdown format
- **Element-level Parsing**: Parse individual document elements (text, table, formula)
### 📄 Page-level Parsing
```bash
# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_1.png
# Process a single document pdf
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_6.pdf
# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs
# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs \
--max_batch_size 8
```
### 🧩 Element-level Parsing
````bash
# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
--input_path \
--element_type [table|formula|text|code]
````
### 🎨 Layout Parsing
````bash
# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_1.png \
# Process a single PDF document
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_6.pdf \
# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs
````
## 🌟 Key Features
- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration
## 📮 Notice
**Call for Bad Cases:** If you have encountered any cases where the model performs poorly, we would greatly appreciate it if you could share them in the issue. We are continuously working to optimize and improve the model.
## 💖 Acknowledgement
We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
## 📝 Citation
If you find this code useful for your research, please use the following BibTeX entry.
```bibtex
@article{feng2025dolphin,
title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
journal={arXiv preprint arXiv:2505.14059},
year={2025}
}
```
## Star History
[](https://www.star-history.com/#bytedance/Dolphin&Date)
================================================
FILE: README_CN.md
================================================
<div align="center">
<img src="./assets/dolphin.png" width="300">
</div>
<div align="center">
<a href="https://arxiv.org/abs/2505.14059">
<img src="https://img.shields.io/badge/论文-arXiv-red">
</a>
<a href="https://huggingface.co/ByteDance/Dolphin-v2">
<img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
</a>
<a href="https://github.com/bytedance/Dolphin">
<img src="https://img.shields.io/badge/代码-Github-green">
</a>
<a href="https://opensource.org/licenses/MIT">
<img src="https://img.shields.io/badge/许可证-MIT-lightgray">
</a>
<br>
</div>
<br>
<div align="center">
<img src="./assets/demo.gif" width="800">
</div>
# Dolphin: 基于异构锚点提示的文档图像解析
Dolphin(**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g)是一个创新的多模态文档图像解析模型(**0.3B**),采用"分析-解析"的两阶段范式。本仓库包含Dolphin的演示代码和预训练模型。
## 📑 概述
由于文档图像中文本段落、图表、公式和表格等元素的复杂交织,文档图像解析具有挑战性。Dolphin通过两阶段方法解决这些挑战:
1. **🔍 第一阶段**:通过按自然阅读顺序生成元素序列进行全面的页面级布局分析
2. **🧩 第二阶段**:使用异构锚点和任务特定提示高效并行解析文档元素
<div align="center">
<img src="./assets/framework.png" width="680">
</div>
Dolphin在多样化的页面级和元素级解析任务中取得了优异的性能,同时通过其轻量级架构和并行解析机制确保了卓越的效率。
## 📅 更新日志
- 🔥 **2025.12.12** *Dolphin-v2* 开源!支持 21 类元素检测、属性字段提取、代码专用解析,以及拍照文档解析。(原1.5版本已迁移至[v1.5分支](https://github.com/bytedance/Dolphin/tree/v1.5))
- 🔥 **2025.10.16** *Dolphin-1.5* 开源!在保持轻量级0.3B架构的同时,该版本实现了显著的解析性能提升。(原1.0版本已迁移至[v1.0分支](https://github.com/bytedance/Dolphin/tree/v1.0))
- 🔥 **2025.07.10** *Fox-Page* 基准测试开源。这是原始 [Fox 数据集](https://github.com/ucaslcl/Fox) 人工矫正标注后的版本。下载地址:[百度网盘](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing)。
- 🔥 **2025.06.30** 新增[TensorRT-LLM](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md)支持,提升推理速度!
- 🔥 **2025.06.27** 新增[vLLM](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md)支持,提升推理速度!
- 🔥 **2025.06.13** 新增多页PDF文档解析功能。
- 🔥 **2025.05.21** 我们的演示已在 [链接](http://115.190.42.15:8888/dolphin/) 发布。快来体验吧!
- 🔥 **2025.05.20** Dolphin的预训练模型和推理代码已发布。
- 🔥 **2025.05.16** 我们的论文已被ACL 2025接收。论文链接:[arXiv](https://arxiv.org/abs/2505.14059)。
## 📈 性能表现
<table style="width:90%; border-collapse: collapse; text-align: center;">
<caption>OmniDocBench (v1.5) 测试基准上评估结果</caption>
<thead>
<tr>
<th style="text-align: center !important;">模型</th>
<th style="text-align: center !important;">参数</th>
<th style="text-align: center !important;">总体↑</th>
<th style="text-align: center !important;">文本<sup>Edit</sup>↓</th>
<th style="text-align: center !important;">公式<sup>CDM</sup>↑</th>
<th style="text-align: center !important;">表格<sup>TEDS</sup>↑</th>
<th style="text-align: center !important;">表格<sup>TEDS-S</sup>↑</th>
<th style="text-align: center !important;">阅读顺序<sup>Edit</sup>↓</th>
</tr>
</thead>
<tr>
<td>Dolphin</td>
<td>0.3B</td>
<td>74.67</td>
<td>0.125</td>
<td>67.85</td>
<td>68.70</td>
<td>77.77</td>
<td>0.124</td>
</tr>
<tr>
<td>Dolphin-1.5</td>
<td>0.3B</td>
<td>85.06</td>
<td>0.085</td>
<td>79.44</td>
<td>84.25</td>
<td>88.06</td>
<td>0.071</td>
</tr>
<tr>
<td>Dolphin-v2</td>
<td>0.3B</td>
<td><strong>89.78</strong></td>
<td><strong>0.054</strong></td>
<td><strong>87.63</strong></td>
<td><strong>87.02</strong></td>
<td><strong>90.48</strong></td>
<td><strong>0.054</strong></td>
</tr>
</tbody>
</table>
## 🛠️ 安装
1. 克隆仓库:
```bash
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
```
2. 安装依赖:
```bash
pip install -r requirements.txt
```
3. 使用以下选项之一下载 *Dolphin-v2* 的预训练模型:
访问我们的Huggingface [模型卡片](https://huggingface.co/ByteDance/Dolphin-v2),或通过以下方式下载模型:
```bash
# 从Hugging Face Hub下载模型
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model
# 或使用Hugging Face CLI
pip install huggingface_hub
huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
```
## ⚡ 推理
Dolphin提供两个推理框架,支持两种解析粒度:
- **页面级解析**:将整个文档页面解析为结构化的JSON和Markdown格式
- **元素级解析**:解析单个文档元素(文本、表格、公式)
### 📄 页面级解析
```bash
# 处理单个文档图像
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_1.png
# 处理单个文档PDF
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_6.pdf
# 处理目录中的所有文档
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs
# 使用自定义批次大小进行并行元素解码
python demo_page.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs \
--max_batch_size 8
```
### 🧩 元素级解析
````bash
# 解析块图像 (支持块图像类型: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
--input_path \
--element_type [table|formula|text|code]
````
### 🎨 元素定位及阅读顺序解析
````bash
# 处理单个文档图像
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_1.png \
# 处理单个文档PDF
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs/page_6.pdf \
# 处理目录中的所有文档
python demo_layout.py --model_path ./hf_model --save_dir ./results \
--input_path ./demo/page_imgs
````
## 🌟 主要特性
- 🔄 基于单一VLM的两阶段分析-解析方法
- 📊 在文档解析任务上的优异性能
- 🔍 自然阅读顺序元素序列生成
- 🧩 针对不同文档元素的异构锚点提示
- ⏱️ 高效的并行解析机制
- 🤗 支持Hugging Face Transformers,便于集成
## 📮 通知
**征集不良案例:** 如果您遇到模型表现不佳的案例,我们非常欢迎您在issue中分享。我们正在持续优化和改进模型。
## 💖 致谢
我们要感谢以下开源项目为本工作提供的灵感和参考:
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
## 📝 引用
如果您在研究中发现此代码有用,请使用以下BibTeX条目。
```bibtex
@article{feng2025dolphin,
title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
journal={arXiv preprint arXiv:2505.14059},
year={2025}
}
```
## 星标历史
[](https://www.star-history.com/#bytedance/Dolphin&Date)
================================================
FILE: demo_element.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""
import argparse
import glob
import os
import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
from utils.utils import *
class DOLPHIN:
def __init__(self, model_id_or_path):
"""Initialize the Hugging Face model
Args:
model_id_or_path: Path to local model or Hugging Face model ID
"""
# Load model from local path or Hugging Face hub
self.processor = AutoProcessor.from_pretrained(model_id_or_path)
self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
self.model.eval()
# Set device and precision
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
if self.device == "cuda":
self.model = self.model.bfloat16()
else:
self.model = self.model.float()
# set tokenizer
self.tokenizer = self.processor.tokenizer
self.tokenizer.padding_side = "left"
def chat(self, prompt, image):
# Check if we're dealing with a batch
is_batch = isinstance(image, list)
if not is_batch:
# Single image, wrap it in a list for consistent processing
images = [image]
prompts = [prompt]
else:
# Batch of images
images = image
prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
assert len(images) == len(prompts)
# preprocess all images
processed_images = [resize_img(img) for img in images]
# generate all messages
all_messages = []
for img, question in zip(processed_images, prompts):
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": img,
},
{"type": "text", "text": question}
],
}
]
all_messages.append(messages)
# prepare all texts
texts = [
self.processor.apply_chat_template(
msgs, tokenize=False, add_generation_prompt=True
)
for msgs in all_messages
]
# collect all image inputs
all_image_inputs = []
all_video_inputs = None
for msgs in all_messages:
image_inputs, video_inputs = process_vision_info(msgs)
all_image_inputs.extend(image_inputs)
# prepare model inputs
inputs = self.processor(
text=texts,
images=all_image_inputs if all_image_inputs else None,
videos=all_video_inputs if all_video_inputs else None,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(self.model.device)
# inference
generated_ids = self.model.generate(
**inputs,
max_new_tokens=4096,
# repetition_penalty=1.05
)
generated_ids_trimmed = [
out_ids[len(in_ids):]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
results = self.processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
# Return a single result for single image input
if not is_batch:
return results[0]
return results
def process_element(image_path, model, element_type, save_dir=None):
"""Process a single element image (text, table, formula)
Args:
image_path: Path to the element image
model: HFModel model instance
element_type: Type of element ('text', 'table', 'formula')
save_dir: Directory to save results (default: same as input directory)
Returns:
Parsed content of the element and recognition results
"""
# Load and prepare image
pil_image = Image.open(image_path).convert("RGB")
# pil_image = crop_margin(pil_image)
# Select appropriate prompt based on element type
if element_type == "table":
prompt = "Parse the table in the image."
label = "tab"
elif element_type == "formula":
prompt = "Read formula in the image."
label = "equ"
elif element_type == "code":
prompt = "Read code in the image."
label = "code"
else: # Default to text
prompt = "Read text in the image."
label = "para"
# Process the element
result = model.chat(prompt, pil_image)
# Create recognition result in the same format as the document parser
recognition_results = [
{
"label": label,
"text": result.strip(),
}
]
# Save results if save_dir is provided
save_outputs(recognition_results, pil_image, os.path.basename(image_path).split(".")[0], save_dir)
print(f"Results saved to {save_dir}")
return result, recognition_results
def main():
parser = argparse.ArgumentParser(description="Element-level processing using DOLPHIN model")
parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
parser.add_argument("--input_path", type=str, required=True, help="Path to input image or directory of images")
parser.add_argument(
"--element_type",
type=str,
choices=["text", "table", "formula", "code"],
default="text",
help="Type of element to process (text, table, formula)",
)
parser.add_argument(
"--save_dir",
type=str,
default=None,
help="Directory to save parsing results (default: same as input directory)",
)
parser.add_argument("--print_results", action="store_true", help="Print recognition results to console")
args = parser.parse_args()
# Load Model
model = DOLPHIN(args.model_path)
# Set save directory
save_dir = args.save_dir or (
args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
)
setup_output_dirs(save_dir)
# Collect Images
if os.path.isdir(args.input_path):
image_files = []
for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
image_files = sorted(image_files)
else:
if not os.path.exists(args.input_path):
raise FileNotFoundError(f"Input path {args.input_path} does not exist")
image_files = [args.input_path]
total_samples = len(image_files)
print(f"\nTotal samples to process: {total_samples}")
# Process images one by one
for image_path in image_files:
print(f"\nProcessing {image_path}")
try:
result, recognition_result = process_element(
image_path=image_path,
model=model,
element_type=args.element_type,
save_dir=save_dir,
)
if args.print_results:
print("\nRecognition result:")
print(result)
print("-" * 40)
except Exception as e:
print(f"Error processing {image_path}: {str(e)}")
continue
if __name__ == "__main__":
main()
================================================
FILE: demo_layout.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""
import argparse
import glob
import os
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
from utils.utils import *
class DOLPHIN:
def __init__(self, model_id_or_path):
"""Initialize the Hugging Face model
Args:
model_id_or_path: Path to local model or Hugging Face model ID
"""
# Load model from local path or Hugging Face hub
self.processor = AutoProcessor.from_pretrained(model_id_or_path)
self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
self.model.eval()
# Set device and precision
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
if self.device == "cuda":
self.model = self.model.bfloat16()
else:
self.model = self.model.float()
# set tokenizer
self.tokenizer = self.processor.tokenizer
self.tokenizer.padding_side = "left"
def chat(self, prompt, image):
# Check if we're dealing with a batch
is_batch = isinstance(image, list)
if not is_batch:
# Single image, wrap it in a list for consistent processing
images = [image]
prompts = [prompt]
else:
# Batch of images
images = image
prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
assert len(images) == len(prompts)
# preprocess all images
processed_images = [resize_img(img) for img in images]
# generate all messages
all_messages = []
for img, question in zip(processed_images, prompts):
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": img,
},
{"type": "text", "text": question}
],
}
]
all_messages.append(messages)
# prepare all texts
texts = [
self.processor.apply_chat_template(
msgs, tokenize=False, add_generation_prompt=True
)
for msgs in all_messages
]
# collect all image inputs
all_image_inputs = []
all_video_inputs = None
for msgs in all_messages:
image_inputs, video_inputs = process_vision_info(msgs)
all_image_inputs.extend(image_inputs)
# prepare model inputs
inputs = self.processor(
text=texts,
images=all_image_inputs if all_image_inputs else None,
videos=all_video_inputs if all_video_inputs else None,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(self.model.device)
# inference
generated_ids = self.model.generate(
**inputs,
max_new_tokens=4096,
do_sample=False,
temperature=None,
# repetition_penalty=1.05
)
generated_ids_trimmed = [
out_ids[len(in_ids):]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
results = self.processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
# Return a single result for single image input
if not is_batch:
return results[0]
return results
def process_layout(input_path, model, save_dir):
"""Process layout detection for image or PDF
Args:
input_path: Path to input image or PDF
model: DOLPHIN model instance
save_dir: Directory to save results
"""
file_ext = os.path.splitext(input_path)[1].lower()
if file_ext == '.pdf':
# Convert PDF to images
images = convert_pdf_to_images(input_path)
if not images:
raise Exception(f"Failed to convert PDF {input_path} to images")
# Process each page
for page_idx, pil_image in enumerate(images):
print(f"\nProcessing page {page_idx + 1}/{len(images)}")
# Generate output name for this page
base_name = os.path.splitext(os.path.basename(input_path))[0]
page_name = f"{base_name}_page_{page_idx + 1:03d}"
# Process layout for this page
process_single_layout(pil_image, model, save_dir, page_name)
else:
# Process regular image file
pil_image = Image.open(input_path).convert("RGB")
base_name = os.path.splitext(os.path.basename(input_path))[0]
process_single_layout(pil_image, model, save_dir, base_name)
def process_single_layout(pil_image, model, save_dir, image_name):
"""Process layout for a single image
Args:
pil_image: PIL Image object
model: DOLPHIN model instance
save_dir: Directory to save results
image_name: Name for the output files
"""
# Parse layout
print("Parsing layout and reading order...")
layout_results = model.chat("Parse the reading order of this document.", pil_image)
# Parse the layout string
layout_results_list = parse_layout_string(layout_results)
if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")):
layout_results_list = [([0, 0, *pil_image.size], 'distorted_page', [])]
# map bbox to original image coordinates
recognition_results = []
reading_order = 0
for bbox, label, tags in layout_results_list:
x1, y1, x2, y2 = process_coordinates(bbox, pil_image)
recognition_results.append({
"label": label,
"bbox": [x1, y1, x2, y2],
"text": "", # empty for now
"reading_order": reading_order,
"tags": tags,
})
reading_order += 1
json_path = save_outputs(recognition_results, pil_image, image_name, save_dir)
def main():
parser = argparse.ArgumentParser(description="Layout detection and visualization using DOLPHIN model")
parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
parser.add_argument(
"--input_path",
type=str,
required=True,
help="Path to input image/PDF or directory of files"
)
parser.add_argument(
"--save_dir",
type=str,
default=None,
help="Directory to save results (default: same as input directory)",
)
args = parser.parse_args()
# Load Model
print("Loading model...")
model = DOLPHIN(args.model_path)
# Set save directory
save_dir = args.save_dir or (
args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
)
# Create save directory if it doesn't exist
os.makedirs(save_dir, exist_ok=True)
# Collect files
if os.path.isdir(args.input_path):
# Support both image and PDF files
file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"]
input_files = []
for ext in file_extensions:
input_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
input_files = sorted(input_files)
else:
if not os.path.exists(args.input_path):
raise FileNotFoundError(f"Input path {args.input_path} does not exist")
# Check if it's a supported file type
file_ext = os.path.splitext(args.input_path)[1].lower()
supported_exts = ['.jpg', '.jpeg', '.png', '.pdf']
if file_ext not in supported_exts:
raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}")
input_files = [args.input_path]
total_files = len(input_files)
print(f"\nTotal files to process: {total_files}")
# Process files
for file_path in input_files:
print(f"\n{'='*60}")
print(f"Processing: {file_path}")
print('='*60)
try:
process_layout(
input_path=file_path,
model=model,
save_dir=save_dir,
)
print(f"\n✓ Processing completed for {file_path}")
except Exception as e:
print(f"\n✗ Error processing {file_path}: {str(e)}")
continue
print(f"\n{'='*60}")
print(f"All processing completed. Results saved to {save_dir}")
print('='*60)
if __name__ == "__main__":
main()
================================================
FILE: demo_page.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""
import argparse
import glob
import os
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
from utils.utils import *
class DOLPHIN:
def __init__(self, model_id_or_path):
"""Initialize the Hugging Face model
Args:
model_id_or_path: Path to local model or Hugging Face model ID
"""
# Load model from local path or Hugging Face hub
self.processor = AutoProcessor.from_pretrained(model_id_or_path)
self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
self.model.eval()
# Set device and precision
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
if self.device == "cuda":
self.model = self.model.bfloat16()
else:
self.model = self.model.float()
# set tokenizer
self.tokenizer = self.processor.tokenizer
self.tokenizer.padding_side = "left"
def chat(self, prompt, image):
# Check if we're dealing with a batch
is_batch = isinstance(image, list)
if not is_batch:
# Single image, wrap it in a list for consistent processing
images = [image]
prompts = [prompt]
else:
# Batch of images
images = image
prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)
assert len(images) == len(prompts)
# preprocess all images
processed_images = [resize_img(img) for img in images]
# generate all messages
all_messages = []
for img, question in zip(processed_images, prompts):
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": img,
},
{"type": "text", "text": question}
],
}
]
all_messages.append(messages)
# prepare all texts
texts = [
self.processor.apply_chat_template(
msgs, tokenize=False, add_generation_prompt=True
)
for msgs in all_messages
]
# collect all image inputs
all_image_inputs = []
all_video_inputs = None
for msgs in all_messages:
image_inputs, video_inputs = process_vision_info(msgs)
all_image_inputs.extend(image_inputs)
# prepare model inputs
inputs = self.processor(
text=texts,
images=all_image_inputs if all_image_inputs else None,
videos=all_video_inputs if all_video_inputs else None,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(self.model.device)
# inference
generated_ids = self.model.generate(
**inputs,
max_new_tokens=4096,
do_sample=False,
temperature=None,
# repetition_penalty=1.05
)
generated_ids_trimmed = [
out_ids[len(in_ids):]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
results = self.processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
# Return a single result for single image input
if not is_batch:
return results[0]
return results
def process_document(document_path, model, save_dir, max_batch_size=None):
"""Parse documents with two stages - Handles both images and PDFs"""
file_ext = os.path.splitext(document_path)[1].lower()
if file_ext == '.pdf':
# Convert PDF to images
images = convert_pdf_to_images(document_path)
if not images:
raise Exception(f"Failed to convert PDF {document_path} to images")
all_results = []
# Process each page
for page_idx, pil_image in enumerate(images):
print(f"Processing page {page_idx + 1}/{len(images)}")
# Generate output name for this page
base_name = os.path.splitext(os.path.basename(document_path))[0]
page_name = f"{base_name}_page_{page_idx + 1:03d}"
# Process this page (don't save individual page results)
json_path, recognition_results = process_single_image(
pil_image, model, save_dir, page_name, max_batch_size, save_individual=False
)
# Add page information to results
page_results = {
"page_number": page_idx + 1,
"elements": recognition_results
}
all_results.append(page_results)
# Save combined results for multi-page PDF
combined_json_path = save_combined_pdf_results(all_results, document_path, save_dir)
return combined_json_path, all_results
else:
# Process regular image file
pil_image = Image.open(document_path).convert("RGB")
base_name = os.path.splitext(os.path.basename(document_path))[0]
return process_single_image(pil_image, model, save_dir, base_name, max_batch_size)
def process_single_image(image, model, save_dir, image_name, max_batch_size=None, save_individual=True):
"""Process a single image (either from file or converted from PDF page)
Args:
image: PIL Image object
model: DOLPHIN model instance
save_dir: Directory to save results
image_name: Name for the output file
max_batch_size: Maximum batch size for processing
save_individual: Whether to save individual results (False for PDF pages)
Returns:
Tuple of (json_path, recognition_results)
"""
# Stage 1: Page-level layout and reading order parsing
layout_output = model.chat("Parse the reading order of this document.", image)
# print(layout_output)
# Stage 2: Element-level content parsing
recognition_results = process_elements(layout_output, image, model, max_batch_size, save_dir, image_name)
# Save outputs only if requested (skip for PDF pages)
json_path = None
if save_individual:
# Create a dummy image path for save_outputs function
json_path = save_outputs(recognition_results, image, image_name, save_dir)
return json_path, recognition_results
def process_elements(layout_results, image, model, max_batch_size, save_dir=None, image_name=None):
"""Parse all document elements with parallel decoding"""
layout_results_list = parse_layout_string(layout_results)
if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")):
layout_results_list = [([0, 0, *image.size], 'distorted_page', [])]
# Check for bbox overlap - if too many overlaps, treat as distorted page
elif len(layout_results_list) > 1 and check_bbox_overlap(layout_results_list, image):
print("Falling back to distorted_page mode due to high bbox overlap")
layout_results_list = [([0, 0, *image.size], 'distorted_page', [])]
tab_elements = []
equ_elements = []
code_elements = []
text_elements = []
figure_results = []
reading_order = 0
# Collect elements and group
for bbox, label, tags in layout_results_list:
try:
if label == "distorted_page":
x1, y1, x2, y2 = 0, 0, *image.size
pil_crop = image
else:
# get coordinates in the original image
x1, y1, x2, y2 = process_coordinates(bbox, image)
# crop the image
pil_crop = image.crop((x1, y1, x2, y2))
if pil_crop.size[0] > 3 and pil_crop.size[1] > 3:
if label == "fig":
figure_filename = save_figure_to_local(pil_crop, save_dir, image_name, reading_order)
figure_results.append({
"label": label,
"text": f"",
"figure_path": f"figures/{figure_filename}",
"bbox": [x1, y1, x2, y2],
"reading_order": reading_order,
"tags": tags,
})
else:
# Prepare element information
element_info = {
"crop": pil_crop,
"label": label,
"bbox": [x1, y1, x2, y2],
"reading_order": reading_order,
"tags": tags,
}
if label == "tab":
tab_elements.append(element_info)
elif label == "equ":
equ_elements.append(element_info)
elif label == "code":
code_elements.append(element_info)
else:
text_elements.append(element_info)
reading_order += 1
except Exception as e:
print(f"Error processing bbox with label {label}: {str(e)}")
continue
recognition_results = figure_results.copy()
if tab_elements:
results = process_element_batch(tab_elements, model, "Parse the table in the image.", max_batch_size)
recognition_results.extend(results)
if equ_elements:
results = process_element_batch(equ_elements, model, "Read formula in the image.", max_batch_size)
recognition_results.extend(results)
if code_elements:
results = process_element_batch(code_elements, model, "Read code in the image.", max_batch_size)
recognition_results.extend(results)
if text_elements:
results = process_element_batch(text_elements, model, "Read text in the image.", max_batch_size)
recognition_results.extend(results)
recognition_results.sort(key=lambda x: x.get("reading_order", 0))
return recognition_results
def process_element_batch(elements, model, prompt, max_batch_size=None):
"""Process elements of the same type in batches"""
results = []
# Determine batch size
batch_size = len(elements)
if max_batch_size is not None and max_batch_size > 0:
batch_size = min(batch_size, max_batch_size)
# Process in batches
for i in range(0, len(elements), batch_size):
batch_elements = elements[i:i+batch_size]
crops_list = [elem["crop"] for elem in batch_elements]
# Use the same prompt for all elements in the batch
prompts_list = [prompt] * len(crops_list)
# Batch inference
batch_results = model.chat(prompts_list, crops_list)
# Add results
for j, result in enumerate(batch_results):
elem = batch_elements[j]
results.append({
"label": elem["label"],
"bbox": elem["bbox"],
"text": result.strip(),
"reading_order": elem["reading_order"],
"tags": elem["tags"],
})
return results
def main():
parser = argparse.ArgumentParser(description="Document parsing based on DOLPHIN")
parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
parser.add_argument("--input_path", type=str, default="./demo", help="Path to input image/PDF or directory of files")
parser.add_argument(
"--save_dir",
type=str,
default=None,
help="Directory to save parsing results (default: same as input directory)",
)
parser.add_argument(
"--max_batch_size",
type=int,
default=4,
help="Maximum number of document elements to parse in a single batch (default: 4)",
)
args = parser.parse_args()
# Load Model
model = DOLPHIN(args.model_path)
# Collect Document Files (images and PDFs)
if os.path.isdir(args.input_path):
# Support both image and PDF files
file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"]
document_files = []
for ext in file_extensions:
document_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
document_files = sorted(document_files)
else:
if not os.path.exists(args.input_path):
raise FileNotFoundError(f"Input path {args.input_path} does not exist")
# Check if it's a supported file type
file_ext = os.path.splitext(args.input_path)[1].lower()
supported_exts = ['.jpg', '.jpeg', '.png', '.pdf']
if file_ext not in supported_exts:
raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}")
document_files = [args.input_path]
save_dir = args.save_dir or (
args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
)
setup_output_dirs(save_dir)
total_samples = len(document_files)
print(f"\nTotal files to process: {total_samples}")
# Process All Document Files
for file_path in document_files:
print(f"\nProcessing {file_path}")
try:
json_path, recognition_results = process_document(
document_path=file_path,
model=model,
save_dir=save_dir,
max_batch_size=args.max_batch_size,
)
print(f"Processing completed. Results saved to {save_dir}")
except Exception as e:
print(f"Error processing {file_path}: {str(e)}")
continue
if __name__ == "__main__":
main()
================================================
FILE: pyproject.toml
================================================
[tool.black]
line-length = 120
include = '\.pyi?$'
exclude = '''
/(
\.git
| \.hg
| \.mypy_cache
| \.tox
| \.venv
| _build
| buck-out
| build
| dist
)/
'''
================================================
FILE: requirements.txt
================================================
datasets==3.6.0
torch==2.6.0
torchvision==0.21.0
transformers==4.51.0
deepspeed==0.16.4
triton==3.2.0
accelerate==1.4.0
torchcodec==0.2
decord==0.6.0
Levenshtein==0.27.1
qwen_vl_utils
matplotlib
jieba
opencv-python
bs4
albumentations==1.4.0
pymupdf==1.26
================================================
FILE: utils/markdown_utils.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""
import re
import base64
from typing import List, Dict, Any, Optional
def extract_table_from_html(html_string):
"""Extract and clean table tags from HTML string"""
try:
table_pattern = re.compile(r'<table.*?>.*?</table>', re.DOTALL)
tables = table_pattern.findall(html_string)
tables = [re.sub(r'<table[^>]*>', '<table>', table) for table in tables]
return '\n'.join(tables)
except Exception as e:
print(f"extract_table_from_html error: {str(e)}")
return f"<table><tr><td>Error extracting table: {str(e)}</td></tr></table>"
class MarkdownConverter:
"""Convert structured recognition results to Markdown format"""
def __init__(self):
# Define heading levels for different section types
self.heading_levels = {
'sec_0': '#',
'sec_1': '##',
'sec_2': '###',
'sec_3': '###',
'sec_4': '###',
'sec_5': '###',
}
# Define which labels need special handling
self.special_labels = {
'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5',
'list', 'equ', 'tab', 'fig'
}
# Define replacements for special formulas
self.replace_dict = {
'\\bm': '\mathbf ',
'\eqno': '\quad ',
'\quad': '\quad ',
'\leq': '\leq ',
'\pm': '\pm ',
'\\varmathbb': '\mathbb ',
'\in fty': '\infty',
'\mu': '\mu ',
'\cdot': '\cdot ',
'\langle': '\langle ',
'\pm': '\pm '
}
def try_remove_newline(self, text: str) -> str:
try:
# Preprocess text to handle line breaks
text = text.strip()
text = text.replace('-\n', '')
# Handle Chinese text line breaks
def is_chinese(char):
return '\u4e00' <= char <= '\u9fff'
lines = text.split('\n')
processed_lines = []
# Process all lines except the last one
for i in range(len(lines)-1):
current_line = lines[i].strip()
next_line = lines[i+1].strip()
# Always add the current line, but determine if we need a newline
if current_line: # If current line is not empty
if next_line: # If next line is not empty
# For Chinese text handling
if is_chinese(current_line[-1]) and is_chinese(next_line[0]):
processed_lines.append(current_line)
else:
processed_lines.append(current_line + ' ')
else:
# Next line is empty, add current line with newline
processed_lines.append(current_line + '\n')
else:
# Current line is empty, add an empty line
processed_lines.append('\n')
# Add the last line
if lines and lines[-1].strip():
processed_lines.append(lines[-1].strip())
text = ''.join(processed_lines)
return text
except Exception as e:
print(f"try_remove_newline error: {str(e)}")
return text # Return original text on error
def _handle_text(self, text: str) -> str:
"""
Process regular text content, preserving paragraph structure
"""
try:
if not text:
return ""
# Process formulas in text before handling other text processing
text = self._process_formulas_in_text(text)
text = self.try_remove_newline(text)
return text
except Exception as e:
print(f"_handle_text error: {str(e)}")
return text # Return original text on error
def _process_formulas_in_text(self, text: str) -> str:
"""
Process mathematical formulas in text by iteratively finding and replacing formulas.
- Identify inline and block formulas
- Replace newlines within formulas with \\
"""
try:
text = text.replace(r'\upmu', r'\mu')
for key, value in self.replace_dict.items():
text = text.replace(key, value)
return text
except Exception as e:
print(f"_process_formulas_in_text error: {str(e)}")
return text # Return original text on error
def _remove_newline_in_heading(self, text: str) -> str:
"""
Remove newline in heading
"""
try:
# Handle Chinese text line breaks
def is_chinese(char):
return '\u4e00' <= char <= '\u9fff'
# Check if the text contains Chinese characters
if any(is_chinese(char) for char in text):
return text.replace('\n', '')
else:
return text.replace('\n', ' ')
except Exception as e:
print(f"_remove_newline_in_heading error: {str(e)}")
return text
def _handle_heading(self, text: str, label: str) -> str:
"""
Convert section headings to appropriate markdown format
"""
try:
level = self.heading_levels.get(label, '#')
text = text.strip()
text = self._remove_newline_in_heading(text)
text = self._handle_text(text)
return f"{level} {text}\n\n"
except Exception as e:
print(f"_handle_heading error: {str(e)}")
return f"# Error processing heading: {text}\n\n"
def _handle_list_item(self, text: str) -> str:
"""
Convert list items to markdown list format
"""
try:
return f"- {text.strip()}\n"
except Exception as e:
print(f"_handle_list_item error: {str(e)}")
return f"- Error processing list item: {text}\n"
def _handle_figure(self, text: str, section_count: int) -> str:
"""
Handle figure content
"""
try:
# Check if it's a file path starting with "figures/"
if text.startswith("figures/"):
# Convert to relative path from markdown directory to figures directory
relative_path = f"../{text}"
return f"\n\n"
# Check if it's already a markdown format image link
if text.startswith("\n\n"
elif ";" in text and "," in text:
return f"\n\n"
else:
# Assume it's raw base64, convert to data URI
img_format = "png"
data_uri = f"data:image/{img_format};base64,{text}"
return f"\n\n"
except Exception as e:
print(f"_handle_figure error: {str(e)}")
return f"*[Error processing figure: {str(e)}]*\n\n"
def _handle_table(self, text: str) -> str:
"""
Convert table content to markdown format
"""
try:
markdown_content = []
markdown_table = extract_table_from_html(text)
markdown_content.append(markdown_table + "\n")
return '\n'.join(markdown_content) + '\n\n'
except Exception as e:
print(f"_handle_table error: {str(e)}")
return f"*[Error processing table: {str(e)}]*\n\n"
def _handle_formula(self, text: str) -> str:
"""
Handle formula-specific content
"""
try:
text = text.strip('$').rstrip("\ ").replace(r'\upmu', r'\mu')
for key, value in self.replace_dict.items():
text = text.replace(key, value)
processed_text = '$$' + text + '$$'
return f"{processed_text}\n\n"
except Exception as e:
print(f"_handle_formula error: {str(e)}")
return f"*[Error processing formula: {str(e)}]*\n\n"
def convert(self, recognition_results: List[Dict[str, Any]]) -> str:
"""
Convert recognition results to markdown format
"""
try:
markdown_content = []
for section_count, result in enumerate(recognition_results):
try:
label = result.get('label', '')
text = result.get('text', '').strip()
# Skip empty text
if not text:
continue
# Handle different content types
if label in {'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5'}:
markdown_content.append(self._handle_heading(text, label))
elif label == 'fig':
markdown_content.append(self._handle_figure(text, section_count))
elif label == 'tab':
markdown_content.append(self._handle_table(text))
elif label == 'equ':
markdown_content.append(self._handle_formula(text))
elif label == 'list':
markdown_content.append(self._handle_list_item(text))
elif label == 'code':
markdown_content.append(f"```bash\n{text}\n```\n\n")
else:
# Handle regular text (paragraphs, etc.)
processed_text = self._handle_text(text)
markdown_content.append(f"{processed_text}\n\n")
# TODO: distoraged page
except Exception as e:
print(f"Error processing item {section_count}: {str(e)}")
# Add a placeholder for the failed item
markdown_content.append(f"*[Error processing content]*\n\n")
# Join all content and apply post-processing
result = ''.join(markdown_content)
return result
except Exception as e:
print(f"convert error: {str(e)}")
return f"Error generating markdown content: {str(e)}"
================================================
FILE: utils/utils.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""
import io
import json
import os
import re
from dataclasses import dataclass
from typing import List, Tuple
import cv2
import numpy as np
import pymupdf
from PIL import Image
from qwen_vl_utils import smart_resize
from utils.markdown_utils import MarkdownConverter
def save_figure_to_local(pil_crop, save_dir, image_name, reading_order):
"""Save cropped figure to local file system
Args:
pil_crop: PIL Image object of the cropped figure
save_dir: Base directory to save results
image_name: Name of the source image/document
reading_order: Reading order of the figure in the document
Returns:
str: Filename of the saved figure
"""
try:
# Create figures directory if it doesn't exist
figures_dir = os.path.join(save_dir, "markdown", "figures")
# os.makedirs(figures_dir, exist_ok=True)
# Generate figure filename
figure_filename = f"{image_name}_figure_{reading_order:03d}.png"
figure_path = os.path.join(figures_dir, figure_filename)
# Save the figure
pil_crop.save(figure_path, format="PNG", quality=95)
# print(f"Saved figure: {figure_filename}")
return figure_filename
except Exception as e:
print(f"Error saving figure: {str(e)}")
# Return a fallback filename
return f"{image_name}_figure_{reading_order:03d}_error.png"
def convert_pdf_to_images(pdf_path, target_size=896):
"""Convert PDF pages to images
Args:
pdf_path: Path to PDF file
target_size: Target size for the longest dimension
Returns:
List of PIL Images
"""
images = []
try:
doc = pymupdf.open(pdf_path)
for page_num in range(len(doc)):
page = doc[page_num]
# Calculate scale to make longest dimension equal to target_size
rect = page.rect
scale = target_size / max(rect.width, rect.height)
# Render page as image
mat = pymupdf.Matrix(scale, scale)
pix = page.get_pixmap(matrix=mat)
# Convert to PIL Image
img_data = pix.tobytes("png")
pil_image = Image.open(io.BytesIO(img_data))
images.append(pil_image)
doc.close()
print(f"Successfully converted {len(images)} pages from PDF")
return images
except Exception as e:
print(f"Error converting PDF to images: {str(e)}")
return []
def save_combined_pdf_results(all_page_results, pdf_path, save_dir):
"""Save combined results for multi-page PDF with both JSON and Markdown
Args:
all_page_results: List of results for all pages
pdf_path: Path to original PDF file
save_dir: Directory to save results
Returns:
Path to saved combined JSON file
"""
# Create output filename based on PDF name
base_name = os.path.splitext(os.path.basename(pdf_path))[0]
# Prepare combined results
combined_results = {"source_file": pdf_path, "total_pages": len(all_page_results), "pages": all_page_results}
# Save combined JSON results
json_filename = f"{base_name}.json"
json_path = os.path.join(save_dir, "recognition_json", json_filename)
os.makedirs(os.path.dirname(json_path), exist_ok=True)
with open(json_path, "w", encoding="utf-8") as f:
json.dump(combined_results, f, indent=2, ensure_ascii=False)
# Generate and save combined markdown
try:
markdown_converter = MarkdownConverter()
# Combine all page results into a single list for markdown conversion
all_elements = []
for page_data in all_page_results:
page_elements = page_data.get("elements", [])
if page_elements:
# Add page separator if not the first page
if all_elements:
all_elements.append(
{"label": "page_separator", "text": f"\n\n---\n\n", "reading_order": len(all_elements)}
)
all_elements.extend(page_elements)
# Generate markdown content
markdown_content = markdown_converter.convert(all_elements)
# Save markdown file
markdown_filename = f"{base_name}.md"
markdown_path = os.path.join(save_dir, "markdown", markdown_filename)
os.makedirs(os.path.dirname(markdown_path), exist_ok=True)
with open(markdown_path, "w", encoding="utf-8") as f:
f.write(markdown_content)
# print(f"Combined markdown saved to: {markdown_path}")
except ImportError:
print("MarkdownConverter not available, skipping markdown generation")
except Exception as e:
print(f"Error generating markdown: {e}")
# print(f"Combined JSON results saved to: {json_path}")
return json_path
def extract_labels_from_string(text):
"""
from [202,217,921,325][para][author] extract para and author
"""
all_matches = re.findall(r'\[([^\]]+)\]', text)
labels = []
for match in all_matches:
if not re.match(r'^\d+,\d+,\d+,\d+$', match):
labels.append(match)
return labels
def parse_layout_string(bbox_str):
"""
Dolphin-V1.5 layout string parsing function
Parse layout string to extract bbox and category information
Supports multiple formats:
1. Original format: [x1,y1,x2,y2] label
2. New format: [x1,y1,x2,y2][label][PAIR_SEP] or [x1,y1,x2,y2][label][meta_info][PAIR_SEP]
"""
parsed_results = []
segments = bbox_str.split('[PAIR_SEP]')
new_segments = []
for seg in segments:
new_segments.extend(seg.split('[RELATION_SEP]'))
segments = new_segments
for segment in segments:
segment = segment.strip()
if not segment:
continue
coord_pattern = r'\[(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+)\]'
coord_match = re.search(coord_pattern, segment)
label_matches = extract_labels_from_string(segment)
if coord_match and label_matches:
coords = [float(coord_match.group(i)) for i in range(1, 5)]
label = label_matches[0].strip()
parsed_results.append((coords, label, label_matches[1:])) # label_matches[1:] 是 tags
return parsed_results
def process_coordinates(coords, pil_image):
original_w, original_h = pil_image.size[:2]
# use the same resize logic as the model
resized_pil = resize_img(pil_image)
resized_image = np.array(resized_pil)
resized_h, resized_w = resized_image.shape[:2]
resized_h, resized_w = smart_resize(resized_h, resized_w, factor=28, min_pixels=784, max_pixels=2560000)
w_ratio, h_ratio = original_w / resized_w, original_h / resized_h
x1 = int(coords[0] * w_ratio)
y1 = int(coords[1] * h_ratio)
x2 = int(coords[2] * w_ratio)
y2 = int(coords[3] * h_ratio)
x1 = max(0, min(x1, original_w - 1))
y1 = max(0, min(y1, original_h - 1))
x2 = max(x1 + 1, min(x2, original_w))
y2 = max(y1 + 1, min(y2, original_h))
return x1, y1, x2, y2
def setup_output_dirs(save_dir):
"""Create necessary output directories"""
os.makedirs(save_dir, exist_ok=True)
os.makedirs(os.path.join(save_dir, "markdown"), exist_ok=True)
os.makedirs(os.path.join(save_dir, "output_json"), exist_ok=True)
os.makedirs(os.path.join(save_dir, "markdown", "figures"), exist_ok=True)
os.makedirs(os.path.join(save_dir, "layout_visualization"), exist_ok=True)
def save_outputs(recognition_results, image, image_name, save_dir):
"""Save JSON and markdown outputs"""
# Save JSON file
json_path = os.path.join(save_dir, "output_json", f"{image_name}.json")
with open(json_path, "w", encoding="utf-8") as f:
json.dump(recognition_results, f, ensure_ascii=False, indent=2)
# Generate and save markdown file
markdown_converter = MarkdownConverter()
markdown_content = markdown_converter.convert(recognition_results)
markdown_path = os.path.join(save_dir, "markdown", f"{image_name}.md")
with open(markdown_path, "w", encoding="utf-8") as f:
f.write(markdown_content)
# visualize layout
# Save visualization (pass original PIL image for coordinate mapping)
vis_path = os.path.join(save_dir, "layout_visualization", f"{image_name}_layout.png")
visualize_layout(image, recognition_results, vis_path)
return json_path
def crop_margin(img: Image.Image) -> Image.Image:
"""Crop margins from image"""
try:
width, height = img.size
if width == 0 or height == 0:
print("Warning: Image has zero width or height")
return img
data = np.array(img.convert("L"))
data = data.astype(np.uint8)
max_val = data.max()
min_val = data.min()
if max_val == min_val:
return img
data = (data - min_val) / (max_val - min_val) * 255
gray = 255 * (data < 200).astype(np.uint8)
coords = cv2.findNonZero(gray) # Find all non-zero points (text)
if coords is None:
return img
a, b, w, h = cv2.boundingRect(coords) # Find minimum spanning bounding box
# Ensure crop coordinates are within image bounds
a = max(0, a)
b = max(0, b)
w = min(w, width - a)
h = min(h, height - b)
# Only crop if we have a valid region
if w > 0 and h > 0:
return img.crop((a, b, a + w, b + h))
return img
except Exception as e:
print(f"crop_margin error: {str(e)}")
return img # Return original image on error
def visualize_layout(image_path, layout_results, save_path, alpha=0.3):
"""Visualize layout detection results on the image
Args:
image_path: Path to the input image
layout_results: List of (bbox, label, tags) dict
save_path: Path to save the visualization
alpha: Transparency of the overlay (0-1, lower = more transparent)
"""
# Read image
if isinstance(image_path, str):
image = cv2.imread(image_path)
else:
# If it's already a PIL Image
image = cv2.cvtColor(np.array(image_path), cv2.COLOR_RGB2BGR)
if image is None:
raise ValueError(f"Failed to load image from {image_path}")
# Assign colors to all elements at once
element_colors = assign_colors_to_elements(len(layout_results))
# Create overlay
overlay = image.copy()
# Draw each layout element
for idx, layout_res in enumerate(layout_results):
if "bbox" not in layout_res:
return
bbox, label, reading_order, tags = layout_res["bbox"], layout_res["label"], layout_res["reading_order"], layout_res["tags"]
x1,y1,x2,y2 = bbox
# Get color for this element (assigned by order, not by label)
color = element_colors[idx]
# Draw filled rectangle with transparency
cv2.rectangle(overlay, (x1,y1), (x2,y2), color, -1)
# Draw border
cv2.rectangle(image, (x1,y1), (x2,y2), color, 3)
# Add label text with background at the top-left corner (outside the box)
label_text = f"{reading_order}: {label} | {tags}"
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = 0.5
thickness = 1
# Get text size
(text_width, text_height), baseline = cv2.getTextSize(
label_text, font, font_scale, thickness
)
# Position text above the box (outside)
text_x = x1
text_y = y1 - 5 # 5 pixels above the box
# If text would go outside the image at the top, put it inside the box instead
if text_y - text_height < 0:
text_y = y1 + text_height + 5
# Draw text background
cv2.rectangle(
image,
(text_x - 2, text_y - text_height - 2),
(text_x + text_width + 2, text_y + baseline + 2),
(255, 255, 255),
-1
)
# Draw text
cv2.putText(
image,
label_text,
(text_x, text_y),
font,
font_scale,
(0, 0, 0),
thickness
)
# Blend the overlay with the original image
result = cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0)
# Save the result
cv2.imwrite(save_path, result)
# print(f"Layout visualization saved to {save_path}")
def get_color_palette():
"""Get a visually pleasing color palette for layout visualization
Returns:
List of BGR color tuples (semi-transparent, good for overlay)
"""
# Carefully selected color palette with good visual distinction
# Colors are chosen to be light, pleasant, and distinguishable
color_palette = [
(200, 255, 255), # Light cyan
(255, 200, 255), # Light magenta
(255, 255, 200), # Light yellow
(200, 255, 200), # Light green
(255, 220, 200), # Light orange
(220, 200, 255), # Light purple
(200, 240, 255), # Light sky blue
(255, 240, 220), # Light peach
(220, 255, 240), # Light mint
(255, 220, 240), # Light pink
(240, 255, 200), # Light lime
(240, 220, 255), # Light lavender
(200, 255, 240), # Light turquoise
(255, 240, 200), # Light apricot
(220, 240, 255), # Light periwinkle
(255, 200, 220), # Light rose
(220, 255, 220), # Light jade
(255, 230, 200), # Light salmon
(210, 230, 255), # Light cornflower
(255, 210, 230), # Light carnation
]
return color_palette
def assign_colors_to_elements(num_elements):
"""Assign colors to elements in order
Args:
num_elements: Number of elements to assign colors to
Returns:
List of color tuples, one for each element
"""
palette = get_color_palette()
colors = []
for i in range(num_elements):
# Cycle through the palette if we have more elements than colors
color_idx = i % len(palette)
colors.append(palette[color_idx])
return colors
def resize_img(image, max_size=1600, min_size=28):
width, height = image.size
if max(width, height) < max_size and min(width, height) >= 28:
return image
if max(width, height) > max_size:
if width > height:
new_width = max_size
new_height = int(height * (max_size / width))
else:
new_height = max_size
new_width = int(width * (max_size / height))
image = image.resize((new_width, new_height))
width, height = image.size
if min(width, height) < 28:
if width < height:
new_width = min_size
new_height = int(height * (min_size / width))
else:
new_height = min_size
new_width = int(width * (min_size / height))
image = image.resize((new_width, new_height))
return image
def calculate_iou_matrix(boxes):
"""Vectorized IoU matrix calculation [N, N]
Args:
boxes: List of bounding boxes in [x1, y1, x2, y2] format
Returns:
numpy.ndarray: IoU matrix of shape [N, N]
"""
boxes = np.array(boxes) # [N, 4]
# Calculate areas
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) # [N]
# Broadcast to calculate intersection
lt = np.maximum(boxes[:, None, :2], boxes[None, :, :2]) # [N, N, 2]
rb = np.minimum(boxes[:, None, 2:], boxes[None, :, 2:]) # [N, N, 2]
wh = np.clip(rb - lt, 0, None) # [N, N, 2]
inter = wh[:, :, 0] * wh[:, :, 1] # [N, N]
# Calculate IoU
union = areas[:, None] + areas[None, :] - inter
iou = inter / np.clip(union, 1e-6, None)
return iou
def check_bbox_overlap(layout_results_list, image, iou_threshold=0.1, overlap_box_ratio=0.25):
"""Check if bounding boxes have significant overlaps, indicating a distorted/photographed document
If more than 60% of boxes have overlaps (IoU > threshold with at least 1 other box),
treat as photographed document.
Args:
layout_results_list: List of (bbox, label, tags) tuples
image: PIL Image object
iou_threshold: IoU threshold to consider two boxes as overlapping (default: 0.3)
overlap_box_ratio: Ratio threshold of boxes with overlaps (default: 0.6, i.e., 60%)
Returns:
bool: True if significant overlap detected (should treat as distorted_page)
"""
if len(layout_results_list) <= 1:
return False
# Convert to absolute coordinates
bboxes = []
for bbox, label, tags in layout_results_list:
x1, y1, x2, y2 = process_coordinates(bbox, image)
bboxes.append([x1, y1, x2, y2])
# Vectorized IoU matrix calculation
iou_matrix = calculate_iou_matrix(bboxes)
# Check if each box has overlap with any other box (excluding itself)
overlap_mask = iou_matrix > iou_threshold
np.fill_diagonal(overlap_mask, False) # Exclude self
has_overlap = overlap_mask.any(axis=1) # Whether each box has overlap
# Count boxes with overlaps
overlap_count = has_overlap.sum()
total_boxes = len(bboxes)
overlap_ratio = overlap_count / total_boxes
# print(f"Overlap detection: {overlap_count}/{total_boxes} boxes have overlaps (ratio: {overlap_ratio:.2%})")
# If more than 60% boxes have overlaps, treat as photographed document
if overlap_ratio > overlap_box_ratio:
print(f"⚠️ High overlap detected ({overlap_ratio:.2%} > {overlap_box_ratio:.2%}), treating as distorted/photographed document")
return True
return False
if __name__ == "__main__":
bbox_str = "[210,136,910,172][sec_0][PAIR_SEP][202,217,921,325][para][author][PAIR_SEP][520,341,604,367][para][PAIR_SEP][290,404,384,432][sec_1][paper_abstract][PAIR_SEP][156,448,520,723][para][paper_abstract][PAIR_SEP][125,740,290,768][sec_1][PAIR_SEP][125,781,552,1143][para][PAIR_SEP][125,1144,552,1400][para][RELATION_SEP][573,406,1000,561][para][PAIR_SEP][573,581,1001,943][para][PAIR_SEP][573,962,1001,1222][para][PAIR_SEP][573,1241,1001,1475][para][PAIR_SEP][126,1410,551,1470][fnote][PAIR_SEP][21,499,63,1163][watermark][meta_num]"
print(parse_layout_string(bbox_str))
gitextract_yl96yq_r/
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── README_CN.md
├── demo_element.py
├── demo_layout.py
├── demo_page.py
├── pyproject.toml
├── requirements.txt
└── utils/
├── markdown_utils.py
└── utils.py
SYMBOL INDEX (47 symbols across 5 files)
FILE: demo_element.py
class DOLPHIN (line 19) | class DOLPHIN:
method __init__ (line 20) | def __init__(self, model_id_or_path):
method chat (line 44) | def chat(self, prompt, image):
function process_element (line 121) | def process_element(image_path, model, element_type, save_dir=None):
function main (line 169) | def main():
FILE: demo_layout.py
class DOLPHIN (line 18) | class DOLPHIN:
method __init__ (line 19) | def __init__(self, model_id_or_path):
method chat (line 43) | def chat(self, prompt, image):
function process_layout (line 128) | def process_layout(input_path, model, save_dir):
function process_single_layout (line 162) | def process_single_layout(pil_image, model, save_dir, image_name):
function main (line 196) | def main():
FILE: demo_page.py
class DOLPHIN (line 18) | class DOLPHIN:
method __init__ (line 19) | def __init__(self, model_id_or_path):
method chat (line 43) | def chat(self, prompt, image):
function process_document (line 127) | def process_document(document_path, model, save_dir, max_batch_size=None):
function process_single_image (line 171) | def process_single_image(image, model, save_dir, image_name, max_batch_s...
function process_elements (line 201) | def process_elements(layout_results, image, model, max_batch_size, save_...
function process_element_batch (line 289) | def process_element_batch(elements, model, prompt, max_batch_size=None):
function main (line 323) | def main():
FILE: utils/markdown_utils.py
function extract_table_from_html (line 11) | def extract_table_from_html(html_string):
class MarkdownConverter (line 23) | class MarkdownConverter:
method __init__ (line 26) | def __init__(self):
method try_remove_newline (line 58) | def try_remove_newline(self, text: str) -> str:
method _handle_text (line 102) | def _handle_text(self, text: str) -> str:
method _process_formulas_in_text (line 118) | def _process_formulas_in_text(self, text: str) -> str:
method _remove_newline_in_heading (line 134) | def _remove_newline_in_heading(self, text: str) -> str:
method _handle_heading (line 153) | def _handle_heading(self, text: str, label: str) -> str:
method _handle_list_item (line 168) | def _handle_list_item(self, text: str) -> str:
method _handle_figure (line 178) | def _handle_figure(self, text: str, section_count: int) -> str:
method _handle_table (line 209) | def _handle_table(self, text: str) -> str:
method _handle_formula (line 223) | def _handle_formula(self, text: str) -> str:
method convert (line 238) | def convert(self, recognition_results: List[Dict[str, Any]]) -> str:
FILE: utils/utils.py
function save_figure_to_local (line 21) | def save_figure_to_local(pil_crop, save_dir, image_name, reading_order):
function convert_pdf_to_images (line 54) | def convert_pdf_to_images(pdf_path, target_size=896):
function save_combined_pdf_results (line 93) | def save_combined_pdf_results(all_page_results, pdf_path, save_dir):
function extract_labels_from_string (line 156) | def extract_labels_from_string(text):
function parse_layout_string (line 170) | def parse_layout_string(bbox_str):
function process_coordinates (line 202) | def process_coordinates(coords, pil_image):
function setup_output_dirs (line 223) | def setup_output_dirs(save_dir):
function save_outputs (line 232) | def save_outputs(recognition_results, image, image_name, save_dir):
function crop_margin (line 255) | def crop_margin(img: Image.Image) -> Image.Image:
function visualize_layout (line 291) | def visualize_layout(image_path, layout_results, save_path, alpha=0.3):
function get_color_palette (line 380) | def get_color_palette():
function assign_colors_to_elements (line 413) | def assign_colors_to_elements(num_elements):
function resize_img (line 432) | def resize_img(image, max_size=1600, min_size=28):
function calculate_iou_matrix (line 459) | def calculate_iou_matrix(boxes):
function check_bbox_overlap (line 487) | def check_bbox_overlap(layout_results_list, image, iou_threshold=0.1, ov...
Condensed preview — 12 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (92K chars).
[
{
"path": ".gitignore",
"chars": 2044,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".pre-commit-config.yaml",
"chars": 981,
"preview": "repos:\n # 1. isort - 自动排序 Python imports\n - repo: https://github.com/pycqa/isort\n rev: 6.0.1 # 使用固定版本号\n hooks:\n"
},
{
"path": "LICENSE",
"chars": 7378,
"preview": "Qwen RESEARCH LICENSE AGREEMENT\n\nQwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024\n\nBy clicking to agree "
},
{
"path": "README.md",
"chars": 9409,
"preview": "<div align=\"center\">\n <img src=\"./assets/dolphin.png\" width=\"300\">\n</div>\n\n<div align=\"center\">\n <a href=\"https://arxi"
},
{
"path": "README_CN.md",
"chars": 6909,
"preview": "<div align=\"center\">\n <img src=\"./assets/dolphin.png\" width=\"300\">\n</div>\n\n<div align=\"center\">\n <a href=\"https://arxi"
},
{
"path": "demo_element.py",
"chars": 7678,
"preview": "\"\"\"\nCopyright (c) 2025 Bytedance Ltd. and/or its affiliates\nSPDX-License-Identifier: MIT\n\"\"\"\n\nimport argparse\nimport glo"
},
{
"path": "demo_layout.py",
"chars": 9051,
"preview": "\"\"\" \nCopyright (c) 2025 Bytedance Ltd. and/or its affiliates\nSPDX-License-Identifier: MIT\n\"\"\"\n\nimport argparse\nimport gl"
},
{
"path": "demo_page.py",
"chars": 14266,
"preview": "\"\"\" \nCopyright (c) 2025 Bytedance Ltd. and/or its affiliates\nSPDX-License-Identifier: MIT\n\"\"\"\n\nimport argparse\nimport gl"
},
{
"path": "pyproject.toml",
"chars": 175,
"preview": "[tool.black]\nline-length = 120\ninclude = '\\.pyi?$'\nexclude = '''\n/(\n \\.git\n | \\.hg\n | \\.mypy_cache\n | \\.tox\n | \\."
},
{
"path": "requirements.txt",
"chars": 255,
"preview": "datasets==3.6.0\ntorch==2.6.0\ntorchvision==0.21.0\ntransformers==4.51.0\ndeepspeed==0.16.4\ntriton==3.2.0\naccelerate==1.4.0\n"
},
{
"path": "utils/markdown_utils.py",
"chars": 10889,
"preview": "\"\"\" \nCopyright (c) 2025 Bytedance Ltd. and/or its affiliates\nSPDX-License-Identifier: MIT\n\"\"\"\n\nimport re\nimport base64\nf"
},
{
"path": "utils/utils.py",
"chars": 18548,
"preview": "\"\"\"\nCopyright (c) 2025 Bytedance Ltd. and/or its affiliates\nSPDX-License-Identifier: MIT\n\"\"\"\n\nimport io\nimport json\nimpo"
}
]
About this extraction
This page contains the full source code of the bytedance/Dolphin GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 12 files (85.5 KB), approximately 21.7k tokens, and a symbol index with 47 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.