Repository: bytedance/Dolphin Branch: master Commit: fcbf0334a9ea Files: 12 Total size: 85.5 KB Directory structure: gitextract_yl96yq_r/ ├── .gitignore ├── .pre-commit-config.yaml ├── LICENSE ├── README.md ├── README_CN.md ├── demo_element.py ├── demo_layout.py ├── demo_page.py ├── pyproject.toml ├── requirements.txt └── utils/ ├── markdown_utils.py └── utils.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage *.cover *.py,cover .hypothesis/ .pytest_cache/ coverage.xml *.mo *.pot # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm .idea/ *.iml # VS Code .vscode/ !.vscode/settings.json !.vscode/tasks.json !.vscode/launch.json !.vscode/extensions.json # macOS .DS_Store # Windows Thumbs.db ehthumbs.db Desktop.ini fusion_result.json kernel_meta/ ================================================ FILE: .pre-commit-config.yaml ================================================ repos: # 1. isort - 自动排序 Python imports - repo: https://github.com/pycqa/isort rev: 6.0.1 # 使用固定版本号 hooks: - id: isort name: isort (python) args: [--profile=black] # 与 Black 兼容的配置 language: python # 2. Black - 自动格式化 Python 代码 - repo: https://github.com/psf/black rev: 25.1.0 # 使用固定版本号 hooks: - id: black language: python # 3. flake8 - Python 静态检查 - repo: https://github.com/pycqa/flake8 rev: 7.2.0 hooks: - id: flake8 args: [--max-line-length=120, --ignore=E203] # 设置行长度为 120 additional_dependencies: [flake8-bugbear==24.12.12] # 可选:增强检查 # 4. 
pre-commit-hooks - 通用 Git 钩子 - repo: https://github.com/pre-commit/pre-commit-hooks rev: v5.0.0 hooks: - id: trailing-whitespace # 删除行尾空格 - id: end-of-file-fixer # 确保文件以换行符结束 - id: check-yaml # 验证 YAML 文件语法 - id: check-added-large-files # 阻止大文件提交 args: ["--maxkb=512"] ================================================ FILE: LICENSE ================================================ Qwen RESEARCH LICENSE AGREEMENT Qwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024 By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. 1. Definitions a. This Qwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. b. "We" (or "Us") shall mean Alibaba Cloud. c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you. e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us. f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement. g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. h. 
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. i. "Non-Commercial" shall mean for research or evaluation purposes only. 2. Grant of Rights a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY. b. If you are commercially using the Materials, you shall request a license from us. 3. Redistribution You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions: a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; b. You shall cause any modified files to carry prominent notices stating that you changed the files; c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. 4. Rules of use a. 
The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation. 5. Intellectual Property a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. 6. Disclaimer of Warranty and Limitation of Liability a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto. b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. 
WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. 7. Survival and Termination. a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement. 8. Governing Law and Jurisdiction. a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. 9. Other Terms and Conditions. a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from us, if you use the Materials in ways not expressly agreed to in this Agreement. b. We shall not be bound by any additional or different terms or conditions communicated by you unless expressly agreed. 
================================================ FILE: README.md ================================================


# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin-v2 is an enhanced universal document parsing model that substantially improves upon the original Dolphin. It handles both digital-born and photographed documents through a document-type-aware two-stage architecture with scalable anchor prompting.

## 📑 Overview

Document image parsing is challenging because document types are diverse and elements such as text paragraphs, figures, formulas, tables, and code blocks are intertwined in complex ways. Dolphin-v2 addresses these challenges with a document-type-aware two-stage approach:

1. **🔍 Stage 1**: Document type classification (digital vs. photographed) plus layout analysis with reading order prediction
2. **🧩 Stage 2**: Hybrid parsing strategy: holistic parsing for photographed documents, parallel element-wise parsing for digital documents
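The element-wise branch of Stage 2 can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the prompts and the `tab`/`equ`/`code` labels are copied from `demo_element.py`, while `prompt_for`, `parse_elements`, `fake_chat`, and the string "crops" are hypothetical names introduced here. It assumes Stage 1 has already produced `(label, crop)` pairs in reading order, that every other label falls back to the text prompt, and it does not cover the holistic path for photographed pages.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Element prompts copied from demo_element.py in this repository.
ELEMENT_PROMPTS: Dict[str, str] = {
    "tab": "Parse the table in the image.",
    "equ": "Read formula in the image.",
    "code": "Read code in the image.",
}
DEFAULT_PROMPT = "Read text in the image."

# Hypothetical stand-in: (label, crop). A real crop would be a PIL image.
Element = Tuple[str, str]
ChatFn = Callable[[List[str], List[str]], List[str]]


def prompt_for(label: str) -> str:
    """Pick the prompt for a layout label, defaulting to the text prompt (assumption)."""
    return ELEMENT_PROMPTS.get(label, DEFAULT_PROMPT)


def parse_elements(elements: Sequence[Element], chat: ChatFn, max_batch_size: int = 8) -> List[dict]:
    """Decode elements in batches while preserving their reading order."""
    results: List[dict] = []
    for start in range(0, len(elements), max_batch_size):
        batch = elements[start:start + max_batch_size]
        prompts = [prompt_for(label) for label, _ in batch]
        crops = [crop for _, crop in batch]
        texts = chat(prompts, crops)  # one batched model call per chunk
        for offset, ((label, _), text) in enumerate(zip(batch, texts)):
            results.append(
                {"label": label, "text": text.strip(), "reading_order": start + offset}
            )
    return results


def fake_chat(prompts: List[str], crops: List[str]) -> List[str]:
    """Placeholder for a real model call; echoes the prompt and the crop name."""
    return [f"{prompt} [{crop}]" for prompt, crop in zip(prompts, crops)]


if __name__ == "__main__":
    page = [("para", "crop_0"), ("tab", "crop_1"), ("equ", "crop_2")]
    for item in parse_elements(page, fake_chat, max_batch_size=2):
        print(item)
```

In practice `fake_chat` would be replaced by a batched call such as `DOLPHIN.chat(prompts, images)` from the demo scripts, and `max_batch_size` plays the same role as the `--max_batch_size` flag of `demo_page.py`.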
Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while remaining efficient thanks to its lightweight architecture and parallel parsing mechanism.

## 📅 Changelog

- 🔥 **2025.12.12** Released *Dolphin-v2* model. Upgraded to 3B parameters with 21-element detection, attribute field extraction, dedicated formula/code parsing, and robust photographed document parsing. (Dolphin-1.5 moved to [v1.5 branch](https://github.com/bytedance/Dolphin/tree/v1.5))
- 🔥 **2025.10.16** Released *Dolphin-1.5* model. While maintaining the lightweight 0.3B architecture, this version achieves significant parsing improvements. (Dolphin 1.0 moved to [v1.0 branch](https://github.com/bytedance/Dolphin/tree/v1.0))
- 🔥 **2025.07.10** Released the *Fox-Page Benchmark*, a manually refined subset of the original [Fox dataset](https://github.com/ucaslcl/Fox). Download via: [Baidu Yun](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing).
- 🔥 **2025.06.30** Added [TensorRT-LLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.27** Added [vLLM support](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md) for accelerated inference!
- 🔥 **2025.06.13** Added multi-page PDF document parsing capability.
- 🔥 **2025.05.21** Our demo is released at [link](http://115.190.42.15:8888/dolphin/). Check it out!
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.
- 🔥 **2025.05.16** Our paper has been accepted by ACL 2025. Paper link: [arXiv](https://arxiv.org/abs/2505.14059).

## 📈 Performance

Comprehensive evaluation of document parsing on OmniDocBench (v1.5)

| Model | Size | Overall↑ | Text Edit | Formula CDM | Table TEDS | Table TEDS-S | Read Order Edit |
|---|---|---|---|---|---|---|---|
| Dolphin | 0.3B | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 0.3B | 85.06 | 0.085 | 79.44 | 84.25 | 88.06 | 0.071 |
| Dolphin-v2 | 3B | 89.78 | 0.054 | 87.63 | 87.02 | 90.48 | 0.054 |

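As a rough sanity check on the Overall column, the sketch below recomputes it from three of the other columns. It assumes that Overall is the mean of (1 − Text Edit) × 100, Formula CDM, and Table TEDS. This README does not state that formula, so treat it as an assumption; the numbers are copied from the table above, and `overall_score` and `recompute_overall` are names introduced here for illustration.

```python
# Assumption (not stated in this README):
#   Overall = ((1 - Text Edit) * 100 + Formula CDM + Table TEDS) / 3
# The values below are copied from the performance table above.
REPORTED = {
    "Dolphin": {"overall": 74.67, "text_edit": 0.125, "formula_cdm": 67.85, "table_teds": 68.70},
    "Dolphin-1.5": {"overall": 85.06, "text_edit": 0.085, "formula_cdm": 79.44, "table_teds": 84.25},
    "Dolphin-v2": {"overall": 89.78, "text_edit": 0.054, "formula_cdm": 87.63, "table_teds": 87.02},
}


def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    """Average the three components after mapping the edit distance to a 0-100 score."""
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0


def recompute_overall(rows: dict) -> dict:
    """Return {model: recomputed Overall} for every row."""
    return {
        name: overall_score(row["text_edit"], row["formula_cdm"], row["table_teds"])
        for name, row in rows.items()
    }


if __name__ == "__main__":
    for name, value in recompute_overall(REPORTED).items():
        reported = REPORTED[name]["overall"]
        print(f"{name}: reported={reported:.2f} recomputed={value:.2f}")
```

Small differences between the reported and recomputed values are expected, because the table inputs are themselves rounded.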
## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained *Dolphin-v2* model. Visit our Hugging Face [model card](https://huggingface.co/ByteDance/Dolphin-v2), or download the model with one of the following:
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model

   # Or use the Hugging Face CLI
   pip install huggingface_hub
   huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
   ```

## ⚡ Inference

Dolphin supports two parsing granularities, plus a layout-only mode (`demo_layout.py`):

- **Page-level Parsing**: Parse the entire document page into structured JSON and Markdown
- **Element-level Parsing**: Parse individual document elements (text, table, formula, code)

### 📄 Page-level Parsing

```bash
# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single document pdf
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs

# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs \
    --max_batch_size 8
```

### 🧩 Element-level Parsing

```bash
# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
    --input_path \
    --element_type [table|formula|text|code]
```

### 🎨 Layout Parsing

```bash
# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single PDF document
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs
```

## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration

## 📮 Notice

**Call for Bad Cases:** If you encounter cases where the model performs poorly, we would greatly appreciate it if you shared them in an issue. We are continuously working to optimize and improve the model.

## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:

- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)

## 📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=bytedance/Dolphin&type=Date)](https://www.star-history.com/#bytedance/Dolphin&Date)

================================================
FILE: README_CN.md
================================================


# Dolphin: 基于异构锚点提示的文档图像解析 Dolphin(**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g)是一个创新的多模态文档图像解析模型(**0.3B**),采用"分析-解析"的两阶段范式。本仓库包含Dolphin的演示代码和预训练模型。 ## 📑 概述 由于文档图像中文本段落、图表、公式和表格等元素的复杂交织,文档图像解析具有挑战性。Dolphin通过两阶段方法解决这些挑战: 1. **🔍 第一阶段**:通过按自然阅读顺序生成元素序列进行全面的页面级布局分析 2. **🧩 第二阶段**:使用异构锚点和任务特定提示高效并行解析文档元素
Dolphin在多样化的页面级和元素级解析任务中取得了优异的性能,同时通过其轻量级架构和并行解析机制确保了卓越的效率。 ## 📅 更新日志 - 🔥 **2025.12.12** *Dolphin-v2* 开源!支持 21 类元素检测、属性字段提取、代码专用解析,以及拍照文档解析。(原1.5版本已迁移至[v1.5分支](https://github.com/bytedance/Dolphin/tree/v1.5)) - 🔥 **2025.10.16** *Dolphin-1.5* 开源!在保持轻量级0.3B架构的同时,该版本实现了显著的解析性能提升。(原1.0版本已迁移至[v1.0分支](https://github.com/bytedance/Dolphin/tree/v1.0)) - 🔥 **2025.07.10** *Fox-Page* 基准测试开源。这是原始 [Fox 数据集](https://github.com/ucaslcl/Fox) 人工矫正标注后的版本。下载地址:[百度网盘](https://pan.baidu.com/share/init?surl=t746ULp6iU5bUraVrPlMSw&pwd=fox1) | [Google Drive](https://drive.google.com/file/d/1yZQZqI34QCqvhB4Tmdl3X_XEvYvQyP0q/view?usp=sharing)。 - 🔥 **2025.06.30** 新增[TensorRT-LLM](https://github.com/bytedance/Dolphin/blob/master/deployment/tensorrt_llm/ReadMe.md)支持,提升推理速度! - 🔥 **2025.06.27** 新增[vLLM](https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/ReadMe.md)支持,提升推理速度! - 🔥 **2025.06.13** 新增多页PDF文档解析功能。 - 🔥 **2025.05.21** 我们的演示已在 [链接](http://115.190.42.15:8888/dolphin/) 发布。快来体验吧! - 🔥 **2025.05.20** Dolphin的预训练模型和推理代码已发布。 - 🔥 **2025.05.16** 我们的论文已被ACL 2025接收。论文链接:[arXiv](https://arxiv.org/abs/2505.14059)。 ## 📈 性能表现

OmniDocBench (v1.5) 测试基准上评估结果

| 模型 | 参数 | 总体↑ | 文本 Edit | 公式 CDM | 表格 TEDS | 表格 TEDS-S | 阅读顺序 Edit |
|---|---|---|---|---|---|---|---|
| Dolphin | 0.3B | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 0.3B | 85.06 | 0.085 | 79.44 | 84.25 | 88.06 | 0.071 |
| Dolphin-v2 | 3B | 89.78 | 0.054 | 87.63 | 87.02 | 90.48 | 0.054 |

## 🛠️ 安装 1. 克隆仓库: ```bash git clone https://github.com/ByteDance/Dolphin.git cd Dolphin ``` 2. 安装依赖: ```bash pip install -r requirements.txt ``` 3. 使用以下选项之一下载 *Dolphin-v2* 的预训练模型: 访问我们的Huggingface [模型卡片](https://huggingface.co/ByteDance/Dolphin-v2),或通过以下方式下载模型: ```bash # 从Hugging Face Hub下载模型 git lfs install git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model # 或使用Hugging Face CLI pip install huggingface_hub huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model ``` ## ⚡ 推理 Dolphin提供两个推理框架,支持两种解析粒度: - **页面级解析**:将整个文档页面解析为结构化的JSON和Markdown格式 - **元素级解析**:解析单个文档元素(文本、表格、公式) ### 📄 页面级解析 ```bash # 处理单个文档图像 python demo_page.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs/page_1.png # 处理单个文档PDF python demo_page.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs/page_6.pdf # 处理目录中的所有文档 python demo_page.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs # 使用自定义批次大小进行并行元素解码 python demo_page.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs \ --max_batch_size 8 ``` ### 🧩 元素级解析 ````bash # 解析块图像 (支持块图像类型: table, formula, text, or code) python demo_element.py --model_path ./hf_model --save_dir ./results \ --input_path \ --element_type [table|formula|text|code] ```` ### 🎨 元素定位及阅读顺序解析 ````bash # 处理单个文档图像 python demo_layout.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs/page_1.png \ # 处理单个文档PDF python demo_layout.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs/page_6.pdf \ # 处理目录中的所有文档 python demo_layout.py --model_path ./hf_model --save_dir ./results \ --input_path ./demo/page_imgs ```` ## 🌟 主要特性 - 🔄 基于单一VLM的两阶段分析-解析方法 - 📊 在文档解析任务上的优异性能 - 🔍 自然阅读顺序元素序列生成 - 🧩 针对不同文档元素的异构锚点提示 - ⏱️ 高效的并行解析机制 - 🤗 支持Hugging Face Transformers,便于集成 ## 📮 通知 **征集不良案例:** 如果您遇到模型表现不佳的案例,我们非常欢迎您在issue中分享。我们正在持续优化和改进模型。 ## 💖 致谢 我们要感谢以下开源项目为本工作提供的灵感和参考: - 
[OmniDocBench](https://github.com/opendatalab/OmniDocBench) - [Donut](https://github.com/clovaai/donut/) - [Nougat](https://github.com/facebookresearch/nougat) - [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0) - [MinerU](https://github.com/opendatalab/MinerU/tree/master) - [Swin](https://github.com/microsoft/Swin-Transformer) - [Hugging Face Transformers](https://github.com/huggingface/transformers) ## 📝 引用 如果您在研究中发现此代码有用,请使用以下BibTeX条目。 ```bibtex @article{feng2025dolphin, title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting}, author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others}, journal={arXiv preprint arXiv:2505.14059}, year={2025} } ``` ## 星标历史 [![Star History Chart](https://api.star-history.com/svg?repos=bytedance/Dolphin&type=Date)](https://www.star-history.com/#bytedance/Dolphin&Date) ================================================ FILE: demo_element.py ================================================ """ Copyright (c) 2025 Bytedance Ltd. 
and/or its affiliates
SPDX-License-Identifier: MIT
"""

import argparse
import glob
import os

import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        if self.device == "cuda":
            self.model = self.model.bfloat16()
        else:
            self.model = self.model.float()

        # set tokenizer
        self.tokenizer = self.processor.tokenizer
        self.tokenizer.padding_side = "left"

    def chat(self, prompt, image):
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)

        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)

        assert len(images) == len(prompts)

        # preprocess all images
        processed_images = [resize_img(img) for img in images]

        # generate all messages
        all_messages = []
        for img, question in zip(processed_images, prompts):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": img,
                        },
                        {"type": "text", "text": question}
                    ],
                }
            ]
            all_messages.append(messages)

        # prepare all texts
        texts = [
            self.processor.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=True
            )
            for msgs in all_messages
        ]

        # collect all image inputs
        all_image_inputs = []
        all_video_inputs = None
        for msgs in all_messages:
            image_inputs, video_inputs = process_vision_info(msgs)
            all_image_inputs.extend(image_inputs)

        # prepare model inputs
        inputs = self.processor(
            text=texts,
            images=all_image_inputs if all_image_inputs else None,
            videos=all_video_inputs if all_video_inputs else None,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(self.model.device)

        # inference
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=4096,
            # repetition_penalty=1.05
        )
        # strip the prompt tokens so only newly generated tokens are decoded
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        results = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_element(image_path, model, element_type, save_dir=None):
    """Process a single element image (text, table, formula, code)

    Args:
        image_path: Path to the element image
        model: DOLPHIN model instance
        element_type: Type of element ('text', 'table', 'formula', 'code')
        save_dir: Directory to save results (default: same as input directory)

    Returns:
        Parsed content of the element and recognition results
    """
    # Load and prepare image
    pil_image = Image.open(image_path).convert("RGB")
    # pil_image = crop_margin(pil_image)

    # Select appropriate prompt based on element type
    if element_type == "table":
        prompt = "Parse the table in the image."
        label = "tab"
    elif element_type == "formula":
        prompt = "Read formula in the image."
        label = "equ"
    elif element_type == "code":
        prompt = "Read code in the image."
        label = "code"
    else:  # Default to text
        prompt = "Read text in the image."
        label = "para"

    # Process the element
    result = model.chat(prompt, pil_image)

    # Create recognition result in the same format as the document parser
    recognition_results = [
        {
            "label": label,
            "text": result.strip(),
        }
    ]

    # Save results to save_dir
    save_outputs(recognition_results, pil_image, os.path.basename(image_path).split(".")[0], save_dir)
    print(f"Results saved to {save_dir}")

    return result, recognition_results


def main():
    parser = argparse.ArgumentParser(description="Element-level processing using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument("--input_path", type=str, required=True, help="Path to input image or directory of images")
    parser.add_argument(
        "--element_type",
        type=str,
        choices=["text", "table", "formula", "code"],
        default="text",
        help="Type of element to process (text, table, formula, code)",
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save parsing results (default: same as input directory)",
    )
    parser.add_argument("--print_results", action="store_true", help="Print recognition results to console")
    args = parser.parse_args()

    # Load Model
    model = DOLPHIN(args.model_path)

    # Set save directory
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )
    setup_output_dirs(save_dir)

    # Collect Images
    if os.path.isdir(args.input_path):
        image_files = []
        for ext in [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG"]:
            image_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}")))
        image_files = sorted(image_files)
    else:
        if not os.path.exists(args.input_path):
            raise FileNotFoundError(f"Input path {args.input_path} does not exist")
        image_files = [args.input_path]

    total_samples = len(image_files)
    print(f"\nTotal samples to process: {total_samples}")

    # Process images one by one
    for image_path in image_files:
        print(f"\nProcessing {image_path}")
        try:
            result, recognition_result = process_element(
                image_path=image_path,
                model=model,
                element_type=args.element_type,
                save_dir=save_dir,
            )

            if args.print_results:
                print("\nRecognition result:")
                print(result)
                print("-" * 40)
        except Exception as e:
            print(f"Error processing {image_path}: {str(e)}")
            continue


if __name__ == "__main__":
    main()

================================================
FILE: demo_layout.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
SPDX-License-Identifier: MIT
"""

import argparse
import glob
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

from utils.utils import *


class DOLPHIN:
    def __init__(self, model_id_or_path):
        """Initialize the Hugging Face model

        Args:
            model_id_or_path: Path to local model or Hugging Face model ID
        """
        # Load model from local path or Hugging Face hub
        self.processor = AutoProcessor.from_pretrained(model_id_or_path)
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path)
        self.model.eval()

        # Set device and precision
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        if self.device == "cuda":
            self.model = self.model.bfloat16()
        else:
            self.model = self.model.float()

        # set tokenizer
        self.tokenizer = self.processor.tokenizer
        self.tokenizer.padding_side = "left"

    def chat(self, prompt, image):
        # Check if we're dealing with a batch
        is_batch = isinstance(image, list)

        if not is_batch:
            # Single image, wrap it in a list for consistent processing
            images = [image]
            prompts = [prompt]
        else:
            # Batch of images
            images = image
            prompts = prompt if isinstance(prompt, list) else [prompt] * len(images)

        assert len(images) == len(prompts)

        # preprocess all images
        processed_images = [resize_img(img) for img in images]

        # generate all messages
        all_messages = []
        for img, question in zip(processed_images, prompts):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": img,
                        },
                        {"type": "text", "text": question}
                    ],
                }
            ]
            all_messages.append(messages)

        # prepare all texts
        texts = [
            self.processor.apply_chat_template(
                msgs, tokenize=False, add_generation_prompt=True
            )
            for msgs in all_messages
        ]

        # collect all image inputs
        all_image_inputs = []
        all_video_inputs = None
        for msgs in all_messages:
            image_inputs, video_inputs = process_vision_info(msgs)
            all_image_inputs.extend(image_inputs)

        # prepare model inputs
        inputs = self.processor(
            text=texts,
            images=all_image_inputs if all_image_inputs else None,
            videos=all_video_inputs if all_video_inputs else None,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(self.model.device)

        # inference (greedy decoding)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=4096,
            do_sample=False,
            temperature=None,
            # repetition_penalty=1.05
        )
        # strip the prompt tokens so only newly generated tokens are decoded
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        results = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        # Return a single result for single image input
        if not is_batch:
            return results[0]
        return results


def process_layout(input_path, model, save_dir):
    """Process layout detection for image or PDF

    Args:
        input_path: Path to input image or PDF
        model: DOLPHIN model instance
        save_dir: Directory to save results
    """
    file_ext = os.path.splitext(input_path)[1].lower()

    if file_ext == '.pdf':
        # Convert PDF to images
        images = convert_pdf_to_images(input_path)
        if not images:
            raise Exception(f"Failed to convert PDF {input_path} to images")

        # Process each page
        for page_idx, pil_image in enumerate(images):
            print(f"\nProcessing page {page_idx + 1}/{len(images)}")

            # Generate output name for this page
            base_name = os.path.splitext(os.path.basename(input_path))[0]
            page_name = f"{base_name}_page_{page_idx + 1:03d}"

            # Process layout for this page
            process_single_layout(pil_image, model, save_dir, page_name)
    else:
        # Process regular image file
        pil_image = Image.open(input_path).convert("RGB")
        base_name = os.path.splitext(os.path.basename(input_path))[0]
        process_single_layout(pil_image, model, save_dir, base_name)


def process_single_layout(pil_image, model, save_dir, image_name):
    """Process layout for a single image

    Args:
        pil_image: PIL Image object
        model: DOLPHIN model instance
        save_dir: Directory to save results
        image_name: Name for the output files
    """
    # Parse layout
    print("Parsing layout and reading order...")
    layout_results = model.chat("Parse the reading order of this document.", pil_image)

    # Parse the layout string; fall back to a single full-page element
    # when the output is empty or not a bracketed list
    layout_results_list = parse_layout_string(layout_results)
    if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")):
        layout_results_list = [([0, 0, *pil_image.size], 'distorted_page', [])]

    # map bbox to original image coordinates
    recognition_results = []
    reading_order = 0
    for bbox, label, tags in layout_results_list:
        x1, y1, x2, y2 = process_coordinates(bbox, pil_image)
        recognition_results.append({
            "label": label,
            "bbox": [x1, y1, x2, y2],
            "text": "",  # empty for now
            "reading_order": reading_order,
            "tags": tags,
        })
        reading_order += 1

    json_path = save_outputs(recognition_results, pil_image, image_name, save_dir)


def main():
    parser = argparse.ArgumentParser(description="Layout detection and visualization using DOLPHIN model")
    parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model")
    parser.add_argument(
        "--input_path",
        type=str,
        required=True,
        help="Path to input image/PDF or directory of files"
    )
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory to save results (default: same as input directory)",
    )
    args = parser.parse_args()

    # Load Model
    print("Loading model...")
    model = DOLPHIN(args.model_path)

    # Set save directory
    save_dir = args.save_dir or (
        args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path)
    )

    # Create save directory if it doesn't exist
os.makedirs(save_dir, exist_ok=True) # Collect files if os.path.isdir(args.input_path): # Support both image and PDF files file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"] input_files = [] for ext in file_extensions: input_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}"))) input_files = sorted(input_files) else: if not os.path.exists(args.input_path): raise FileNotFoundError(f"Input path {args.input_path} does not exist") # Check if it's a supported file type file_ext = os.path.splitext(args.input_path)[1].lower() supported_exts = ['.jpg', '.jpeg', '.png', '.pdf'] if file_ext not in supported_exts: raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}") input_files = [args.input_path] total_files = len(input_files) print(f"\nTotal files to process: {total_files}") # Process files for file_path in input_files: print(f"\n{'='*60}") print(f"Processing: {file_path}") print('='*60) try: process_layout( input_path=file_path, model=model, save_dir=save_dir, ) print(f"\n✓ Processing completed for {file_path}") except Exception as e: print(f"\n✗ Error processing {file_path}: {str(e)}") continue print(f"\n{'='*60}") print(f"All processing completed. Results saved to {save_dir}") print('='*60) if __name__ == "__main__": main() ================================================ FILE: demo_page.py ================================================ """ Copyright (c) 2025 Bytedance Ltd. 
and/or its affiliates SPDX-License-Identifier: MIT """ import argparse import glob import os import torch from PIL import Image from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration from qwen_vl_utils import process_vision_info from utils.utils import * class DOLPHIN: def __init__(self, model_id_or_path): """Initialize the Hugging Face model Args: model_id_or_path: Path to local model or Hugging Face model ID """ # Load model from local path or Hugging Face hub self.processor = AutoProcessor.from_pretrained(model_id_or_path) self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id_or_path) self.model.eval() # Set device and precision self.device = "cuda" if torch.cuda.is_available() else "cpu" self.model.to(self.device) if self.device == "cuda": self.model = self.model.bfloat16() else: self.model = self.model.float() # set tokenizer self.tokenizer = self.processor.tokenizer self.tokenizer.padding_side = "left" def chat(self, prompt, image): # Check if we're dealing with a batch is_batch = isinstance(image, list) if not is_batch: # Single image, wrap it in a list for consistent processing images = [image] prompts = [prompt] else: # Batch of images images = image prompts = prompt if isinstance(prompt, list) else [prompt] * len(images) assert len(images) == len(prompts) # preprocess all images processed_images = [resize_img(img) for img in images] # generate all messages all_messages = [] for img, question in zip(processed_images, prompts): messages = [ { "role": "user", "content": [ { "type": "image", "image": img, }, {"type": "text", "text": question} ], } ] all_messages.append(messages) # prepare all texts texts = [ self.processor.apply_chat_template( msgs, tokenize=False, add_generation_prompt=True ) for msgs in all_messages ] # collect all image inputs all_image_inputs = [] all_video_inputs = None for msgs in all_messages: image_inputs, video_inputs = process_vision_info(msgs) all_image_inputs.extend(image_inputs) # prepare 
model inputs inputs = self.processor( text=texts, images=all_image_inputs if all_image_inputs else None, videos=all_video_inputs if all_video_inputs else None, padding=True, return_tensors="pt", ) inputs = inputs.to(self.model.device) # inference generated_ids = self.model.generate( **inputs, max_new_tokens=4096, do_sample=False, temperature=None, # repetition_penalty=1.05 ) generated_ids_trimmed = [ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] results = self.processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) # Return a single result for single image input if not is_batch: return results[0] return results def process_document(document_path, model, save_dir, max_batch_size=None): """Parse documents with two stages - Handles both images and PDFs""" file_ext = os.path.splitext(document_path)[1].lower() if file_ext == '.pdf': # Convert PDF to images images = convert_pdf_to_images(document_path) if not images: raise Exception(f"Failed to convert PDF {document_path} to images") all_results = [] # Process each page for page_idx, pil_image in enumerate(images): print(f"Processing page {page_idx + 1}/{len(images)}") # Generate output name for this page base_name = os.path.splitext(os.path.basename(document_path))[0] page_name = f"{base_name}_page_{page_idx + 1:03d}" # Process this page (don't save individual page results) json_path, recognition_results = process_single_image( pil_image, model, save_dir, page_name, max_batch_size, save_individual=False ) # Add page information to results page_results = { "page_number": page_idx + 1, "elements": recognition_results } all_results.append(page_results) # Save combined results for multi-page PDF combined_json_path = save_combined_pdf_results(all_results, document_path, save_dir) return combined_json_path, all_results else: # Process regular image file pil_image = Image.open(document_path).convert("RGB") base_name = 
os.path.splitext(os.path.basename(document_path))[0] return process_single_image(pil_image, model, save_dir, base_name, max_batch_size) def process_single_image(image, model, save_dir, image_name, max_batch_size=None, save_individual=True): """Process a single image (either from file or converted from PDF page) Args: image: PIL Image object model: DOLPHIN model instance save_dir: Directory to save results image_name: Name for the output file max_batch_size: Maximum batch size for processing save_individual: Whether to save individual results (False for PDF pages) Returns: Tuple of (json_path, recognition_results) """ # Stage 1: Page-level layout and reading order parsing layout_output = model.chat("Parse the reading order of this document.", image) # print(layout_output) # Stage 2: Element-level content parsing recognition_results = process_elements(layout_output, image, model, max_batch_size, save_dir, image_name) # Save outputs only if requested (skip for PDF pages) json_path = None if save_individual: # Create a dummy image path for save_outputs function json_path = save_outputs(recognition_results, image, image_name, save_dir) return json_path, recognition_results def process_elements(layout_results, image, model, max_batch_size, save_dir=None, image_name=None): """Parse all document elements with parallel decoding""" layout_results_list = parse_layout_string(layout_results) if not layout_results_list or not (layout_results.startswith("[") and layout_results.endswith("]")): layout_results_list = [([0, 0, *image.size], 'distorted_page', [])] # Check for bbox overlap - if too many overlaps, treat as distorted page elif len(layout_results_list) > 1 and check_bbox_overlap(layout_results_list, image): print("Falling back to distorted_page mode due to high bbox overlap") layout_results_list = [([0, 0, *image.size], 'distorted_page', [])] tab_elements = [] equ_elements = [] code_elements = [] text_elements = [] figure_results = [] reading_order = 0 # Collect elements 
and group for bbox, label, tags in layout_results_list: try: if label == "distorted_page": x1, y1, x2, y2 = 0, 0, *image.size pil_crop = image else: # get coordinates in the original image x1, y1, x2, y2 = process_coordinates(bbox, image) # crop the image pil_crop = image.crop((x1, y1, x2, y2)) if pil_crop.size[0] > 3 and pil_crop.size[1] > 3: if label == "fig": figure_filename = save_figure_to_local(pil_crop, save_dir, image_name, reading_order) figure_results.append({ "label": label, "text": f"![Figure](figures/{figure_filename})", "figure_path": f"figures/{figure_filename}", "bbox": [x1, y1, x2, y2], "reading_order": reading_order, "tags": tags, }) else: # Prepare element information element_info = { "crop": pil_crop, "label": label, "bbox": [x1, y1, x2, y2], "reading_order": reading_order, "tags": tags, } if label == "tab": tab_elements.append(element_info) elif label == "equ": equ_elements.append(element_info) elif label == "code": code_elements.append(element_info) else: text_elements.append(element_info) reading_order += 1 except Exception as e: print(f"Error processing bbox with label {label}: {str(e)}") continue recognition_results = figure_results.copy() if tab_elements: results = process_element_batch(tab_elements, model, "Parse the table in the image.", max_batch_size) recognition_results.extend(results) if equ_elements: results = process_element_batch(equ_elements, model, "Read formula in the image.", max_batch_size) recognition_results.extend(results) if code_elements: results = process_element_batch(code_elements, model, "Read code in the image.", max_batch_size) recognition_results.extend(results) if text_elements: results = process_element_batch(text_elements, model, "Read text in the image.", max_batch_size) recognition_results.extend(results) recognition_results.sort(key=lambda x: x.get("reading_order", 0)) return recognition_results def process_element_batch(elements, model, prompt, max_batch_size=None): """Process elements of the same type in 
batches""" results = [] # Determine batch size batch_size = len(elements) if max_batch_size is not None and max_batch_size > 0: batch_size = min(batch_size, max_batch_size) # Process in batches for i in range(0, len(elements), batch_size): batch_elements = elements[i:i+batch_size] crops_list = [elem["crop"] for elem in batch_elements] # Use the same prompt for all elements in the batch prompts_list = [prompt] * len(crops_list) # Batch inference batch_results = model.chat(prompts_list, crops_list) # Add results for j, result in enumerate(batch_results): elem = batch_elements[j] results.append({ "label": elem["label"], "bbox": elem["bbox"], "text": result.strip(), "reading_order": elem["reading_order"], "tags": elem["tags"], }) return results def main(): parser = argparse.ArgumentParser(description="Document parsing based on DOLPHIN") parser.add_argument("--model_path", default="./hf_model", help="Path to Hugging Face model") parser.add_argument("--input_path", type=str, default="./demo", help="Path to input image/PDF or directory of files") parser.add_argument( "--save_dir", type=str, default=None, help="Directory to save parsing results (default: same as input directory)", ) parser.add_argument( "--max_batch_size", type=int, default=4, help="Maximum number of document elements to parse in a single batch (default: 4)", ) args = parser.parse_args() # Load Model model = DOLPHIN(args.model_path) # Collect Document Files (images and PDFs) if os.path.isdir(args.input_path): # Support both image and PDF files file_extensions = [".jpg", ".jpeg", ".png", ".JPG", ".JPEG", ".PNG", ".pdf", ".PDF"] document_files = [] for ext in file_extensions: document_files.extend(glob.glob(os.path.join(args.input_path, f"*{ext}"))) document_files = sorted(document_files) else: if not os.path.exists(args.input_path): raise FileNotFoundError(f"Input path {args.input_path} does not exist") # Check if it's a supported file type file_ext = os.path.splitext(args.input_path)[1].lower() 
supported_exts = ['.jpg', '.jpeg', '.png', '.pdf'] if file_ext not in supported_exts: raise ValueError(f"Unsupported file type: {file_ext}. Supported types: {supported_exts}") document_files = [args.input_path] save_dir = args.save_dir or ( args.input_path if os.path.isdir(args.input_path) else os.path.dirname(args.input_path) ) setup_output_dirs(save_dir) total_samples = len(document_files) print(f"\nTotal files to process: {total_samples}") # Process All Document Files for file_path in document_files: print(f"\nProcessing {file_path}") try: json_path, recognition_results = process_document( document_path=file_path, model=model, save_dir=save_dir, max_batch_size=args.max_batch_size, ) print(f"Processing completed. Results saved to {save_dir}") except Exception as e: print(f"Error processing {file_path}: {str(e)}") continue if __name__ == "__main__": main() ================================================ FILE: pyproject.toml ================================================ [tool.black] line-length = 120 include = '\.pyi?$' exclude = ''' /( \.git | \.hg | \.mypy_cache | \.tox | \.venv | _build | buck-out | build | dist )/ ''' ================================================ FILE: requirements.txt ================================================ datasets==3.6.0 torch==2.6.0 torchvision==0.21.0 transformers==4.51.0 deepspeed==0.16.4 triton==3.2.0 accelerate==1.4.0 torchcodec==0.2 decord==0.6.0 Levenshtein==0.27.1 qwen_vl_utils matplotlib jieba opencv-python bs4 albumentations==1.4.0 pymupdf==1.26 ================================================ FILE: utils/markdown_utils.py ================================================ """ Copyright (c) 2025 Bytedance Ltd. 
and/or its affiliates
SPDX-License-Identifier: MIT
"""

import re
import base64
from typing import List, Dict, Any, Optional


def extract_table_from_html(html_string):
    """Extract and clean table tags from HTML string"""
    try:
        # Match each <table ...>...</table> block; DOTALL lets a table span several lines
        table_pattern = re.compile(r'<table.*?>.*?</table>', re.DOTALL)
        tables = table_pattern.findall(html_string)
        # Drop any attributes from the opening <table> tag
        tables = [re.sub(r'<table[^>]*>', '<table>', table) for table in tables]
        return '\n'.join(tables)
    except Exception as e:
        print(f"extract_table_from_html error: {str(e)}")
        return f"<table><tr><td>Error extracting table: {str(e)}</td></tr></table>"


class MarkdownConverter:
    """Convert structured recognition results to Markdown format"""

    def __init__(self):
        # Define heading levels for different section types
        self.heading_levels = {
            'sec_0': '#',
            'sec_1': '##',
            'sec_2': '###',
            'sec_3': '###',
            'sec_4': '###',
            'sec_5': '###',
        }

        # Define which labels need special handling
        self.special_labels = {
            'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5',
            'list', 'equ', 'tab', 'fig'
        }

        # Define replacements for special formulas.
        # Raw strings keep the LaTeX backslashes literal and avoid invalid-escape warnings.
        self.replace_dict = {
            r'\bm': r'\mathbf ',
            r'\eqno': r'\quad ',
            r'\quad': r'\quad ',
            r'\leq': r'\leq ',
            r'\pm': r'\pm ',
            r'\varmathbb': r'\mathbb ',
            r'\in fty': r'\infty',
            r'\mu': r'\mu ',
            r'\cdot': r'\cdot ',
            r'\langle': r'\langle ',
        }

    def try_remove_newline(self, text: str) -> str:
        try:
            # Preprocess text to handle line breaks
            text = text.strip()
            text = text.replace('-\n', '')

            # Handle Chinese text line breaks
            def is_chinese(char):
                return '\u4e00' <= char <= '\u9fff'

            lines = text.split('\n')
            processed_lines = []

            # Process all lines except the last one
            for i in range(len(lines)-1):
                current_line = lines[i].strip()
                next_line = lines[i+1].strip()

                # Always add the current line, but determine if we need a newline
                if current_line:  # If current line is not empty
                    if next_line:  # If next line is not empty
                        # Join adjacent Chinese lines without a space; otherwise insert one
                        if is_chinese(current_line[-1]) and is_chinese(next_line[0]):
                            processed_lines.append(current_line)
                        else:
                            processed_lines.append(current_line + ' ')
                    else:
                        # Next line is empty, add current line with newline
                        processed_lines.append(current_line + '\n')
                else:
                    # Current line is empty, add an empty line
                    processed_lines.append('\n')

            # Add the last line
            if lines and lines[-1].strip():
                processed_lines.append(lines[-1].strip())

            text = ''.join(processed_lines)

            return text
        except Exception as e:
            print(f"try_remove_newline error: {str(e)}")
            return text  # Return original text on error

    def _handle_text(self, text: str) -> str:
        """
        Process regular text content, preserving paragraph structure
        """
try: if not text: return "" # Process formulas in text before handling other text processing text = self._process_formulas_in_text(text) text = self.try_remove_newline(text) return text except Exception as e: print(f"_handle_text error: {str(e)}") return text # Return original text on error def _process_formulas_in_text(self, text: str) -> str: """ Process mathematical formulas in text by iteratively finding and replacing formulas. - Identify inline and block formulas - Replace newlines within formulas with \\ """ try: text = text.replace(r'\upmu', r'\mu') for key, value in self.replace_dict.items(): text = text.replace(key, value) return text except Exception as e: print(f"_process_formulas_in_text error: {str(e)}") return text # Return original text on error def _remove_newline_in_heading(self, text: str) -> str: """ Remove newline in heading """ try: # Handle Chinese text line breaks def is_chinese(char): return '\u4e00' <= char <= '\u9fff' # Check if the text contains Chinese characters if any(is_chinese(char) for char in text): return text.replace('\n', '') else: return text.replace('\n', ' ') except Exception as e: print(f"_remove_newline_in_heading error: {str(e)}") return text def _handle_heading(self, text: str, label: str) -> str: """ Convert section headings to appropriate markdown format """ try: level = self.heading_levels.get(label, '#') text = text.strip() text = self._remove_newline_in_heading(text) text = self._handle_text(text) return f"{level} {text}\n\n" except Exception as e: print(f"_handle_heading error: {str(e)}") return f"# Error processing heading: {text}\n\n" def _handle_list_item(self, text: str) -> str: """ Convert list items to markdown list format """ try: return f"- {text.strip()}\n" except Exception as e: print(f"_handle_list_item error: {str(e)}") return f"- Error processing list item: {text}\n" def _handle_figure(self, text: str, section_count: int) -> str: """ Handle figure content """ try: # Check if it's a file path starting 
with "figures/" if text.startswith("figures/"): # Convert to relative path from markdown directory to figures directory relative_path = f"../{text}" return f"![Figure {section_count}]({relative_path})\n\n" # Check if it's already a markdown format image link if text.startswith("!["): # Already in markdown format, return directly return f"{text}\n\n" # If it's still base64 format, maintain original logic if text.startswith("data:image/"): return f"![Figure {section_count}]({text})\n\n" elif ";" in text and "," in text: return f"![Figure {section_count}]({text})\n\n" else: # Assume it's raw base64, convert to data URI img_format = "png" data_uri = f"data:image/{img_format};base64,{text}" return f"![Figure {section_count}]({data_uri})\n\n" except Exception as e: print(f"_handle_figure error: {str(e)}") return f"*[Error processing figure: {str(e)}]*\n\n" def _handle_table(self, text: str) -> str: """ Convert table content to markdown format """ try: markdown_content = [] markdown_table = extract_table_from_html(text) markdown_content.append(markdown_table + "\n") return '\n'.join(markdown_content) + '\n\n' except Exception as e: print(f"_handle_table error: {str(e)}") return f"*[Error processing table: {str(e)}]*\n\n" def _handle_formula(self, text: str) -> str: """ Handle formula-specific content """ try: text = text.strip('$').rstrip("\ ").replace(r'\upmu', r'\mu') for key, value in self.replace_dict.items(): text = text.replace(key, value) processed_text = '$$' + text + '$$' return f"{processed_text}\n\n" except Exception as e: print(f"_handle_formula error: {str(e)}") return f"*[Error processing formula: {str(e)}]*\n\n" def convert(self, recognition_results: List[Dict[str, Any]]) -> str: """ Convert recognition results to markdown format """ try: markdown_content = [] for section_count, result in enumerate(recognition_results): try: label = result.get('label', '') text = result.get('text', '').strip() # Skip empty text if not text: continue # Handle different 
                    # content types
                    if label in {'sec_0', 'sec_1', 'sec_2', 'sec_3', 'sec_4', 'sec_5'}:
                        markdown_content.append(self._handle_heading(text, label))
                    elif label == 'fig':
                        markdown_content.append(self._handle_figure(text, section_count))
                    elif label == 'tab':
                        markdown_content.append(self._handle_table(text))
                    elif label == 'equ':
                        markdown_content.append(self._handle_formula(text))
                    elif label == 'list':
                        markdown_content.append(self._handle_list_item(text))
                    elif label == 'code':
                        markdown_content.append(f"```bash\n{text}\n```\n\n")
                    else:
                        # Handle regular text (paragraphs, etc.)
                        processed_text = self._handle_text(text)
                        markdown_content.append(f"{processed_text}\n\n")
                    # TODO: distorted page
                except Exception as e:
                    print(f"Error processing item {section_count}: {str(e)}")
                    # Add a placeholder for the failed item
                    markdown_content.append("*[Error processing content]*\n\n")

            # Join all content and apply post-processing
            result = ''.join(markdown_content)
            return result

        except Exception as e:
            print(f"convert error: {str(e)}")
            return f"Error generating markdown content: {str(e)}"

================================================
FILE: utils/utils.py
================================================
"""
Copyright (c) 2025 Bytedance Ltd.
and/or its affiliates SPDX-License-Identifier: MIT """ import io import json import os import re from dataclasses import dataclass from typing import List, Tuple import cv2 import numpy as np import pymupdf from PIL import Image from qwen_vl_utils import smart_resize from utils.markdown_utils import MarkdownConverter def save_figure_to_local(pil_crop, save_dir, image_name, reading_order): """Save cropped figure to local file system Args: pil_crop: PIL Image object of the cropped figure save_dir: Base directory to save results image_name: Name of the source image/document reading_order: Reading order of the figure in the document Returns: str: Filename of the saved figure """ try: # Create figures directory if it doesn't exist figures_dir = os.path.join(save_dir, "markdown", "figures") # os.makedirs(figures_dir, exist_ok=True) # Generate figure filename figure_filename = f"{image_name}_figure_{reading_order:03d}.png" figure_path = os.path.join(figures_dir, figure_filename) # Save the figure pil_crop.save(figure_path, format="PNG", quality=95) # print(f"Saved figure: {figure_filename}") return figure_filename except Exception as e: print(f"Error saving figure: {str(e)}") # Return a fallback filename return f"{image_name}_figure_{reading_order:03d}_error.png" def convert_pdf_to_images(pdf_path, target_size=896): """Convert PDF pages to images Args: pdf_path: Path to PDF file target_size: Target size for the longest dimension Returns: List of PIL Images """ images = [] try: doc = pymupdf.open(pdf_path) for page_num in range(len(doc)): page = doc[page_num] # Calculate scale to make longest dimension equal to target_size rect = page.rect scale = target_size / max(rect.width, rect.height) # Render page as image mat = pymupdf.Matrix(scale, scale) pix = page.get_pixmap(matrix=mat) # Convert to PIL Image img_data = pix.tobytes("png") pil_image = Image.open(io.BytesIO(img_data)) images.append(pil_image) doc.close() print(f"Successfully converted {len(images)} pages from PDF") 
return images except Exception as e: print(f"Error converting PDF to images: {str(e)}") return [] def save_combined_pdf_results(all_page_results, pdf_path, save_dir): """Save combined results for multi-page PDF with both JSON and Markdown Args: all_page_results: List of results for all pages pdf_path: Path to original PDF file save_dir: Directory to save results Returns: Path to saved combined JSON file """ # Create output filename based on PDF name base_name = os.path.splitext(os.path.basename(pdf_path))[0] # Prepare combined results combined_results = {"source_file": pdf_path, "total_pages": len(all_page_results), "pages": all_page_results} # Save combined JSON results json_filename = f"{base_name}.json" json_path = os.path.join(save_dir, "recognition_json", json_filename) os.makedirs(os.path.dirname(json_path), exist_ok=True) with open(json_path, "w", encoding="utf-8") as f: json.dump(combined_results, f, indent=2, ensure_ascii=False) # Generate and save combined markdown try: markdown_converter = MarkdownConverter() # Combine all page results into a single list for markdown conversion all_elements = [] for page_data in all_page_results: page_elements = page_data.get("elements", []) if page_elements: # Add page separator if not the first page if all_elements: all_elements.append( {"label": "page_separator", "text": f"\n\n---\n\n", "reading_order": len(all_elements)} ) all_elements.extend(page_elements) # Generate markdown content markdown_content = markdown_converter.convert(all_elements) # Save markdown file markdown_filename = f"{base_name}.md" markdown_path = os.path.join(save_dir, "markdown", markdown_filename) os.makedirs(os.path.dirname(markdown_path), exist_ok=True) with open(markdown_path, "w", encoding="utf-8") as f: f.write(markdown_content) # print(f"Combined markdown saved to: {markdown_path}") except ImportError: print("MarkdownConverter not available, skipping markdown generation") except Exception as e: print(f"Error generating markdown: {e}") # 
    # print(f"Combined JSON results saved to: {json_path}")
    return json_path


def extract_labels_from_string(text):
    """
    Extract the bracketed labels that are not coordinate groups,
    e.g. "[202,217,921,325][para][author]" yields ["para", "author"]
    """
    all_matches = re.findall(r'\[([^\]]+)\]', text)

    labels = []
    for match in all_matches:
        # Skip the "x1,y1,x2,y2" coordinate group
        if not re.match(r'^\d+,\d+,\d+,\d+$', match):
            labels.append(match)

    return labels


def parse_layout_string(bbox_str):
    """
    Dolphin-V1.5 layout string parsing function
    Parse layout string to extract bbox and category information
    Supports multiple formats:
    1. Original format: [x1,y1,x2,y2] label
    2. New format: [x1,y1,x2,y2][label][PAIR_SEP] or [x1,y1,x2,y2][label][meta_info][PAIR_SEP]
    """
    parsed_results = []

    # Split on both separators so each segment holds one element
    segments = bbox_str.split('[PAIR_SEP]')
    new_segments = []
    for seg in segments:
        new_segments.extend(seg.split('[RELATION_SEP]'))
    segments = new_segments

    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue

        coord_pattern = r'\[(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+),(\d*\.?\d+)\]'
        coord_match = re.search(coord_pattern, segment)
        label_matches = extract_labels_from_string(segment)

        if coord_match and label_matches:
            coords = [float(coord_match.group(i)) for i in range(1, 5)]
            label = label_matches[0].strip()
            parsed_results.append((coords, label, label_matches[1:]))  # label_matches[1:] are the tags

    return parsed_results


def process_coordinates(coords, pil_image):
    original_w, original_h = pil_image.size[:2]

    # use the same resize logic as the model
    resized_pil = resize_img(pil_image)
    resized_image = np.array(resized_pil)
    resized_h, resized_w = resized_image.shape[:2]
    resized_h, resized_w = smart_resize(resized_h, resized_w, factor=28, min_pixels=784, max_pixels=2560000)

    # Scale model-space coordinates back to the original image
    w_ratio, h_ratio = original_w / resized_w, original_h / resized_h
    x1 = int(coords[0] * w_ratio)
    y1 = int(coords[1] * h_ratio)
    x2 = int(coords[2] * w_ratio)
    y2 = int(coords[3] * h_ratio)

    # Clamp to the image bounds and keep the box at least 1 pixel wide and tall
    x1 = max(0, min(x1, original_w - 1))
    y1 = max(0, min(y1, original_h - 1))
    x2 = max(x1 + 1, min(x2, original_w))
    y2 = max(y1 + 1, min(y2, original_h))
    return x1, y1, x2, y2


def setup_output_dirs(save_dir):
    """Create necessary output directories"""
    os.makedirs(save_dir, exist_ok=True)
    os.makedirs(os.path.join(save_dir, "markdown"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "output_json"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "markdown", "figures"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "layout_visualization"), exist_ok=True)


def save_outputs(recognition_results, image, image_name, save_dir):
    """Save JSON, markdown, and layout visualization outputs"""
    # Ensure the output subdirectories exist: demo_layout.py only creates save_dir itself,
    # so writing into output_json/ would otherwise fail on a fresh directory
    setup_output_dirs(save_dir)

    # Save JSON file
    json_path = os.path.join(save_dir, "output_json", f"{image_name}.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(recognition_results, f, ensure_ascii=False, indent=2)

    # Generate and save markdown file
    markdown_converter = MarkdownConverter()
    markdown_content = markdown_converter.convert(recognition_results)
    markdown_path = os.path.join(save_dir, "markdown", f"{image_name}.md")
    with open(markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown_content)

    # Save layout visualization (pass original PIL image for coordinate mapping)
    vis_path = os.path.join(save_dir, "layout_visualization", f"{image_name}_layout.png")
    visualize_layout(image, recognition_results, vis_path)
    return json_path


def crop_margin(img: Image.Image) -> Image.Image:
    """Crop margins from image"""
    try:
        width, height = img.size
        if width == 0 or height == 0:
            print("Warning: Image has zero width or height")
            return img

        data = np.array(img.convert("L"))
        data = data.astype(np.uint8)
        max_val = data.max()
        min_val = data.min()
        if max_val == min_val:
            return img
        data = (data - min_val) / (max_val - min_val) * 255
        gray = 255 * (data < 200).astype(np.uint8)

        coords = cv2.findNonZero(gray)  # Find all non-zero points (text)
        if coords is None:
            return img
        a, b, w, h = cv2.boundingRect(coords)  # Find minimum spanning bounding box

        # Ensure crop coordinates are within image bounds
        a = max(0, a)
        b = max(0, b)
        w = min(w, width - a)
        h = min(h, height - b)

        # Only crop
        # if we have a valid region
        if w > 0 and h > 0:
            return img.crop((a, b, a + w, b + h))
        return img
    except Exception as e:
        print(f"crop_margin error: {str(e)}")
        return img  # Return original image on error


def visualize_layout(image_path, layout_results, save_path, alpha=0.3):
    """Visualize layout detection results on the image

    Args:
        image_path: Path to the input image, or a PIL Image
        layout_results: List of dicts with "bbox", "label", "reading_order", and "tags" keys
        save_path: Path to save the visualization
        alpha: Transparency of the overlay (0-1, lower = more transparent)
    """
    # Read image
    if isinstance(image_path, str):
        image = cv2.imread(image_path)
    else:
        # If it's already a PIL Image
        image = cv2.cvtColor(np.array(image_path), cv2.COLOR_RGB2BGR)

    if image is None:
        raise ValueError(f"Failed to load image from {image_path}")

    # Assign colors to all elements at once
    element_colors = assign_colors_to_elements(len(layout_results))

    # Create overlay
    overlay = image.copy()

    # Draw each layout element
    for idx, layout_res in enumerate(layout_results):
        # Skip entries without a bbox instead of aborting before the image is saved
        if "bbox" not in layout_res:
            continue
        bbox, label, reading_order, tags = layout_res["bbox"], layout_res["label"], layout_res["reading_order"], layout_res["tags"]
        x1, y1, x2, y2 = bbox

        # Get color for this element (assigned by order, not by label)
        color = element_colors[idx]

        # Draw filled rectangle with transparency
        cv2.rectangle(overlay, (x1, y1), (x2, y2), color, -1)

        # Draw border
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 3)

        # Add label text with background at the top-left corner (outside the box)
        label_text = f"{reading_order}: {label} | {tags}"
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 0.5
        thickness = 1

        # Get text size
        (text_width, text_height), baseline = cv2.getTextSize(
            label_text, font, font_scale, thickness
        )

        # Position text above the box (outside)
        text_x = x1
        text_y = y1 - 5  # 5 pixels above the box

        # If text would go outside the image at the top, put it inside the box instead
        if text_y - text_height < 0:
            text_y = y1 + text_height + 5

        # Draw text background
        cv2.rectangle(
image, (text_x - 2, text_y - text_height - 2), (text_x + text_width + 2, text_y + baseline + 2), (255, 255, 255), -1 ) # Draw text cv2.putText( image, label_text, (text_x, text_y), font, font_scale, (0, 0, 0), thickness ) # Blend the overlay with the original image result = cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0) # Save the result cv2.imwrite(save_path, result) # print(f"Layout visualization saved to {save_path}") def get_color_palette(): """Get a visually pleasing color palette for layout visualization Returns: List of BGR color tuples (semi-transparent, good for overlay) """ # Carefully selected color palette with good visual distinction # Colors are chosen to be light, pleasant, and distinguishable color_palette = [ (200, 255, 255), # Light cyan (255, 200, 255), # Light magenta (255, 255, 200), # Light yellow (200, 255, 200), # Light green (255, 220, 200), # Light orange (220, 200, 255), # Light purple (200, 240, 255), # Light sky blue (255, 240, 220), # Light peach (220, 255, 240), # Light mint (255, 220, 240), # Light pink (240, 255, 200), # Light lime (240, 220, 255), # Light lavender (200, 255, 240), # Light turquoise (255, 240, 200), # Light apricot (220, 240, 255), # Light periwinkle (255, 200, 220), # Light rose (220, 255, 220), # Light jade (255, 230, 200), # Light salmon (210, 230, 255), # Light cornflower (255, 210, 230), # Light carnation ] return color_palette def assign_colors_to_elements(num_elements): """Assign colors to elements in order Args: num_elements: Number of elements to assign colors to Returns: List of color tuples, one for each element """ palette = get_color_palette() colors = [] for i in range(num_elements): # Cycle through the palette if we have more elements than colors color_idx = i % len(palette) colors.append(palette[color_idx]) return colors def resize_img(image, max_size=1600, min_size=28): width, height = image.size if max(width, height) < max_size and min(width, height) >= 28: return image if max(width, height) 
> max_size: if width > height: new_width = max_size new_height = int(height * (max_size / width)) else: new_height = max_size new_width = int(width * (max_size / height)) image = image.resize((new_width, new_height)) width, height = image.size if min(width, height) < 28: if width < height: new_width = min_size new_height = int(height * (min_size / width)) else: new_height = min_size new_width = int(width * (min_size / height)) image = image.resize((new_width, new_height)) return image def calculate_iou_matrix(boxes): """Vectorized IoU matrix calculation [N, N] Args: boxes: List of bounding boxes in [x1, y1, x2, y2] format Returns: numpy.ndarray: IoU matrix of shape [N, N] """ boxes = np.array(boxes) # [N, 4] # Calculate areas areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) # [N] # Broadcast to calculate intersection lt = np.maximum(boxes[:, None, :2], boxes[None, :, :2]) # [N, N, 2] rb = np.minimum(boxes[:, None, 2:], boxes[None, :, 2:]) # [N, N, 2] wh = np.clip(rb - lt, 0, None) # [N, N, 2] inter = wh[:, :, 0] * wh[:, :, 1] # [N, N] # Calculate IoU union = areas[:, None] + areas[None, :] - inter iou = inter / np.clip(union, 1e-6, None) return iou def check_bbox_overlap(layout_results_list, image, iou_threshold=0.1, overlap_box_ratio=0.25): """Check if bounding boxes have significant overlaps, indicating a distorted/photographed document If more than 60% of boxes have overlaps (IoU > threshold with at least 1 other box), treat as photographed document. 
Args: layout_results_list: List of (bbox, label, tags) tuples image: PIL Image object iou_threshold: IoU threshold to consider two boxes as overlapping (default: 0.3) overlap_box_ratio: Ratio threshold of boxes with overlaps (default: 0.6, i.e., 60%) Returns: bool: True if significant overlap detected (should treat as distorted_page) """ if len(layout_results_list) <= 1: return False # Convert to absolute coordinates bboxes = [] for bbox, label, tags in layout_results_list: x1, y1, x2, y2 = process_coordinates(bbox, image) bboxes.append([x1, y1, x2, y2]) # Vectorized IoU matrix calculation iou_matrix = calculate_iou_matrix(bboxes) # Check if each box has overlap with any other box (excluding itself) overlap_mask = iou_matrix > iou_threshold np.fill_diagonal(overlap_mask, False) # Exclude self has_overlap = overlap_mask.any(axis=1) # Whether each box has overlap # Count boxes with overlaps overlap_count = has_overlap.sum() total_boxes = len(bboxes) overlap_ratio = overlap_count / total_boxes # print(f"Overlap detection: {overlap_count}/{total_boxes} boxes have overlaps (ratio: {overlap_ratio:.2%})") # If more than 60% boxes have overlaps, treat as photographed document if overlap_ratio > overlap_box_ratio: print(f"⚠️ High overlap detected ({overlap_ratio:.2%} > {overlap_box_ratio:.2%}), treating as distorted/photographed document") return True return False if __name__ == "__main__": bbox_str = "[210,136,910,172][sec_0][PAIR_SEP][202,217,921,325][para][author][PAIR_SEP][520,341,604,367][para][PAIR_SEP][290,404,384,432][sec_1][paper_abstract][PAIR_SEP][156,448,520,723][para][paper_abstract][PAIR_SEP][125,740,290,768][sec_1][PAIR_SEP][125,781,552,1143][para][PAIR_SEP][125,1144,552,1400][para][RELATION_SEP][573,406,1000,561][para][PAIR_SEP][573,581,1001,943][para][PAIR_SEP][573,962,1001,1222][para][PAIR_SEP][573,1241,1001,1475][para][PAIR_SEP][126,1410,551,1470][fnote][PAIR_SEP][21,499,63,1163][watermark][meta_num]" print(parse_layout_string(bbox_str))
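
The overlap heuristic in `check_bbox_overlap` can be illustrated with a dependency-free sketch. This is not the file's NumPy implementation: the helper names `iou` and `overlap_ratio` and the sample boxes are hypothetical, chosen only to show how the per-box overlap share is compared against the 0.1 IoU threshold and the 0.25 ratio threshold used as defaults above.

```python
def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format (illustrative helper)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / max(union, 1e-6)  # guard against zero-area boxes


def overlap_ratio(boxes, iou_threshold=0.1):
    """Share of boxes whose IoU with at least one other box exceeds the threshold."""
    if len(boxes) <= 1:
        return 0.0
    flagged = 0
    for i, a in enumerate(boxes):
        # Compare against every other box, excluding the box itself
        if any(iou(a, b) > iou_threshold for j, b in enumerate(boxes) if j != i):
            flagged += 1
    return flagged / len(boxes)


# Made-up sample: the first two boxes overlap, the last two are isolated
boxes = [
    [0, 0, 100, 100],
    [50, 0, 150, 100],
    [200, 200, 300, 300],
    [400, 400, 500, 500],
]
ratio = overlap_ratio(boxes)
print(ratio)         # share of boxes that overlap some other box
print(ratio > 0.25)  # would this page be flagged under the default ratio?
```

In this sample two of the four boxes overlap each other, so the ratio exceeds the default `overlap_box_ratio` and the page would be treated as distorted. The vectorized version in `calculate_iou_matrix` computes the same pairwise IoU values for all boxes at once via broadcasting.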