[ModelScope](https://www.modelscope.cn/models/OpenDataLab/PDF-Extract-Kit-1.0)
🔥🔥🔥 [MinerU: Efficient Document Content Extraction Tool Based on PDF-Extract-Kit](https://github.com/opendatalab/MinerU)
👋 join us on Discord and WeChat
## Overview

`PDF-Extract-Kit` is a powerful open-source toolkit designed to efficiently extract high-quality content from complex and diverse PDF documents. Here are its main features and advantages:

- **Integration of Leading Document Parsing Models**: Incorporates state-of-the-art models for layout detection, formula detection, formula recognition, OCR, and other core document parsing tasks.
- **High-Quality Parsing Across Diverse Documents**: Fine-tuned with diverse document annotation data to deliver high-quality results across various complex document types.
- **Modular Design**: The flexible modular design allows users to easily combine and construct various applications by modifying configuration files and minimal code, making application building as straightforward as stacking blocks.
- **Comprehensive Evaluation Benchmarks**: Provides diverse and comprehensive PDF evaluation benchmarks, enabling users to choose the most suitable model based on evaluation results.

**Experience PDF-Extract-Kit now and unlock the limitless potential of PDF documents!**

> **Note:** PDF-Extract-Kit is designed for high-quality document processing and functions as a model toolbox.
> If you are interested in extracting high-quality document content (e.g., converting PDFs to Markdown), please use [MinerU](https://github.com/opendatalab/MinerU), which combines the high-quality predictions from PDF-Extract-Kit with specialized engineering optimizations for more convenient and efficient content extraction.
> If you're a developer looking to create engaging applications such as document translation, document Q&A, or document assistants, you'll find it very convenient to build your own projects using PDF-Extract-Kit. In particular, we will periodically update the `PDF-Extract-Kit/project` directory with interesting applications, so stay tuned!

**We welcome researchers and engineers from the community to contribute outstanding models and innovative applications by submitting PRs to become contributors to the PDF-Extract-Kit project.**

## Model Overview

| **Task Type** | **Description** | **Models** |
|---|---|---|
| **Layout Detection** | Locate different elements in a document, including images, tables, text, titles, and formulas | `DocLayout-YOLO_ft`, `YOLO-v10_ft`, `LayoutLMv3_ft` |
| **Formula Detection** | Locate formulas in documents, including inline and block formulas | `YOLOv8_ft` |
| **Formula Recognition** | Recognize formula images into LaTeX source code | `UniMERNet` |
| **OCR** | Extract text content from images (including location and recognition) | `PaddleOCR` |
| **Table Recognition** | Recognize table images into corresponding source code (LaTeX/HTML/Markdown) | `PaddleOCR+TableMaster`, `StructEqTable` |
| **Reading Order** | Sort and concatenate discrete text paragraphs | Coming Soon! |

## News and Updates

- `2024.10.22` 🎉🎉🎉 We are excited to announce that the table recognition model [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B), which supports LaTeX, HTML, and Markdown output, has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [table recognition algorithm documentation](https://pdf-extract-kit.readthedocs.io/en/latest/algorithm/table_recognition.html) for usage instructions!
- `2024.10.17` 🎉🎉🎉 We are excited to announce that the more accurate and faster layout detection model, [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO), has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [layout detection algorithm documentation](https://pdf-extract-kit.readthedocs.io/en/latest/algorithm/layout_detection.html) for usage instructions!
- `2024.10.10` 🎉🎉🎉 The official release of `PDF-Extract-Kit 1.0`, rebuilt with modularity for more convenient and flexible model usage! Please switch to the [release/0.1.1](https://github.com/opendatalab/PDF-Extract-Kit/tree/release/0.1.1) branch for the old version.
- `2024.08.01` 🎉🎉🎉 Added the [StructEqTable](demo/TabRec/StructEqTable/README_TABLE.md) module for table content extraction. Give it a try!
- `2024.07.01` 🎉🎉🎉 We released `PDF-Extract-Kit`, a comprehensive toolkit for high-quality PDF content extraction, including `Layout Detection`, `Formula Detection`, `Formula Recognition`, and `OCR`.

## Performance Demonstration

Many current open-source SOTA models are trained and evaluated on academic datasets and achieve high-quality results only on a single document type. To obtain stable, robust, high-quality results on diverse documents, we constructed diverse fine-tuning datasets and fine-tuned several SOTA models to obtain practical parsing models. Below are some visual results of the models.

### Layout Detection

We trained robust `Layout Detection` models using diverse PDF document annotations. Our fine-tuned models achieve accurate extraction results on diverse PDF documents such as papers, textbooks, research reports, and financial reports, and are robust to challenges like blurring and watermarks. The visualization example below shows the inference results of the fine-tuned LayoutLMv3 model.

### Formula Detection

Similarly, we collected and annotated documents containing formulas in both English and Chinese, and fine-tuned advanced formula detection models. The visualization below shows the inference results of the fine-tuned YOLO formula detection model:

### Formula Recognition

[UniMERNet](https://github.com/opendatalab/UniMERNet) is an algorithm designed for diverse formula recognition in real-world scenarios.
By constructing large-scale training data and carefully designed results, it achieves excellent recognition performance for complex long formulas, handwritten formulas, and noisy screenshot formulas.

### Table Recognition

[StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) is a high-efficiency toolkit that converts table images into LaTeX/HTML/Markdown. The latest version, powered by the InternVL2-1B foundation model, improves Chinese recognition accuracy and expands multi-format output options.

#### For more visual and inference results of the models, please refer to the [PDF-Extract-Kit tutorial documentation](xxx).

## Evaluation Metrics

Coming Soon!

## Usage Guide

### Environment Setup

```bash
conda create -n pdf-extract-kit-1.0 python=3.10
conda activate pdf-extract-kit-1.0
pip install -r requirements.txt
```

> **Note:** If your device does not support GPU, please install the CPU version dependencies using `requirements-cpu.txt` instead of `requirements.txt`.
>
> **Note:** Currently, DocLayout-YOLO can only be installed from PyPI. If an error occurs during the DocLayout-YOLO installation, please install it with `pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple`.

### Model Download

Please refer to the [Model Weights Download Tutorial](https://pdf-extract-kit.readthedocs.io/en/latest/get_started/pretrained_model.html) to download the required model weights. Note: You can choose to download all the weights or select specific ones. For detailed instructions, please refer to the tutorial.

### Running Demos

#### Layout Detection Model

```bash
python scripts/layout_detection.py --config=configs/layout_detection.yaml
```

Layout detection models support **DocLayout-YOLO** (default model), YOLO-v10, and LayoutLMv3. For YOLO-v10 and LayoutLMv3, please refer to [Layout Detection Algorithm](https://pdf-extract-kit.readthedocs.io/en/latest/algorithm/layout_detection.html).
You can view the layout detection results in the `outputs/layout_detection` folder.

#### Formula Detection Model

```bash
python scripts/formula_detection.py --config=configs/formula_detection.yaml
```

You can view the formula detection results in the `outputs/formula_detection` folder.

#### OCR Model

```bash
python scripts/ocr.py --config=configs/ocr.yaml
```

You can view the OCR results in the `outputs/ocr` folder.

#### Formula Recognition Model

```bash
python scripts/formula_recognition.py --config=configs/formula_recognition.yaml
```

You can view the formula recognition results in the `outputs/formula_recognition` folder.

#### Table Recognition Model

```bash
python scripts/table_parsing.py --config configs/table_parsing.yaml
```

You can view the table recognition results in the `outputs/table_parsing` folder.

> **Note:** For more details on using the models, please refer to the [PDF-Extract-Kit-1.0 Tutorial](https://pdf-extract-kit.readthedocs.io/en/latest/get_started/pretrained_model.html).
>
> This project focuses on using models for `high-quality` content extraction from `diverse` documents and does not involve reconstructing extracted content into new documents, such as PDF to Markdown. For such needs, please refer to our other GitHub project: [MinerU](https://github.com/opendatalab/MinerU).

## To-Do List

- [x] **Table Parsing**: Develop functionality to convert table images into corresponding LaTeX/Markdown format source code.
- [ ] **Chemical Equation Detection**: Implement automatic detection of chemical equations.
- [ ] **Chemical Equation/Diagram Recognition**: Develop models to recognize and parse chemical equations and diagrams.
- [ ] **Reading Order Sorting Model**: Build a model to determine the correct reading order of text in documents.

**PDF-Extract-Kit** aims to provide high-quality PDF content extraction capabilities.
We encourage the community to propose specific and valuable needs and welcome everyone to participate in continuously improving the PDF-Extract-Kit tool to advance research and industry development.

## License

This project is open-sourced under the [AGPL-3.0](LICENSE) license.

Since this project uses YOLO code and PyMuPDF for file processing, these components require compliance with the AGPL-3.0 license. Therefore, to ensure adherence to the licensing requirements of these dependencies, this repository as a whole adopts the AGPL-3.0 license.

## Acknowledgement

- [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3): Layout detection model
- [UniMERNet](https://github.com/opendatalab/UniMERNet): Formula recognition model
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy): Table recognition model
- [YOLO](https://github.com/ultralytics/ultralytics): Formula detection model
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR): OCR model
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO): Layout detection model

## Citation

If you find our models / code / papers useful in your research, please consider giving ⭐ and citations 📝, thx :)

```bibtex
@article{wang2024mineru,
  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@misc{zhao2024doclayoutyoloenhancingdocumentlayout,
  title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception},
  author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},
  year={2024},
  eprint={2410.12628},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.12628},
}

@misc{wang2024unimernet,
  title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
  author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
  year={2024},
  eprint={2404.15254},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}
```

## Star History
🔥🔥🔥 [MinerU: An Efficient Document Content Extraction Tool Based on PDF-Extract-Kit](https://github.com/opendatalab/MinerU)
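## Overview

`PDF-Extract-Kit` is a powerful open-source toolbox designed to efficiently extract high-quality content from complex and diverse PDF documents. Its main features and advantages are:

- **Integration of Mainstream Document Parsing Models**: Brings together many SOTA models for core document parsing tasks such as layout detection, formula detection, formula recognition, and OCR.
- **High-Quality Parsing Across Diverse Documents**: Models are fine-tuned on diverse document annotation data to deliver high-quality parsing results on complex and diverse documents.
- **Modular Design**: The modular design lets users freely combine and build applications by modifying configuration files and a small amount of code, making application building as simple as stacking blocks.
- **Comprehensive Evaluation Benchmarks**: Provides diverse and comprehensive PDF evaluation benchmarks so that users can choose the model that best suits them based on evaluation results.

**Experience PDF-Extract-Kit now and unlock the limitless potential of PDF documents!**

> **Note:** PDF-Extract-Kit focuses on high-quality document processing and is best used as a model toolbox.
> If you want to extract high-quality document content (PDF to Markdown), please use [MinerU](https://github.com/opendatalab/MinerU) directly. MinerU combines the high-quality predictions of PDF-Extract-Kit with dedicated engineering optimizations, making PDF content extraction more convenient and efficient.
> If you are a developer who wants to build more interesting applications (such as document translation, document Q&A, or document assistants), building them yourself on top of PDF-Extract-Kit is very convenient. In particular, we will periodically add interesting applications under `PDF-Extract-Kit/project`, so stay tuned!

**We welcome researchers and engineers from the community to contribute outstanding models and innovative applications and to become contributors to PDF-Extract-Kit by submitting PRs.**

## Model Overview

| **Task Type** | **Description** | **Models** |
|---|---|---|
| **Layout Detection** | Locate different elements in a document, including images, tables, text, titles, formulas, and so on | `DocLayout-YOLO_ft`, `YOLO-v10_ft`, `LayoutLMv3_ft` |
| **Formula Detection** | Locate formulas in a document, including inline and block formulas | `YOLOv8_ft` |
| **Formula Recognition** | Recognize formula images into LaTeX source code | `UniMERNet` |
| **OCR** | Extract text content from images (including location and recognition) | `PaddleOCR` |
| **Table Recognition** | Recognize table images into corresponding source code (LaTeX/HTML/Markdown) | `PaddleOCR+TableMaster`, `StructEqTable` |
| **Reading Order** | Sort and concatenate discrete text paragraphs | Coming Soon! |

## News and Updates

- `2024.10.22` 🎉🎉🎉 The table model [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B), which supports multiple output formats such as LaTeX and HTML, has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [table recognition algorithm documentation](https://pdf-extract-kit.readthedocs.io/zh-cn/latest/algorithm/table_recognition.html) for usage!
- `2024.10.17` 🎉🎉🎉 The more accurate and faster layout detection model `DocLayout-YOLO` has been officially integrated into `PDF-Extract-Kit 1.0`. Please refer to the [layout detection algorithm documentation](https://pdf-extract-kit.readthedocs.io/zh-cn/latest/algorithm/layout_detection.html) for usage!
- `2024.10.10` 🎉🎉🎉 The official version of `PDF-Extract-Kit 1.0`, refactored around a modular design, has been released, making model usage more convenient and flexible! For the old version, please switch to the [release/0.1.1](https://github.com/opendatalab/PDF-Extract-Kit/tree/release/0.1.1) branch.
- `2024.08.01` 🎉🎉🎉 Added the [StructEqTable](demo/TabRec/StructEqTable/README_TABLE.md) table recognition module for table content extraction. Give it a try!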
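- `2024.07.01` 🎉🎉🎉 We released `PDF-Extract-Kit`, a comprehensive toolkit for high-quality PDF content extraction, including `Layout Detection`, `Formula Detection`, `Formula Recognition`, and `OCR`.

## Performance Demonstration

Many current open-source SOTA models are trained and evaluated on academic datasets and achieve high-quality results only on a single document type. To obtain stable, robust, high-quality results on diverse documents, we built diverse fine-tuning datasets and fine-tuned several SOTA models to obtain practical parsing models. Below are some visualized model results.

### Layout Detection

Using diverse PDF document annotations, we trained robust `Layout Detection` models. On diverse PDF documents such as papers, textbooks, research reports, and financial reports, our fine-tuned models produce accurate extraction results and are also fairly robust to blurred scans and watermarks. The visualization example below shows the inference results of the fine-tuned LayoutLMv3 model.

### Formula Detection

Similarly, we collected and annotated Chinese and English documents containing formulas and fine-tuned an advanced formula detection model. The visualization below shows the inference results of the fine-tuned YOLO formula detection model:

### Formula Recognition

[UniMERNet](https://github.com/opendatalab/UniMERNet) is an algorithm for diverse formula recognition in real-world scenarios. By constructing large-scale training data and carefully designed results, it achieves good recognition of complex long formulas, handwritten formulas, and noisy screenshot formulas.

### Table Recognition

[StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) is an efficient table content extraction tool that converts table images into LaTeX/HTML/Markdown. The latest version uses the InternVL2-1B foundation model, which improves Chinese recognition accuracy and adds multi-format output.

#### For more visualization and inference results of the models, see the [PDF-Extract-Kit tutorial documentation](xxx)

## Evaluation Metrics

Coming Soon!

## Usage Guide

### Environment Setup

```bash
conda create -n pdf-extract-kit-1.0 python=3.10
conda activate pdf-extract-kit-1.0
pip install -r requirements.txt
```

> **Note:** If your device does not support GPU, please use `requirements-cpu.txt` to install the CPU version of the dependencies.
>
> **Note:** Currently, doclayout-yolo can only be installed from PyPI. If doclayout-yolo fails to install, please install it with `pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple`.

### Model Download

Refer to the [Model Weights Download Tutorial](https://pdf-extract-kit.readthedocs.io/zh-cn/latest/get_started/pretrained_model.html) to download the required model weights. Note: you can download all of the weights or only some of them; see the tutorial for details.

### Running Demos

#### Layout Detection Model

```bash
python scripts/layout_detection.py --config=configs/layout_detection.yaml
```

Layout detection supports **DocLayout-YOLO** (default model), YOLO-v10, and LayoutLMv3. For layout detection with YOLO-v10 and LayoutLMv3, please refer to [Layout Detection Algorithm](https://pdf-extract-kit.readthedocs.io/zh-cn/latest/algorithm/layout_detection.html). You can view the layout detection results in the `outputs/layout_detection` folder.

#### Formula Detection Model

```bash
python scripts/formula_detection.py --config=configs/formula_detection.yaml
```

You can view the formula detection results in the `outputs/formula_detection` folder.

#### Text Recognition (OCR) Model

```bash
python scripts/ocr.py --config=configs/ocr.yaml
```

You can view the OCR results in the `outputs/ocr` folder.

#### Formula Recognition Model

```bash
python scripts/formula_recognition.py --config=configs/formula_recognition.yaml
```

You can view the formula recognition results in the `outputs/formula_recognition` folder.

#### Table Recognition Model

```bash
python scripts/table_parsing.py --config configs/table_parsing.yaml
```

You can view the table content recognition results in the `outputs/table_parsing` folder.

> **Note:** For more details on using the models, see the [PDF-Extract-Kit-1.0 Chinese tutorial](https://pdf-extract-kit.readthedocs.io/zh-cn/latest/get_started/pretrained_model.html).
>
> This project focuses on using models for `high-quality` content extraction from `diverse` documents and does not cover assembling the extracted content into new documents, such as PDF to Markdown. For such needs, please refer to our other GitHub project: [MinerU](https://github.com/opendatalab/MinerU)

## To-Do List

- [x] **Table Parsing**: Develop functionality to convert table images into corresponding LaTeX/Markdown format source code.
- [ ] **Chemical Equation Detection**: Implement automatic detection of chemical equations.
- [ ] **Chemical Equation/Diagram Recognition**: Develop models to recognize and parse chemical equations.
- [ ] **Reading Order Sorting Model**: Build a model to determine the correct reading order of text in documents.

**PDF-Extract-Kit** aims to provide high-quality extraction capabilities for PDF files. We encourage the community to propose specific and valuable needs and welcome everyone to take part in continuously improving the PDF-Extract-Kit tool to advance research and industry.

## License

This project is open-sourced under the [AGPL-3.0](LICENSE) license.

Because this project uses YOLO code and PyMuPDF for file processing, and these components require compliance with the AGPL-3.0 license, this repository as a whole adopts the AGPL-3.0 license to meet the licensing requirements of these dependencies.

## Acknowledgement

- [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3): Layout detection model
- [UniMERNet](https://github.com/opendatalab/UniMERNet): Formula recognition model
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy): Table recognition model
- [YOLO](https://github.com/ultralytics/ultralytics): Formula detection model
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR): OCR model
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO): Layout detection model

## Citation

If you find our models / code / technical reports helpful, please give us a ⭐ and a citation 📝, thanks :)

```bibtex
@article{wang2024mineru,
  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@misc{wang2024unimernet,
  title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
  author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
  year={2024},
  eprint={2404.15254},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{zhao2024doclayoutyoloenhancingdocumentlayout,
  title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception},
  author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},
  year={2024},
  eprint={2410.12628},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.12628},
}

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}
```

## Star History

| Model | Description | Characteristics | Model weight | Config file |
|---|---|---|---|---|
| DocLayout-YOLO | Improved from YOLO-v10: 1. Generates diverse pre-training data to enhance generalization across multiple document types 2. Improves the model architecture to better perceive scale-varying instances. Details in DocLayout-YOLO | Speed: Fast, Accuracy: High | doclayout_yolo_ft.pt | layout_detection.yaml |
| YOLO-v10 | Base YOLO-v10 model | Speed: Fast, Accuracy: Moderate | yolov10l_ft.pt | layout_detection_yolo.yaml |
| LayoutLMv3 | Base LayoutLMv3 model | Speed: Slow, Accuracy: High | layoutlmv3_ft | layout_detection_layoutlmv3.yaml |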
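
The table above pairs each layout model with a weight file and a config file. As a rough sketch, the mapping can be written down in Python to build the demo command for a chosen model. The helper `layout_command` and the `config_dir` parameter are hypothetical names introduced here for illustration, and the sketch assumes all three config files live in `configs/`; the Running Demos section only confirms that location for `layout_detection.yaml`.

```python
import shlex

# Model name -> (weight file, config file), copied from the table above.
LAYOUT_MODELS = {
    "DocLayout-YOLO": ("doclayout_yolo_ft.pt", "layout_detection.yaml"),
    "YOLO-v10": ("yolov10l_ft.pt", "layout_detection_yolo.yaml"),
    "LayoutLMv3": ("layoutlmv3_ft", "layout_detection_layoutlmv3.yaml"),
}


def layout_command(model_name, config_dir="configs"):
    """Return the demo command line for the chosen layout model.

    Assumption: every config file sits in `config_dir`. This is only
    documented for layout_detection.yaml, so adjust the path if needed.
    """
    if model_name not in LAYOUT_MODELS:
        raise ValueError(f"unknown layout model: {model_name!r}")
    _weights, config_file = LAYOUT_MODELS[model_name]
    argv = [
        "python",
        "scripts/layout_detection.py",
        f"--config={config_dir}/{config_file}",
    ]
    # shlex.join quotes arguments only when they need it.
    return shlex.join(argv)


if __name__ == "__main__":
    for name in LAYOUT_MODELS:
        print(f"{name}: {layout_command(name)}")
```

Keeping the mapping in one place makes it easy to compare the speed and accuracy trade-offs listed in the table by running the same input through each model in turn.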
High-Quality Document Parsing Toolkit

Tutorial
-------------

.. toctree::
   :maxdepth: 2
   :caption: Getting Started

   get_started/installation.rst
   get_started/pretrained_model.rst
   get_started/quickstart.rst

.. toctree::
   :maxdepth: 2
   :caption: Core Algorithm Modules

   algorithm/layout_detection.rst
   algorithm/formula_detection.rst
   algorithm/formula_recognition.rst
   algorithm/ocr.rst
   algorithm/table_recognition.rst
   algorithm/reading_order.rst

.. toctree::
   :maxdepth: 2
   :caption: Task Extensions

   task_extend/code.rst
   task_extend/doc.rst
   task_extend/evaluation.rst

.. toctree::
   :maxdepth: 2
   :caption: Supported Models

   models/supported.md

.. toctree::
   :maxdepth: 2
   :caption: Model Performance Evaluation

   evaluation/layout_detection.rst
   evaluation/formula_detection.rst
   evaluation/formula_recognition.rst
   evaluation/ocr.rst
   evaluation/table_recognition.rst
   evaluation/reading_order.rst
   evaluation/pdf_extract.rst

.. toctree::
   :maxdepth: 2
   :caption: PDF Projects

   project/pdf_extract.md
   project/doc_translate.md
   project/speed_up.md

================================================
FILE: docs/en/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
    set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
    echo.
    echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
    echo.installed, then set the SPHINXBUILD environment variable to point
    echo.to the full path of the 'sphinx-build' executable. Alternatively you
    echo.may add the Sphinx directory to PATH.
    echo.
    echo.If you don't have Sphinx installed, grab it from
    echo.https://www.sphinx-doc.org/
    exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd

================================================
FILE: docs/en/models/supported.md
================================================
# The Supported Models

================================================
FILE: docs/en/notes/changelog.md
================================================
# Changelog

## v1.0.0 (2024-10-10)

The PDF-Extract-Kit-1.0 has been refactored with a more streamlined and user-friendly modular design! 🔥🔥🔥

## v0.1.0 (2024-07-01)

Official release of PDF-Extract-Kit! 🔥🔥🔥

### Highlights

- PDF-Extract-Kit-1.0 offers a high-quality layout detection model, DocLayout-YOLO.

================================================
FILE: docs/en/project/doc_translate.rst
================================================
============================
Document Translation Project
============================

XXXX XXXX

================================================
FILE: docs/en/project/pdf_extract.rst
================================================
===================================
Document Content Extraction Project
===================================

Introduction
====================

Document content extraction aims to extract all the information in a document file and convert it into a machine-readable result (such as a Markdown file). Its subtasks include layout detection, formula detection, formula recognition, OCR, and other tasks.

Project Usage
====================

With the environment properly set up, run the project by executing ``project/pdf2markdown/scripts/run_project.py``.

.. code:: shell

   $ python project/pdf2markdown/scripts/run_project.py --config project/pdf2markdown/configs/pdf2markdown.yaml

Project Configuration
---------------------

.. code:: yaml

   inputs: assets/demo/formula_detection
   outputs: outputs/pdf2markdown
   visualize: True
   merge2markdown: True
   tasks:
     layout_detection:
       model: layout_detection_yolo
       model_config:
         img_size: 1024
         conf_thres: 0.25
         iou_thres: 0.45
         model_path: models/Layout/YOLO/doclayout_yolo_ft.pt
     formula_detection:
       model: formula_detection_yolo
       model_config:
         img_size: 1280
         conf_thres: 0.25
         iou_thres: 0.45
         batch_size: 1
         model_path: models/MFD/YOLO/yolo_v8_ft.pt
     formula_recognition:
       model: formula_recognition_unimernet
       model_config:
         batch_size: 128
         cfg_path: pdf_extract_kit/configs/unimernet.yaml
         model_path: models/MFR/unimernet_tiny
     ocr:
       model: ocr_ppocr
       model_config:
         lang: ch
         show_log: True
         det_model_dir: models/OCR/PaddleOCR/det/ch_PP-OCRv4_det
         rec_model_dir: models/OCR/PaddleOCR/rec/ch_PP-OCRv4_rec
         det_db_box_thresh: 0.3

- inputs/outputs: Define the input path and the output path, respectively.
- visualize: Whether to visualize the project results. Visualized results are saved in the outputs directory.
- merge2markdown: Whether to merge the results into Markdown documents. Only simple single-column text is supported. For Markdown conversion of documents with more complex layouts, please refer to `MinerU

| Model | Description | Characteristics | Model weights | Config file |
|---|---|---|---|---|
| DocLayout-YOLO | Improved from the YOLO-v10 model: 1. Generates diverse pre-training data to improve generalization across document types 2. Improves the model architecture to better perceive multi-scale targets. See DocLayout-YOLO for details | Fast, high accuracy | doclayout_yolo_ft.pt | layout_detection.yaml |
| YOLO-v10 | Base YOLO-v10 model | Fast, moderate accuracy | yolov10l_ft.pt | layout_detection_yolo.yaml |
| LayoutLMv3 | Base LayoutLMv3 model | Slow, fairly good accuracy | layoutlmv3_ft | layout_detection_layoutlmv3.yaml |
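
For the YOLO-style layout models in the table above, the project configuration sets `conf_thres: 0.25` and `iou_thres: 0.45`. The sketch below illustrates what these two thresholds conventionally control in YOLO-style post-processing: a confidence filter followed by greedy non-maximum suppression. It is an illustration under that assumption, not the toolkit's own code; the function names `iou` and `filter_detections` are hypothetical, and the real pipeline may differ (for example, by applying suppression per class).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thres=0.25, iou_thres=0.45):
    """Drop low-confidence boxes, then apply greedy NMS.

    `dets` is a list of (box, score) pairs. Defaults mirror the values
    in the project configuration; the semantics are the conventional
    ones and are assumed here, not taken from the toolkit.
    """
    # Step 1: keep only boxes at or above the confidence threshold.
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    # Step 2: greedily keep the best box and suppress heavy overlaps.
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_thres for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    detections = [
        ((0, 0, 10, 10), 0.9),    # strong detection
        ((1, 1, 10, 10), 0.8),    # overlaps the first box heavily
        ((20, 20, 30, 30), 0.6),  # separate region
        ((40, 40, 50, 50), 0.1),  # below the confidence threshold
    ]
    for box, score in filter_detections(detections):
        print(box, score)
```

Under these semantics, raising `conf_thres` trades recall for precision, while lowering `iou_thres` suppresses overlapping boxes more aggressively.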
High-Quality Document Parsing Toolbox

Documentation
-------------

.. toctree::
   :maxdepth: 2
   :caption: Getting Started

   get_started/installation.rst
   get_started/pretrained_model.rst
   get_started/quickstart.rst

.. toctree::
   :maxdepth: 2
   :caption: Basic Algorithm Modules

   algorithm/layout_detection.rst
   algorithm/formula_detection.rst
   algorithm/formula_recognition.rst
   algorithm/ocr.rst
   algorithm/table_recognition.rst
   algorithm/reading_order.rst

.. toctree::
   :maxdepth: 2
   :caption: New Task Extensions

   task_extend/code.rst
   task_extend/doc.rst
   task_extend/evaluation.rst

.. toctree::
   :maxdepth: 2
   :caption: Supported Models

   models/supported.md

.. toctree::
   :maxdepth: 2
   :caption: Model Performance Evaluation

   evaluation/layout_detection.rst
   evaluation/formula_detection.rst
   evaluation/formula_recognition.rst
   evaluation/ocr.rst
   evaluation/table_recognition.rst
   evaluation/reading_order.rst
   evaluation/pdf_extract.rst

.. toctree::
   :maxdepth: 2
   :caption: PDF Projects

   project/pdf_extract.md
   project/doc_translate.md
   project/speed_up.md

================================================
FILE: docs/zh_cn/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
    set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
    echo.
    echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
    echo.installed, then set the SPHINXBUILD environment variable to point
    echo.to the full path of the 'sphinx-build' executable. Alternatively you
    echo.may add the Sphinx directory to PATH.
    echo.
    echo.If you don't have Sphinx installed, grab it from
    echo.https://www.sphinx-doc.org/
    exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd

================================================
FILE: docs/zh_cn/models/supported.md
================================================
# Supported Models

Coming soon!

================================================
FILE: docs/zh_cn/notes/changelog.md
================================================
# Changelog

## v0.2.0 (2024.09.30)

The PDF-Extract-Kit code has been refactored, with a simpler and easier-to-use modular design! 🔥🔥🔥

## v0.1.0 (2024.07.01)

Official release of PDF-Extract-Kit! 🔥🔥🔥

### Highlights

- PDF-Extract-Kit provides a high-quality layout detection model, DocLayout-YOLO
- PDF-Extract-Kit provides a high-quality formula detection model, YOLOv8

================================================
FILE: docs/zh_cn/project/doc_translate.rst
================================================
============================
Document Translation Project
============================

Coming soon!

================================================
FILE: docs/zh_cn/project/pdf_extract.rst
================================================
===================================
Document Content Extraction Project
===================================

Introduction
====================

Document content extraction uses layout detection, formula detection, formula recognition, OCR, and other models to extract the information in a document and convert it into Markdown text.

Project Usage
====================

With the environment set up, run the document content extraction project by executing ``project/pdf2markdown/scripts/run_project.py``.

.. code:: shell

   $ python project/pdf2markdown/scripts/run_project.py --config project/pdf2markdown/configs/pdf2markdown.yaml

Project Configuration
---------------------

.. code:: yaml

   inputs: assets/demo/formula_detection
   outputs: outputs/pdf2markdown
   visualize: True
   merge2markdown: True
   tasks:
     layout_detection:
       model: layout_detection_yolo
       model_config:
         img_size: 1024
         conf_thres: 0.25
         iou_thres: 0.45
         model_path: models/Layout/YOLO/doclayout_yolo_ft.pt
     formula_detection:
       model: formula_detection_yolo
       model_config:
         img_size: 1280
         conf_thres: 0.25
         iou_thres: 0.45
         batch_size: 1
         model_path: models/MFD/YOLO/yolo_v8_ft.pt
     formula_recognition:
       model: formula_recognition_unimernet
       model_config:
         batch_size: 128
         cfg_path: pdf_extract_kit/configs/unimernet.yaml
         model_path: models/MFR/unimernet_tiny
     ocr:
       model: ocr_ppocr
       model_config:
         lang: ch
         show_log: True
         det_model_dir: models/OCR/PaddleOCR/det/ch_PP-OCRv4_det
         rec_model_dir: models/OCR/PaddleOCR/rec/ch_PP-OCRv4_rec
         det_db_box_thresh: 0.3

- inputs/outputs: Define the input file path and the output path, respectively.
- visualize: Whether to visualize the model results. Visualized results are saved in the outputs directory.
- merge2markdown: Whether to merge the results into a Markdown document. Only simple single-column text, concatenated from top to bottom, is supported. For Markdown conversion of documents with more complex layouts, please refer to `MinerU